Modify SqlStreamingXml XmlWriter to internally use direct XmlReader parsing#3974
Modify SqlStreamingXml XmlWriter to internally use direct XmlReader parsing#3974jimhblythe wants to merge 7 commits into
Conversation
…tead of a StringBuilder. (dotnet#1877) Note: UTF8Encoding(false) addition in s_writerSettings is consistent with prior default used within StringWriter/StringBuilder
|
@dotnet-policy-service agree |
…ple elements Enhance comments within SqlStreamingXml Extend Manual tests to fully cover GetChars WriteXmlElement includes uncovered paths not accessible for SQL XML column types which normalize Whitespace, CDATA, EntityReference, XmlDeclaration, ProcessingInstruction, DocumentType, and Comment node types
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
paulmedynski
left a comment
There was a problem hiding this comment.
Looks great - thanks for this optimization and expanded test coverage! Just one question.
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3974 +/- ##
==========================================
- Coverage 75.22% 65.82% -9.40%
==========================================
Files 266 275 +9
Lines 42932 65896 +22964
==========================================
+ Hits 32294 43379 +11085
- Misses 10638 22517 +11879
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add tests that verify SqlDataReader.GetChars correctly handles non-ASCII characters when reading XML columns with SequentialAccess: - GetChars_NonAsciiContent: Latin accented characters (2-byte UTF-8) - GetChars_NonAsciiContent_BulkRead: Bulk read path with accented chars - GetChars_CjkContent: CJK characters (3-byte UTF-8) These tests establish a baseline for correct behavior on main before PR #3974 (issue #1877) refactors SqlStreamingXml internals.
|
@jimhblythe can you take a look at the tests in #4005 and #4008? They seem to suggest that this introduces a regression for non-ascii characters. You can check the pipeline results to see the test outcomes. I'd like to see some tests like that included in this PR as well, please 🙏 |
I will additionally including a fourth test to verify surrogate pairs:
The test shows these worked with prior code. |
…reamingXml, reconstructing XML fragments as strings and streaming them char-by-char. Improves efficiency, reduces allocations, and fixes non-ASCII and surrogate pairs. Add comprehensive unit tests for XML edge cases (non-ASCII, surrogate pairs, comments, CDATA, attributes, namespaces, etc.). Refactor existing tests for clarity and better handling of disposables. (AI assist using ChatGPT to better consider edge cases)
|
Regarding this from my last comment: Is exact replication of prior logic desired or more consistent with how SSMS would show an XML column? I am reworking the logic and tests cases such that GetChars responds the same when streaming an Xml typed column as it does when reading Convert(NVarChar(max), xmlColumn) - each test case will verify reading as both data types. |
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
…to nvarchar data type. Includes 8 notable tests where GetChars returns different output.
Nvarchar skipped tests summarized as these 3 differences:
* transformation encodes space as ' ' (possibly a bug in the nvarchar handling)
* space prior to the self-closing '/>' (an existing difference)
* nvarchar returns the length of the entire field; xml short circuits and returns -1 (an existing difference; possibly a bug in xml handling compared to method documentation)
Regression to XmlWriter/StringBuilder version of SqlStreamingXml
(that is, these are expected failures for the xml data type, but pass with latest code)
* GetChars_CDATASection
Expected: "<data>some <encoded> content</data>"
Actual: "<data>some <encoded> content<"
* GetChars_EntityReferences_Normalized
Expected: "<>&"'"
Actual: "<&"
* GetChars_WhitespaceAndSignificantWhitespace
Expected: "<root xml:space="preserve"> \t\n </root>"
Actual: "<root xml:space="preserve"> \t\r\n </root"
* Linear_SingleNode
Execution time did not follow linear scale: 1MB=834.1105ms vs. 5MB=10184.0527ms
|
@mdaigle & @paulmedynski, I appreciate the time required to review and I wanted to do more self applied due diligence in identifying the regression deltas. It occurred to meet that the GetChars for an xml data type should act very similar to the nvarchar data type rather than how SSMS display the xml. I extended the unit tests to better document xml handling compared to nvarchar. This includes 8 notable tests where GetChars returns different output. Nvarchar skipped tests summarized as these 3 differences:
Regression to prior XmlWriter/StringBuilder version of SqlStreamingXml
|
|
It might be worth seeing if the SMSS team have requirements that those things be stable, that's be an internal MS outreach. |
For quick reference of some of the SSMS (v22.3.0) handling: Less than/Greater than: this one might need a change in the new logic; AI suggests SQL should return But the use within CDATA is more questionable as my understanding is it should not encode but does. Since SQL Server normalizes out the CDATA tag, I think this has to be consistent with above. Linefeed: should not add Char(13) to the Char(10) |
|
@paulmedynski, @mdaigle , @Wraith2 |
|
/azp run |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
Hey @jimhblythe - Nothing further at the moment. We're just tied up with other tasks. I have tentatively put this into the next preview release, but it looks like a substantial change to review. It may get bumped to preview 2. |
benrr101
left a comment
There was a problem hiding this comment.
Thanks for the submission - this would be a great performance improvement to take!
Definitely want to see the tests rewritten as unit tests, and I'd really like to avoid rewriting xml entities escaping code. But feel free to push back if I misunderstood something - just be prepared to document why it's being done that way :)
| /// corresponding entity references.</param> | ||
| /// <returns>A string with special XML characters replaced by their corresponding entity references. If the input string | ||
| /// is null or empty, an empty string is returned.</returns> | ||
| private string EscapeAttribute(string value) |
There was a problem hiding this comment.
Rewriting code that does escaping is always a code smell to me - there's gotta be a built-in library to handle this. Or is there a really really good reason to write our own?
There was a problem hiding this comment.
XmlWriter is the logical alternative, but was excluded due to excess string allocations. This version is also more specific to SQL Server XML handling.
| private bool TryReadNextChar(out char c) | ||
| { | ||
| // Deliver pending high surrogate first | ||
| if (_pendingHighSurrogate.HasValue) |
There was a problem hiding this comment.
This seems really low-level and likely to be handled by some built-in library, right? It feels like we shouldn't have to deal with this - something in Encoding should be able to handle it?
There was a problem hiding this comment.
The edge case deals with GetChars being called with a boundary splitting a surrogate pair. Encoding libraries would expect to always get both chars of the pair.
| /// <summary> | ||
| /// Escapes special characters in the provided string to ensure it is safe for use in XML attributes. | ||
| /// </summary> | ||
| /// <remarks><![CDATA[ |
There was a problem hiding this comment.
No need for a CDATA block here. These are private methods, we don't need perfectly formatted xmldocs here.
There was a problem hiding this comment.
CDATA is required to prevent CS1570 since < is not escaped
|
|
||
| namespace Microsoft.Data.SqlClient.ManualTesting.Tests | ||
| { | ||
| public static class SqlStreamingXmlTest |
There was a problem hiding this comment.
I appreciate the attention to creating a bunch of new tests for this. But ... I'm gonna have to ask that these be reworked to be unit tests, if at all possible. The SqlStream class is clean enough that it shouldn't be necessary to make connections to a live server to test its behavior.
There was a problem hiding this comment.
@benrr101 I will review options for this as it is a sound goal. On the surface it would seem that this is just xml vs. string handling, but SqlStreamingXml handles nuances presented by SQL Server itself modifying the stream (current understanding anyway). This is evidenced by testing columns as nvarchar vs xml SQL data types.
Many of the tests use SqlDbType.Xml as a SqlParameter so there might be a unit test approach if the delta in handling is not in SQL Server itself. I will modify tests to see whether I can replicate without a connection.
| /// Parameterize test data type scenarios using the value "xml". | ||
| /// This ensures that GetChars method for XML only behaves consistently. | ||
| /// </summary> | ||
| public static TheoryData<string> TheoryData_DataType_XML_Only => new() |
There was a problem hiding this comment.
What's the benefit of having a theory with only one case?
There was a problem hiding this comment.
The overall testing within SqlStreamingXmlTest using TheoryData_DataType was to fully compare XML and nvarchar(max) behavior.
There are 8 exceptions where XML returned via nvarchar deviates. I choose to keep the Theory approach so I could split the test and document with Skip verbiage - for example:
[MemberData(nameof(TheoryData_DataType_XML_Only))]
[MemberData(nameof(TheoryData_DataType_NVarChar_Only), Skip = "Skip: The buffer is not required for nvarchar data type where it returns the length of the entire field.")]


Modify SqlStreamingXml XmlWriter to internally use a MemoryStream instead of a StringBuilder.
Note: UTF8Encoding(false) addition in s_writerSettings is consistent with prior default used within StringWriter/StringBuilder
Issues
Fixes #1877 to be O(n)
Testing
Added 2 new Manual tests to ensure linear behavior for single large node, and secondary validation for multiple nodes
