Modify SqlStreamingXml XmlWriter to internally use direct XmlReader parsing#3974

Open
jimhblythe wants to merge 7 commits into dotnet:main from jimhblythe:issue-1877

Conversation

@jimhblythe

Modify SqlStreamingXml XmlWriter to internally use a MemoryStream instead of a StringBuilder.

Note: UTF8Encoding(false) addition in s_writerSettings is consistent with prior default used within StringWriter/StringBuilder

Issues

Fixes #1877 to be O(n)

Testing

Added 2 new Manual tests: one to ensure linear behavior for a single large node, and a secondary validation for multiple nodes (screenshot omitted).

…tead of a StringBuilder. (dotnet#1877)

Note: UTF8Encoding(false) addition in s_writerSettings is consistent with prior default used within StringWriter/StringBuilder
@jimhblythe
Author

@dotnet-policy-service agree

Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
@mdaigle mdaigle moved this from To triage to Needs Response in SqlClient Board Feb 24, 2026
…ple elements

Enhance comments within SqlStreamingXml
Extend Manual tests to fully cover GetChars
WriteXmlElement includes uncovered paths not accessible for SQL XML column types which normalize Whitespace, CDATA, EntityReference, XmlDeclaration, ProcessingInstruction, DocumentType, and Comment node types
@paulmedynski
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@paulmedynski paulmedynski self-assigned this Feb 27, 2026
@paulmedynski paulmedynski moved this from Needs Response to In review in SqlClient Board Feb 27, 2026
Contributor

@paulmedynski paulmedynski left a comment

Looks great - thanks for this optimization and expanded test coverage! Just one question.

Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
@paulmedynski
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

paulmedynski previously approved these changes Feb 27, 2026
@paulmedynski paulmedynski added this to the 7.0.0 milestone Feb 27, 2026
@codecov

codecov Bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 97.57576% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.82%. Comparing base (a68e00f) to head (9e62ba5).
⚠️ Report is 65 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...qlClient/src/Microsoft/Data/SqlClient/SqlStream.cs | 97.57% | 4 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a68e00f) and HEAD (9e62ba5). Click for more details.

HEAD has 6 uploads less than BASE
| Flag | BASE (a68e00f) | HEAD (9e62ba5) |
|---|---|---|
| netfx | 2 | 0 |
| netcore | 2 | 0 |
| addons | 2 | 0 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3974      +/-   ##
==========================================
- Coverage   75.22%   65.82%   -9.40%     
==========================================
  Files         266      275       +9     
  Lines       42932    65896   +22964     
==========================================
+ Hits        32294    43379   +11085     
- Misses      10638    22517   +11879     
| Flag | Coverage Δ |
|---|---|
| PR-SqlClient-Project | 65.82% <97.57%> (?) |
| addons | ? |
| netcore | ? |
| netfx | ? |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

mdaigle added a commit that referenced this pull request Mar 5, 2026
Add tests that verify SqlDataReader.GetChars correctly handles non-ASCII
characters when reading XML columns with SequentialAccess:

- GetChars_NonAsciiContent: Latin accented characters (2-byte UTF-8)
- GetChars_NonAsciiContent_BulkRead: Bulk read path with accented chars
- GetChars_CjkContent: CJK characters (3-byte UTF-8)

These tests establish a baseline for correct behavior on main before
PR #3974 (issue #1877) refactors SqlStreamingXml internals.
@mdaigle
Contributor

mdaigle commented Mar 5, 2026

@jimhblythe can you take a look at the tests in #4005 and #4008? They seem to suggest that this introduces a regression for non-ascii characters. You can check the pipeline results to see the test outcomes. I'd like to see some tests like that included in this PR as well, please 🙏

@jimhblythe
Author

> @jimhblythe can you take a look at the tests in #4005 and #4008? They seem to suggest that this introduces a regression for non-ascii characters. You can check the pipeline results to see the test outcomes. I'd like to see some tests like that included in this PR as well, please 🙏

I will additionally include a fourth test to verify surrogate pairs:

        [ConditionalFact(typeof(DataTestUtility), nameof(DataTestUtility.AreConnStringsSetup))]
        public static void GetChars_SurrogatePairContent()
        {
            SqlConnection connection = new(DataTestUtility.TCPConnectionString);
            // Surrogate Pair characters: 4 bytes each in UTF-8
            string xml = "<data>\U0001F600\U0001F525\U0001F680</data>";

Actual: "<data>😀🔥🚀</data>"

The test shows these worked with prior code.

…reamingXml, reconstructing XML fragments as strings and streaming them char-by-char. Improves efficiency, reduces allocations, and fixes non-ASCII and surrogate pairs.

Add comprehensive unit tests for XML edge cases (non-ASCII, surrogate pairs, comments, CDATA, attributes, namespaces, etc.). Refactor existing tests for clarity and better handling of disposables.
(AI assist using ChatGPT to better consider edge cases)
@jimhblythe
Author

jimhblythe commented Mar 8, 2026

I had to use an alternative, as MemoryStream was not the right choice. I optimized further by bypassing XmlWriter, which used an additional buffer; the latest commit saves more memory.

Significantly more tests added (screenshot omitted).

Is exact replication of the prior logic desired, or should the output be more consistent with how SSMS would show an XML column?
  • Prior logic would occasionally convert < back to &lt; (e.g. within CDATA)
  • Prior logic would also expand \n to \r\n

Coverage is complete except for null-reference guards that cannot be exercised through SqlReader, which ensures values are provided (screenshot omitted).

@paulmedynski paulmedynski modified the milestones: 7.0.0, 7.0.1, 7.1.0-preview1 Mar 9, 2026
@jimhblythe
Author

Regarding this from my last comment: Is exact replication of prior logic desired or more consistent with how SSMS would show an XML column?

I am reworking the logic and test cases so that GetChars responds the same when streaming an xml-typed column as it does when reading Convert(NVarChar(max), xmlColumn); each test case will verify reading as both data types.

@paulmedynski
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

…to nvarchar data type. Includes 8 notable tests where GetChars returns different output.

Nvarchar skipped tests summarized as these 3 differences:
* transformation encodes space as '&#x20;' (possibly a bug in the nvarchar handling)
* space prior to the self-closing '/>' (an existing difference)
* nvarchar returns the length of the entire field; xml short circuits and returns -1 (an existing difference; possibly a bug in xml handling compared to method documentation)

Regression to XmlWriter/StringBuilder version of SqlStreamingXml
(that is, these are expected failures for the xml data type, but pass with latest code)
* GetChars_CDATASection
    Expected: "<data>some <encoded> content</data>"
    Actual:   "<data>some &lt;encoded&gt; content<"
* GetChars_EntityReferences_Normalized
    Expected: "<>&"'"
    Actual:   "&lt;&"
* GetChars_WhitespaceAndSignificantWhitespace
    Expected: "<root xml:space="preserve">  \t\n  </root>"
    Actual:   "<root xml:space="preserve">  \t\r\n  </root"
* Linear_SingleNode
    Execution time did not follow linear scale: 1MB=834.1105ms vs. 5MB=10184.0527ms
@jimhblythe
Author

@mdaigle & @paulmedynski, I appreciate the time required to review, and I wanted to do more due diligence of my own in identifying the regression deltas. It occurred to me that GetChars for an xml data type should behave much like it does for the nvarchar data type, rather than matching how SSMS displays the xml. I extended the unit tests to better document xml handling compared to nvarchar. This includes 8 notable tests where GetChars returns different output.

Nvarchar skipped tests summarized as these 3 differences:

  • transformation encodes space as '&#x20;' (possibly a bug in the nvarchar handling)
  • space prior to the self-closing '/>' (an existing difference)
  • nvarchar returns the length of the entire field; xml short circuits and returns -1 (an existing difference; possibly a bug in xml handling compared to method documentation)

Regression to prior XmlWriter/StringBuilder version of SqlStreamingXml
(that is, these are expected failures for the xml data type, but pass with latest code)

  • GetChars_CDATASection
    Expected: "<data>some <encoded> content</data>"
    Actual: "<data>some &lt;encoded&gt; content<"
  • GetChars_EntityReferences_Normalized
    Expected: "<>&"'"
    Actual: "&lt;&"
  • GetChars_WhitespaceAndSignificantWhitespace
    Expected: "<root xml:space="preserve"> \t\n </root>"
    Actual: "<root xml:space="preserve"> \t\r\n </root"
  • Linear_SingleNode
    Execution time did not follow linear scale: 1MB=834.1105ms vs. 5MB=10184.0527ms
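
For readers following the Linear_SingleNode numbers: 5 times the data took roughly 12 times as long with the prior implementation, which is well above linear growth. A minimal sketch of such a ratio check follows. The class name, method name, and the tolerance of 2.0 are hypothetical and are not taken from the PR's test code.

```csharp
using System;

// Hypothetical illustration only; the PR's actual Linear_SingleNode test may differ.
public static class LinearityCheckSketch
{
    // Returns true when elapsed time grew no faster than 'tolerance' times
    // the growth in input size between the small and large runs.
    public static bool IsRoughlyLinear(
        double smallSize, double smallMs,
        double largeSize, double largeMs,
        double tolerance)
    {
        double sizeRatio = largeSize / smallSize; // e.g. 5 MB / 1 MB
        double timeRatio = largeMs / smallMs;     // how much slower the large run was
        return timeRatio <= sizeRatio * tolerance;
    }

    public static void Main()
    {
        // Timings quoted above for the prior XmlWriter/StringBuilder version.
        bool linear = IsRoughlyLinear(1, 834.1105, 5, 10184.0527, 2.0);
        Console.WriteLine(linear); // prints "False"
    }
}
```

A tolerance above 1.0 leaves room for timing noise; the right value depends on how stable the test environment is.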

@Wraith2
Contributor

Wraith2 commented Mar 23, 2026

It might be worth seeing if the SSMS team has requirements that those things be stable; that'd be an internal MS outreach.

@jimhblythe
Author

jimhblythe commented Mar 23, 2026

> It might be worth seeing if the SSMS team has requirements that those things be stable; that'd be an internal MS outreach.

For quick reference of some of the SSMS (v22.3.0) handling:
Self-closing: GetChars is consistent with SSMS for both data types: nvarchar does not include a space before '/>' and xml does.

DECLARE @xmlParam xml = N'<data />'
SELECT Convert(nvarchar(max), @xmlParam)
SELECT @xmlParam
<data/>

(1 row affected)

<data />

(1 row affected)

Less than/Greater than: this one might need a change in the new logic; AI suggests SQL should return <> but SSMS is returning the escaped version for both data types.

DECLARE @xmlParam xml = N'<data>&lt;&gt;&amp;&quot;&apos;</data>'
SELECT Convert(nvarchar(max), @xmlParam)
SELECT @xmlParam
<data>&lt;&gt;&amp;"'</data>

(1 row affected)

<data>&lt;&gt;&amp;"'</data>

(1 row affected)

The use within CDATA is more questionable: my understanding is that it should not encode, but it does. Since SQL Server normalizes out the CDATA tag, I think this has to be consistent with the above.

DECLARE @xmlParam xml = N'<data><![CDATA[some <encoded> content]]></data>'
SELECT Convert(nvarchar(max), @xmlParam)
SELECT @xmlParam
<data>some &lt;encoded&gt; content</data>

(1 row affected)

<data>some &lt;encoded&gt; content</data>

(1 row affected)

Linefeed: should not add Char(13) to the Char(10)
(the xml data type cannot be verified through SSMS without converting to a char data type)
(this example also shows the &#x20; encoding, but only for the last space; it is fully materialized where the input has a space at column 33)

DECLARE @xmlParam xml = '<root xml:space="preserve">  ' + NChar(9) + NChar(10) + '  </root>'
SELECT Convert(nvarchar(max), @xmlParam)
SELECT @xmlParam

PRINT Ascii(SubString(Convert(nvarchar(max), @xmlParam), 29, 1))
PRINT Ascii(SubString(Convert(nvarchar(max), @xmlParam), 30, 1))
PRINT Ascii(SubString(Convert(nvarchar(max), @xmlParam), 31, 1))
PRINT Ascii(SubString(Convert(nvarchar(max), @xmlParam), 32, 1))
PRINT SubString(Convert(nvarchar(max), @xmlParam), 33, 6) -- pulling 6 chars
PRINT Ascii(SubString(Convert(nvarchar(max), @xmlParam), 33, 1))
--original string without xml conversion
PRINT Ascii(SubString('<root xml:space="preserve">  ' + NChar(9) + NChar(10) + '  </root>', 33, 1))
<root xml:space="preserve">  	
 &#x20;</root>

(1 row affected)

<root xml:space="preserve">  	
  </root>

(1 row affected)

32
9
10
32
&#x20;
38
32

@jimhblythe jimhblythe changed the title Modify SqlStreamingXml XmlWriter to internally use a MemoryStream Modify SqlStreamingXml XmlWriter to internally use direct XmlReader parsing Mar 24, 2026
@jimhblythe
Author

@paulmedynski, @mdaigle , @Wraith2
Being new to this repo, I am not sure if this is just pending final review or if there is further action I should take.
If there is additional action for me, please let me know - thanks

@paulmedynski
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@paulmedynski
Contributor

Hey @jimhblythe - Nothing further at the moment. We're just tied up with other tasks. I have tentatively put this into the next preview release, but it looks like a substantial change to review. It may get bumped to preview 2.

Contributor

@benrr101 benrr101 left a comment

Thanks for the submission - this would be a great performance improvement to take!

Definitely want to see the tests rewritten as unit tests, and I'd really like to avoid rewriting xml entities escaping code. But feel free to push back if I misunderstood something - just be prepared to document why it's being done that way :)

Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
/// corresponding entity references.</param>
/// <returns>A string with special XML characters replaced by their corresponding entity references. If the input string
/// is null or empty, an empty string is returned.</returns>
private string EscapeAttribute(string value)
Contributor

Rewriting code that does escaping is always a code smell to me - there's gotta be a built-in library to handle this. Or is there a really really good reason to write our own?

Author

XmlWriter is the logical alternative, but was excluded due to excess string allocations. This version is also more specific to SQL Server XML handling.
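
To illustrate what "more specific to SQL Server XML handling" could look like, here is a rough sketch based only on the SSMS output quoted earlier in this thread, where text content came back as `&lt;&gt;&amp;"'` (quotes left as-is). The class and method names are hypothetical, and this is not the `EscapeAttribute` method from the PR.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the escaping code from this PR.
public static class XmlTextEscapeSketch
{
    // Escapes only '<', '>' and '&' in element text, leaving quotes untouched,
    // which matches the SSMS output shown earlier in the conversation.
    public static string EscapeText(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return string.Empty;
        }

        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '&': sb.Append("&amp;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EscapeText("<>&\"'")); // prints: &lt;&gt;&amp;"'
    }
}
```

Attribute values would additionally need the quote character escaped, so a real implementation would likely treat text and attributes separately.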

private bool TryReadNextChar(out char c)
{
// Deliver pending high surrogate first
if (_pendingHighSurrogate.HasValue)
Contributor

This seems really low-level and likely to be handled by some built-in library, right? It feels like we shouldn't have to deal with this - something in Encoding should be able to handle it?

Author

The edge case deals with GetChars being called with a boundary splitting a surrogate pair. Encoding libraries would expect to always get both chars of the pair.
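
The boundary case described here can be reproduced without a database. The sketch below is a hypothetical stand-in for a GetChars-style reader (it is not the SqlStreamingXml code); it shows a 7-char buffer ending on a lone high surrogate, and that the caller still recovers the original string once the next chunk delivers the low surrogate.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the SqlStreamingXml code from this PR.
public static class SurrogateBoundarySketch
{
    // Mimics a GetChars-style call: copies up to buffer.Length chars from
    // 'source' starting at 'position' and returns how many were copied.
    public static int ReadChunk(string source, int position, char[] buffer)
    {
        int count = Math.Min(buffer.Length, source.Length - position);
        if (count <= 0)
        {
            return 0;
        }
        source.CopyTo(position, buffer, 0, count);
        return count;
    }

    // Reads the whole string in fixed-size chunks and stitches it back together.
    public static string ReadAll(string source, int chunkSize)
    {
        var result = new StringBuilder();
        char[] buffer = new char[chunkSize];
        int position = 0;
        int read;
        while ((read = ReadChunk(source, position, buffer)) > 0)
        {
            result.Append(buffer, 0, read);
            position += read;
        }
        return result.ToString();
    }

    public static void Main()
    {
        string xml = "<data>\U0001F600</data>";
        char[] first = new char[7];
        ReadChunk(xml, 0, first);

        // "<data>" fills indexes 0-5, so index 6 holds only the high surrogate.
        Console.WriteLine(char.IsHighSurrogate(first[6])); // prints "True"
        Console.WriteLine(ReadAll(xml, 7) == xml);          // prints "True"
    }
}
```

Because the first chunk ends mid-pair, any step that encodes or decodes each chunk in isolation would see an incomplete pair, which is why the reader has to carry the pending half across calls.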

Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
/// <summary>
/// Escapes special characters in the provided string to ensure it is safe for use in XML attributes.
/// </summary>
/// <remarks><![CDATA[
Contributor

No need for a CDATA block here. These are private methods, we don't need perfectly formatted xmldocs here.

Author

CDATA is required to prevent CS1570 since < is not escaped


namespace Microsoft.Data.SqlClient.ManualTesting.Tests
{
public static class SqlStreamingXmlTest
Contributor

I appreciate the attention to creating a bunch of new tests for this. But ... I'm gonna have to ask that these be reworked to be unit tests, if at all possible. The SqlStream class is clean enough that it shouldn't be necessary to make connections to a live server to test its behavior.

Author

@benrr101 I will review options for this as it is a sound goal. On the surface it would seem that this is just xml vs. string handling, but SqlStreamingXml handles nuances presented by SQL Server itself modifying the stream (current understanding anyway). This is evidenced by testing columns as nvarchar vs xml SQL data types.

Many of the tests use SqlDbType.Xml as a SqlParameter so there might be a unit test approach if the delta in handling is not in SQL Server itself. I will modify tests to see whether I can replicate without a connection.

/// Parameterize test data type scenarios using the value "xml".
/// This ensures that GetChars method for XML only behaves consistently.
/// </summary>
public static TheoryData<string> TheoryData_DataType_XML_Only => new()
Contributor

What's the benefit of having a theory with only one case?

Author

The overall testing within SqlStreamingXmlTest using TheoryData_DataType was to fully compare XML and nvarchar(max) behavior.

There are 8 exceptions where XML returned via nvarchar deviates. I chose to keep the Theory approach so I could split the test and document the difference with Skip verbiage. For example:

        [MemberData(nameof(TheoryData_DataType_XML_Only))]
        [MemberData(nameof(TheoryData_DataType_NVarChar_Only), Skip = "Skip: The buffer is not required for nvarchar data type where it returns the length of the entire field.")]

@github-project-automation github-project-automation Bot moved this from In review to In progress in SqlClient Board May 12, 2026
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Comment thread src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlStream.cs Outdated
Contributor

@priyankatiwari08 priyankatiwari08 left a comment

Thanks for the perf work on #1877 — algorithmic fix seems to be in the right direction. Agree with @benrr101's asks; nothing material to add on top of his review, just two small nits below.

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

O(N^2) performance when reading XML with SequentialAccess

7 participants