Skip to content

Tests | Flakiness improvements to XEventsTracingTest#4262

Open
edwardneal wants to merge 3 commits into
dotnet:mainfrom
edwardneal:tests/xevent-reliability
Open

Tests | Flakiness improvements to XEventsTracingTest#4262
edwardneal wants to merge 3 commits into
dotnet:mainfrom
edwardneal:tests/xevent-reliability

Conversation

@edwardneal
Copy link
Copy Markdown
Contributor

Description

This performs some reliability improvements to XEventsTracingTest. We can see intermittent failures, most of which are the result of them being killed to resolve deadlocks.

I wouldn't normally expect the original SQL statements (sp_help and SELECT @@VERSION) to encounter that, but sp_help returns the list of objects in the current database; perhaps if objects are being created by another CI run, this becomes an issue.

To break the dependency on server state, I've replaced both method calls with a call to a new SP which just runs SELECT 1, and a SQL statement which runs SELECT 1 directly.

During investigation it also became clear that an activity ID is recorded (and the test is capable of passing) even when a deadlock or other SQL error occurs. I've made this explicit via a new test case. Technically this means that we could simply broaden the error handling to swallow all SqlException errors when executing the command and the test would continue to pass. I've not done this because I'm a little more concerned that we're encountering deadlocks on comparatively simple statements, and don't want to mask any underlying issue.

Besides this, there are a few QoL improvements:

  • One test method is now refactored into three coherent test cases. This also makes the reflection work to retrieve the test case name unnecessary.
  • Added XML documentation which describes the tests (and the reason why they're marked as flaky.)
  • A new FlushResultSet helper was added in an earlier PR, and we now use it.

Issues

Contributes to #3453.

Testing

One new test case. All three XEvents tests pass, but I don't think I can easily reproduce the same kind of load. Could someone run CI against this PR multiple times at peak load please?

* Refactor one test method into three distinct test cases.
* Add reasons for the tests being marked as flaky.
* Switch call to 'sp_help' to a new, simpler SP.
* Switch execution of 'SELECT @@Version' to a simpler 'SELECT 1' statement.
* Use new FlushResultSet helper.
* Simplify XEvent session name generation.
* Add test case to verify that an activity ID is recorded in the extended event when the SQL statement throws an error.
@edwardneal edwardneal requested a review from a team as a code owner May 7, 2026 04:59
@github-project-automation github-project-automation Bot moved this to To triage in SqlClient Board May 7, 2026
@cheenamalhotra cheenamalhotra moved this from To triage to In review in SqlClient Board May 7, 2026
@cheenamalhotra cheenamalhotra added the Area\Tests Issues that are targeted to tests or test projects label May 7, 2026
@cheenamalhotra cheenamalhotra added this to the 7.1.0-preview2 milestone May 7, 2026
@cheenamalhotra
Copy link
Copy Markdown
Member

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal
Copy link
Copy Markdown
Contributor Author

That's looking positive for the first run, although it looks like the LocalAppContextSwitchesTest.TestDefaultAppContextSwitchValues unit test is unrelatedly flaky - possibly because it's running in parallel with other unit tests which change the values. I'd appreciate a few more runs of this.

@mdaigle
Copy link
Copy Markdown
Contributor

mdaigle commented May 12, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal
Copy link
Copy Markdown
Contributor Author

No deadlocks on the second run, but it looks like there are a few problems with the number of sessions open against a single Azure SQL instance - there's a hard limit of 128MB session memory:

Microsoft.Data.SqlClient.SqlException : Operation failed. Operation will cause database event session memory to exceed allowed limit. Event session memory may be released by stopping active sessions or altering session memory options. Check sys.dm_xe_database_sessions for active sessions that can be stopped or altered.

We'd originally increased the MAX_MEMORY from 4MB to 16MB in an attempt to tackle the deadlocks, but perhaps these have been addressed at source. I've cut this back to 4MB, can someone run the pipelines a few times please?

$"WITH (" +
$" {duration} " +
$" MAX_MEMORY=16 MB," +
$" MAX_MEMORY=4 MB," +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a one-line comment next to MAX_MEMORY=4 MB explaining why this value matters? The PR description has great context about Azure SQL's 128 MB global cap and why raising it caused parallel-CI failures — but that context lives in the PR, not the code. Without the comment, the next person debugging XEvent buffer pressure will likely bump it back up to "fix" their immediate symptom and re-trigger the original problem. Even a short // Don't raise — Azure SQL caps total XEvent session memory at 128MB. See PR #4262. would protect us.

@priyankatiwari08
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal
Copy link
Copy Markdown
Contributor Author

I've added an explanatory comment and pushed, thanks @priyankatiwari08. The last pipeline run passed, so I'd appreciate a few more at peak load before I unmark them as flaky.

@mdaigle
Copy link
Copy Markdown
Contributor

mdaigle commented May 12, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@edwardneal
Copy link
Copy Markdown
Contributor Author

All tests passed on those two runs, which looks encouraging - particularly since this run had one or two failures which seemed to be an indicator that the remote server is under load. Thanks.

Could you run this twice more at peak please?

@mdaigle
Copy link
Copy Markdown
Contributor

mdaigle commented May 12, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area\Tests Issues that are targeted to tests or test projects

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants