Tests | Flakiness improvements to XEventsTracingTest #4262
Conversation
* Refactor one test method into three distinct test cases.
* Add reasons for the tests being marked as flaky.
* Switch call to `sp_help` to a new, simpler SP.
* Switch execution of `SELECT @@Version` to a simpler `SELECT 1` statement.
* Use new `FlushResultSet` helper.
* Simplify XEvent session name generation.
* Add test case to verify that an activity ID is recorded in the extended event when the SQL statement throws an error.
/azp run

Azure Pipelines successfully started running 2 pipeline(s).
That's looking positive for the first run, although it looks like the …

/azp run

Azure Pipelines successfully started running 2 pipeline(s).
No deadlocks on the second run, but it looks like there are a few problems with the number of sessions open against a single Azure SQL instance: there's a hard limit of 128MB of total XEvent session memory.

We'd originally increased the MAX_MEMORY from 4MB to 16MB in an attempt to tackle the deadlocks, but perhaps those have been addressed at source. I've cut this back to 4MB; can someone run the pipelines a few times please?
```diff
  $"WITH (" +
  $"  {duration} " +
- $"  MAX_MEMORY=16 MB," +
+ $"  MAX_MEMORY=4 MB," +
```
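For context, the string being built here ultimately produces a database-scoped XEvent session definition along these lines. This is an illustrative sketch only: the session name, the event added, and the other `WITH` options are placeholders, not the test's actual values.

```sql
-- Illustrative XEvent session; the session name and event here are
-- placeholders, not the test's actual definitions.
-- Don't raise MAX_MEMORY: Azure SQL caps total XEvent session memory
-- at 128MB, shared across all sessions on the instance, so parallel
-- CI runs can exhaust it. See PR #4262.
CREATE EVENT SESSION [xevents_trace_test] ON DATABASE
ADD EVENT sqlserver.rpc_completed
WITH (
    MAX_MEMORY = 4 MB,
    MAX_DISPATCH_LATENCY = 1 SECONDS,
    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS
);
```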
Could we add a one-line comment next to MAX_MEMORY=4 MB explaining why this value matters? The PR description has great context about Azure SQL's 128 MB global cap and why raising it caused parallel-CI failures — but that context lives in the PR, not the code. Without the comment, the next person debugging XEvent buffer pressure will likely bump it back up to "fix" their immediate symptom and re-trigger the original problem. Even a short // Don't raise — Azure SQL caps total XEvent session memory at 128MB. See PR #4262. would protect us.
/azp run

Azure Pipelines successfully started running 2 pipeline(s).

I've added an explanatory comment and pushed, thanks @priyankatiwari08. The last pipeline run passed, so I'd appreciate a few more at peak load before I unmark the tests as flaky.
/azp run

Azure Pipelines successfully started running 2 pipeline(s).

All tests passed on those two runs, which looks encouraging, particularly since this run had one or two unrelated failures which seemed to indicate that the remote server was under load. Thanks. Could you run this twice more at peak please?
/azp run

Azure Pipelines successfully started running 2 pipeline(s).
Description
This PR makes some reliability improvements to XEventsTracingTest. We see intermittent failures, most of which are the result of the tests' queries being killed to resolve deadlocks.
I wouldn't normally expect the original SQL statements (`sp_help` and `SELECT @@VERSION`) to encounter that, but `sp_help` returns the list of objects in the current database; perhaps if objects are being created by another CI run, this becomes an issue. To break the dependency on server state, I've replaced both calls: one with a call to a new SP which just runs `SELECT 1`, and the other with a SQL statement which runs `SELECT 1` directly.

During investigation it also became clear that an activity ID is recorded (and the test is capable of passing) even when a deadlock or other SQL error occurs. I've made this explicit via a new test case. Technically this means that we could simply broaden the error handling to swallow all `SqlException` errors when executing the command, and the test would continue to pass. I've not done this because I'm a little concerned that we're encountering deadlocks on comparatively simple statements, and I don't want to mask any underlying issue.

Besides this, there are a few QoL improvements:
* The `FlushResultSet` helper was added in an earlier PR, and we now use it.

Issues
Contributes to #3453.
Testing
One new test case. All three XEvents tests pass, but I don't think I can easily reproduce the same kind of load. Could someone run CI against this PR multiple times at peak load please?
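For reference, the minimal stored procedure described in the description could be as simple as the following sketch. The procedure name is hypothetical; the actual name in the PR may differ.

```sql
-- Hypothetical minimal SP: unlike sp_help, it has no dependency on
-- database state, so objects created or dropped by concurrent CI runs
-- can't affect its result set.
CREATE PROCEDURE dbo.select_one
AS
BEGIN
    SET NOCOUNT ON;
    SELECT 1;
END
```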