CI: run all system-tests scenarios on PRs by p-datadog · Pull Request #5783 · DataDog/dd-trace-rb

p-datadog · 2026-05-19T20:26:06Z

What does this PR do?

Set skip_empty_scenarios: false unconditionally in .github/workflows/system-tests.yml so PRs exercise every System Tests scenario, not only push-to-master / schedule events.

Motivation:

Today, line 90 toggles skip_empty_scenarios to true for PRs as a CI-cost optimization. When every test in a scenario is marked missing_feature (or bug / skip / xfail) and only an @auxiliary_test such as tests/schemas/test_schemas.py::Test_DdtraceSchemas::test_library would remain, system-tests' conftest.py empties the item list (items[:] = [], conftest.py:365-367) and the scenario boot is short-circuited entirely. The schema validation never runs on PRs.

The result: latent payload-schema bugs land on master undetected. Recent example — PR #5717 merged green; the same code immediately failed the DEBUGGER_SYMDB schema check on master with 'type' is a required property on instance ... symbols[N]. Fixed in PR #5775. Investigation: confirmed with @charles (devX) that this exact PR/push divergence is the cause.

Removing the optimization closes that hole at moderate CI cost.

Cost — measured on PR #5717's diff

Compared the System Tests workflow runs at the same commit content: PR run 26038527628 (skip_empty_scenarios: true) vs. master run 26044854192 (skip_empty_scenarios: false).

Headline

Metric	PR (skip=true)	Master (skip=false)	Δ
Wall-clock (first job → last job)	20.0 min	21.2 min	+1.2 min (+6%)
Total job-seconds (sum across all ~280 parallel jobs)	78,063 s = 1,301 min	89,945 s = 1,499 min	+11,882 s = +198 min (+15%)
Job count	282	281	-1

Where the cost lives — `End-to-end #2` batch

The #2 batch carries the DEBUGGER_SYMDB scenario (per #6962 verification). Same containers, same code, 4–5× longer when the scenario isn't suppressed:

Job	PR	Master	Δ
`End-to-end #2 / rack 2`	116s	605s	+489s
`End-to-end #2 / rails42 2`	105s	587s	+482s
`End-to-end #2 / rails52 2`	130s	578s	+448s
`End-to-end #2 / rails61 2`	131s	628s	+497s
`End-to-end #2 / rails72 2`	139s	597s	+458s
`End-to-end #2 / rails80 2`	110s	613s	+503s
`End-to-end #2 / sinatra14 2`	99s	627s	+528s
`End-to-end #2 / sinatra22 2`	128s	609s	+481s
`End-to-end #2 / sinatra32 2`	109s	582s	+473s
`End-to-end #2 / sinatra41 2`	125s	588s	+463s
`End-to-end #2 / uds-rails 2`	115s	593s	+478s
`End-to-end #2 / uds-sinatra 2`	112s	617s	+505s
#2 batch subtotal	1,419 s	7,224 s	+5,805 s ≈ +97 min (~½ of total Δ)

The remaining ~100 min of delta is spread across batches #3, #6, #7, #9, #12, #14 — every batch that contained a previously-suppressed scenario on the PR side.

The actually-failing jobs ran in essentially the same wall time

Schema check selection added one test per job, costing seconds, not minutes:

Job	PR	Master	Δ
`End-to-end #10 / uds-rails 10`	264s	262s	-2s
`End-to-end #19 / rails72 19`	367s	382s	+15s
`End-to-end #15 / rails80 15`	424s	418s	-6s

Bottom line

Flipping the toggle for PRs costs about +1 minute of wall-clock and +200 minutes of runner-time per System Tests run. Cheap on developer feedback latency, ~15% more billable runner minutes per System Tests invocation. Auxiliary schema checks now actually run on PRs.

Change log entry

None.

Additional Notes:

The other comparable lever (option B from the investigation) is to keep this toggle and instead ensure every Ruby scenario has at least one non-missing_feature test in manifests/ruby.yml. That works case-by-case but reopens the same trap for every future scenario whose only fired-in-runtime artifact is upload payloads (e.g., the next debugger/AppSec/RUM upload). This PR closes the trap by default.
Other tracers keep the optimization; this change is dd-trace-rb-only.

How to test the change?

This PR's own System Tests run will execute every scenario (no skip_empty_scenarios-driven deselection). Watch wall-clock and the schema-check phase.

datadog-datadog-prod-us1-2 · 2026-05-19T20:45:41Z

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 97.09% (-0.01%)

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 353c536 | Docs | Datadog PR Page | Give us feedback!}

Set `skip_empty_scenarios: false` unconditionally instead of only on push-to-master / schedule events. PRs now exercise every scenario, so auxiliary checks (e.g. `Test_DdtraceSchemas::test_library`) catch schema regressions before merge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cbeauchesne

Address review comment from @cbeauchesne: the called workflow's input has default: false, so the explicit value is unnecessary. Verified against DataDog/system-tests workflow at the pinned ref (7711306e344b3cfb843eae940774d2334c434e86): the skip_empty_scenarios input is declared with default: false. Behavior is unchanged. - Removed in .github/workflows/system-tests.yml:90 Verified yamllint --strict passes. actionlint not available locally.

pr-commenter · 2026-05-27T21:29:49Z

Benchmarks

Benchmark execution time: 2026-05-27 21:29:46

Comparing candidate commit 353c536 in PR branch ci/run-all-system-tests-on-prs with baseline commit f87530c in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

p-datadog added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label May 19, 2026

dd-octo-sts Bot added the dev/github Github repository maintenance and automation label May 19, 2026

cbeauchesne reviewed May 20, 2026

View reviewed changes

Comment thread .github/workflows/system-tests.yml Outdated

cbeauchesne approved these changes May 20, 2026

View reviewed changes

p-datadog marked this pull request as ready for review May 20, 2026 13:12

p-datadog requested a review from a team as a code owner May 20, 2026 13:12

p-datadog force-pushed the ci/run-all-system-tests-on-prs branch from cc408fd to 0839782 Compare May 20, 2026 17:35

vpellan approved these changes May 21, 2026

View reviewed changes

p-ddsign and others added 2 commits May 27, 2026 17:03

p-datadog force-pushed the ci/run-all-system-tests-on-prs branch from 0839782 to 353c536 Compare May 27, 2026 21:04

p-datadog merged commit a5059f2 into master May 28, 2026
585 checks passed

p-datadog deleted the ci/run-all-system-tests-on-prs branch May 28, 2026 17:31

dd-octo-sts Bot added this to the 2.35.0 milestone May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: run all system-tests scenarios on PRs#5783

CI: run all system-tests scenarios on PRs#5783
p-datadog merged 2 commits into
masterfrom
ci/run-all-system-tests-on-prs

p-datadog commented May 19, 2026

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

pr-commenter Bot commented May 27, 2026

Explanation

More details about the CI and significant changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

p-datadog commented May 19, 2026

Cost — measured on PR #5717's diff

Headline

Where the cost lives — End-to-end #2 batch

The actually-failing jobs ran in essentially the same wall time

Bottom line

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

pr-commenter Bot commented May 27, 2026

Benchmarks

Explanation

More details about the CI and significant changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Where the cost lives — `End-to-end #2` batch

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 •

edited

Loading