Skip to content

CI: run all system-tests scenarios on PRs#5783

Merged
p-datadog merged 2 commits into
masterfrom
ci/run-all-system-tests-on-prs
May 28, 2026
Merged

CI: run all system-tests scenarios on PRs#5783
p-datadog merged 2 commits into
masterfrom
ci/run-all-system-tests-on-prs

Conversation

@p-datadog
Copy link
Copy Markdown
Member

What does this PR do?

Set skip_empty_scenarios: false unconditionally in .github/workflows/system-tests.yml so PRs exercise every System Tests scenario, not only push-to-master / schedule events.

Motivation:

Today, line 90 toggles skip_empty_scenarios to true for PRs as a CI-cost optimization. When every test in a scenario is marked missing_feature (or bug / skip / xfail) and only an @auxiliary_test such as tests/schemas/test_schemas.py::Test_DdtraceSchemas::test_library would remain, system-tests' conftest.py empties the item list (items[:] = [], conftest.py:365-367) and the scenario boot is short-circuited entirely. The schema validation never runs on PRs.

The result: latent payload-schema bugs land on master undetected. Recent example — PR #5717 merged green; the same code immediately failed the DEBUGGER_SYMDB schema check on master with 'type' is a required property on instance ... symbols[N]. Fixed in PR #5775. Investigation: confirmed with @charles (devX) that this exact PR/push divergence is the cause.

Removing the optimization closes that hole at moderate CI cost.

Cost — measured on PR #5717's diff

Compared the System Tests workflow runs at the same commit content: PR run 26038527628 (skip_empty_scenarios: true) vs. master run 26044854192 (skip_empty_scenarios: false).

Headline

Metric PR (skip=true) Master (skip=false) Δ
Wall-clock (first job → last job) 20.0 min 21.2 min +1.2 min (+6%)
Total job-seconds (sum across all ~280 parallel jobs) 78,063 s = 1,301 min 89,945 s = 1,499 min +11,882 s = +198 min (+15%)
Job count 282 281 -1

Where the cost lives — End-to-end #2 batch

The #2 batch carries the DEBUGGER_SYMDB scenario (per #6962 verification). Same containers, same code, 4–5× longer when the scenario isn't suppressed:

Job PR Master Δ
End-to-end #2 / rack 2 116s 605s +489s
End-to-end #2 / rails42 2 105s 587s +482s
End-to-end #2 / rails52 2 130s 578s +448s
End-to-end #2 / rails61 2 131s 628s +497s
End-to-end #2 / rails72 2 139s 597s +458s
End-to-end #2 / rails80 2 110s 613s +503s
End-to-end #2 / sinatra14 2 99s 627s +528s
End-to-end #2 / sinatra22 2 128s 609s +481s
End-to-end #2 / sinatra32 2 109s 582s +473s
End-to-end #2 / sinatra41 2 125s 588s +463s
End-to-end #2 / uds-rails 2 115s 593s +478s
End-to-end #2 / uds-sinatra 2 112s 617s +505s
#2 batch subtotal 1,419 s 7,224 s +5,805 s ≈ +97 min (~½ of total Δ)

The remaining ~100 min of delta is spread across batches #3, #6, #7, #9, #12, #14 — every batch that contained a previously-suppressed scenario on the PR side.

The actually-failing jobs ran in essentially the same wall time

Schema check selection added one test per job, costing seconds, not minutes:

Job PR Master Δ
End-to-end #10 / uds-rails 10 264s 262s -2s
End-to-end #19 / rails72 19 367s 382s +15s
End-to-end #15 / rails80 15 424s 418s -6s

Bottom line

Flipping the toggle for PRs costs about +1 minute of wall-clock and +200 minutes of runner-time per System Tests run. Cheap on developer feedback latency, ~15% more billable runner minutes per System Tests invocation. Auxiliary schema checks now actually run on PRs.

Change log entry

None.

Additional Notes:

  • The other comparable lever (option B from the investigation) is to keep this toggle and instead ensure every Ruby scenario has at least one non-missing_feature test in manifests/ruby.yml. That works case-by-case but reopens the same trap for every future scenario whose only fired-in-runtime artifact is upload payloads (e.g., the next debugger/AppSec/RUM upload). This PR closes the trap by default.
  • Other tracers keep the optimization; this change is dd-trace-rb-only.

How to test the change?

  • This PR's own System Tests run will execute every scenario (no skip_empty_scenarios-driven deselection). Watch wall-clock and the schema-check phase.

@p-datadog p-datadog added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label May 19, 2026
@dd-octo-sts dd-octo-sts Bot added the dev/github Github repository maintenance and automation label May 19, 2026
@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented May 19, 2026

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 97.09% (-0.01%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 353c536 | Docs | Datadog PR Page | Give us feedback!

Comment thread .github/workflows/system-tests.yml Outdated
@p-datadog p-datadog marked this pull request as ready for review May 20, 2026 13:12
@p-datadog p-datadog requested a review from a team as a code owner May 20, 2026 13:12
@p-datadog p-datadog force-pushed the ci/run-all-system-tests-on-prs branch from cc408fd to 0839782 Compare May 20, 2026 17:35
p-ddsign and others added 2 commits May 27, 2026 17:03
Set `skip_empty_scenarios: false` unconditionally instead of only on
push-to-master / schedule events. PRs now exercise every scenario, so
auxiliary checks (e.g. `Test_DdtraceSchemas::test_library`) catch
schema regressions before merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address review comment from @cbeauchesne: the called workflow's input
has default: false, so the explicit value is unnecessary.

Verified against DataDog/system-tests workflow at the pinned ref
(7711306e344b3cfb843eae940774d2334c434e86): the skip_empty_scenarios
input is declared with default: false. Behavior is unchanged.

- Removed in .github/workflows/system-tests.yml:90

Verified yamllint --strict passes. actionlint not available locally.
@p-datadog p-datadog force-pushed the ci/run-all-system-tests-on-prs branch from 0839782 to 353c536 Compare May 27, 2026 21:04
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 27, 2026

Benchmarks

Benchmark execution time: 2026-05-27 21:29:46

Comparing candidate commit 353c536 in PR branch ci/run-all-system-tests-on-prs with baseline commit f87530c in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@p-datadog p-datadog merged commit a5059f2 into master May 28, 2026
585 checks passed
@p-datadog p-datadog deleted the ci/run-all-system-tests-on-prs branch May 28, 2026 17:31
@dd-octo-sts dd-octo-sts Bot added this to the 2.35.0 milestone May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos dev/github Github repository maintenance and automation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants