Skip to content

[Test] Improve debuggability, stability and coverage of test_dcv_configuration and test_dcv_remote_access#7322

Open
gmarciani wants to merge 3 commits intoaws:developfrom
gmarciani:wip/mgiacomo/3160/test-dcv-0325-1
Open

[Test] Improve debuggability, stability and coverage of test_dcv_configuration and test_dcv_remote_access#7322
gmarciani wants to merge 3 commits intoaws:developfrom
gmarciani:wip/mgiacomo/3160/test-dcv-0325-1

Conversation

@gmarciani
Copy link
Copy Markdown
Contributor

@gmarciani gmarciani commented Mar 30, 2026

Description of changes

Improve stability, debuggability and coverage of test_dcv_configuration and test_dcv_remote_access

  • debuggability: retrieve, print and analyze a comprehensive report of crashes (not only the crash filename, but the stack trace of the crash). Also, moved from hard assertions to soft assertions to have a final report of all the observed failures.
  • stability: prevent false positive failures, by ignoring harmless crashes related to gnome, unrelated to nvidia or dcv. Also fixed a gap that was causing failures when multiple instances of this test are executed in parallel by serializing the modifications to ssh known_hosts.
  • coverage: the test is now able to detect crashes on all supported OSs, not only Ubuntu.

Tests

  1. test_dcv_configuration succeeded on all supported OSs on both g5g.2xlarge and c5.xlarge
  2. ONGOING test_dcv_remote_access

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 9816039 to 8add2d0 Compare March 30, 2026 22:32
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.08%. Comparing base (0a83ff2) to head (8add2d0).
⚠️ Report is 6 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #7322   +/-   ##
========================================
  Coverage    90.08%   90.08%           
========================================
  Files          182      182           
  Lines        16730    16730           
========================================
  Hits         15071    15071           
  Misses        1659     1659           
Flag Coverage Δ
unittests 90.08% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gmarciani gmarciani changed the title DRAFT [Test] Make test_dcv_configuration able to print out the conten… DRAFT [Test] Improve DCV test crash file detection with cross-OS support Mar 31, 2026
@gmarciani gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x Test labels Mar 31, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 2 times, most recently from 24f0bcb to 301e69a Compare March 31, 2026 16:29
@gmarciani gmarciani changed the title DRAFT [Test] Improve DCV test crash file detection with cross-OS support [Test] Improve DCV test crash file detection with cross-OS support Mar 31, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 5 times, most recently from 1b91b14 to 48f5768 Compare April 3, 2026 13:43
@gmarciani gmarciani changed the title [Test] Improve DCV test crash file detection with cross-OS support [Test] Improve stability, debuggability and coverage of test_dcv_configuration Apr 3, 2026
@gmarciani gmarciani changed the title [Test] Improve stability, debuggability and coverage of test_dcv_configuration [Test] Improve debuggability, stability and coverage of test_dcv_configuration Apr 3, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 48f5768 to b89788c Compare April 3, 2026 13:54
@gmarciani gmarciani marked this pull request as ready for review April 3, 2026 13:54
@gmarciani gmarciani requested review from a team as code owners April 3, 2026 13:54
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 3 times, most recently from de5798c to 7b54217 Compare April 3, 2026 14:20
@gmarciani gmarciani changed the title [Test] Improve debuggability, stability and coverage of test_dcv_configuration [Test] Improve debuggability, stability and coverage of test_dcv_configuration and test_dcv_remote_access Apr 3, 2026
…figuration` and `text_dcv_remote_access`:

  * debuggability: retrieve, print and analyze a comprehensive report of crashes (not only the crash filename, but the stack trace of the crash).  Also, moved from hard assertions to soft assertions to have a final report of all the observed failures.
  * stability: prevent false positive failures, by ignoring harmless crashes related to gnome, unrelated to nvidia or dcv. Also fixed a gap that was causing failures when multiple instances of this test are executed in parallel by serializing the modifications to ssh known_hosts.
  * coverage: the test is now able to detect crashes on all supported OSs, not only Ubuntu.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 7b54217 to e5f9c88 Compare April 3, 2026 14:39
@himani2411
Copy link
Copy Markdown
Contributor

himani2411 commented Apr 3, 2026

[Non-blocking] Is it possible to keep this crash report irrespective of the DCV test i.e can we diagnose all the integ test clusters to run this crash report irrespective of which test we are running?

This can generate false positives (some of which you have already identified) during our test runs. The advantage i see is that we will be aware of other components which could cause crashes in our product.

This same script can be extended to detect Slurm related Coredumps (/var/log, /var/spool/slurmd, /var/tmp/ or var/spool/abrt) and generate Backtrace

[1] https://slurm.schedmd.com/faq.html#backtrace
[2] https://slurm.schedmd.com/faq.html#core_dump

This can be de-scoped to another PR.

Returns an empty dict if no crashes found.
"""
result = remote_command_executor.run_remote_script(str(DIAGNOSIS_SCRIPT_DIR / "get_crash_report.sh"), pty=False)
return json.loads(result.stdout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap json.loads in a try/except and provide a meaningful error message including the raw stdout so that if there is a failure due to the script we know what the error is?

"""Add SSH host keys for hostname, yield, then remove them. Serialized via file lock across processes."""
lock_file = host_keys_file + ".lock"
with open(lock_file, "w") as lf:
fcntl.flock(lf, fcntl.LOCK_EX)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are adding a lock on a file which could be written by other test processes? Do we know any other tests which modifies this file?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.x skip-changelog-update Disables the check that enforces changelog updates in PRs Test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants