[Test] Improve debuggability, stability and coverage of `test_dcv_configuration` and `test_dcv_remote_access` by gmarciani · Pull Request #7322 · aws/aws-parallelcluster

gmarciani · 2026-03-30T22:30:45Z

Description of changes

Improve stability, debuggability and coverage of test_dcv_configuration and test_dcv_remote_access

debuggability: retrieve, print and analyze a comprehensive report of crashes (not only the crash filename, but the stack trace of the crash). Also, moved from hard assertions to soft assertions to have a final report of all the observed failures.
stability: prevent false positive failures, by ignoring harmless crashes related to gnome, unrelated to nvidia or dcv. Also fixed a gap that was causing failures when multiple instances of this test are executed in parallel by serializing the modifications to ssh known_hosts.
coverage: the test is now able to detect crashes on all supported OSs, not only Ubuntu.

Tests

test_dcv_configuration succeeded on all supported OSs on both g5g.2xlarge and c5.xlarge
ONGOING test_dcv_remote_access

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2026-03-30T22:38:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.08%. Comparing base (0a83ff2) to head (8add2d0).
⚠️ Report is 6 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #7322   +/-   ##
========================================
  Coverage    90.08%   90.08%           
========================================
  Files          182      182           
  Lines        16730    16730           
========================================
  Hits         15071    15071           
  Misses        1659     1659

Flag	Coverage Δ
unittests	`90.08% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…figuration` and `text_dcv_remote_access`: * debuggability: retrieve, print and analyze a comprehensive report of crashes (not only the crash filename, but the stack trace of the crash). Also, moved from hard assertions to soft assertions to have a final report of all the observed failures. * stability: prevent false positive failures, by ignoring harmless crashes related to gnome, unrelated to nvidia or dcv. Also fixed a gap that was causing failures when multiple instances of this test are executed in parallel by serializing the modifications to ssh known_hosts. * coverage: the test is now able to detect crashes on all supported OSs, not only Ubuntu.

himani2411 · 2026-04-03T17:06:22Z

[Non-blocking] Is it possible to keep this crash report irrespective of the DCV test i.e can we diagnose all the integ test clusters to run this crash report irrespective of which test we are running?

This can generate false positives (some of which you have already identified) during our test runs. The advantage i see is that we will be aware of other components which could cause crashes in our product.

This same script can be extended to detect Slurm related Coredumps (/var/log, /var/spool/slurmd, /var/tmp/ or var/spool/abrt) and generate Backtrace

[1] https://slurm.schedmd.com/faq.html#backtrace
[2] https://slurm.schedmd.com/faq.html#core_dump

This can be de-scoped to another PR.

…on on known_hosts file by keeping the lock file.

himani2411 · 2026-04-03T20:14:56Z

tests/integration-tests/tests/dcv/test_dcv.py

+    Returns an empty dict if no crashes found.
+    """
+    result = remote_command_executor.run_remote_script(str(DIAGNOSIS_SCRIPT_DIR / "get_crash_report.sh"), pty=False)
+    return json.loads(result.stdout)


Can we wrap json.loads in a try/except and provide a meaningful error message including the raw stdout so that if there is a failure due to the script we know what the error is?

himani2411 · 2026-04-03T20:16:38Z

tests/integration-tests/tests/dcv/test_dcv.py

+    """Add SSH host keys for hostname, yield, then remove them. Serialized via file lock across processes."""
+    lock_file = host_keys_file + ".lock"
+    with open(lock_file, "w") as lf:
+        fcntl.flock(lf, fcntl.LOCK_EX)


We are adding a lock on a file which could be written by other test processes? Do we know any other tests which modifies this file?

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 9816039 to 8add2d0 Compare March 30, 2026 22:32

gmarciani changed the title ~~DRAFT [Test] Make test_dcv_configuration able to print out the conten…~~ DRAFT [Test] Improve DCV test crash file detection with cross-OS support Mar 31, 2026

gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x Test labels Mar 31, 2026

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 2 times, most recently from 24f0bcb to 301e69a Compare March 31, 2026 16:29

gmarciani changed the title ~~DRAFT [Test] Improve DCV test crash file detection with cross-OS support~~ [Test] Improve DCV test crash file detection with cross-OS support Mar 31, 2026

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 5 times, most recently from 1b91b14 to 48f5768 Compare April 3, 2026 13:43

gmarciani changed the title ~~[Test] Improve DCV test crash file detection with cross-OS support~~ [Test] Improve stability, debuggability and coverage of test_dcv_configuration Apr 3, 2026

gmarciani changed the title ~~[Test] Improve stability, debuggability and coverage of test_dcv_configuration~~ [Test] Improve debuggability, stability and coverage of test_dcv_configuration Apr 3, 2026

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 48f5768 to b89788c Compare April 3, 2026 13:54

gmarciani marked this pull request as ready for review April 3, 2026 13:54

gmarciani requested review from a team as code owners April 3, 2026 13:54

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch 3 times, most recently from de5798c to 7b54217 Compare April 3, 2026 14:20

gmarciani changed the title ~~[Test] Improve debuggability, stability and coverage of test_dcv_configuration~~ [Test] Improve debuggability, stability and coverage of test_dcv_configuration and test_dcv_remote_access Apr 3, 2026

gmarciani force-pushed the wip/mgiacomo/3160/test-dcv-0325-1 branch from 7b54217 to e5f9c88 Compare April 3, 2026 14:39

gmarciani added 2 commits April 3, 2026 13:36

[Test] Extract content of crash file in Ubuntu with Apport module.

0d760d4

[Test] Prevent test failures caused by race conditions on r/w operati…

1674894

…on on known_hosts file by keeping the lock file.

himani2411 reviewed Apr 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Test] Improve debuggability, stability and coverage of `test_dcv_configuration` and `test_dcv_remote_access`#7322

[Test] Improve debuggability, stability and coverage of `test_dcv_configuration` and `test_dcv_remote_access`#7322
gmarciani wants to merge 3 commits intoaws:developfrom
gmarciani:wip/mgiacomo/3160/test-dcv-0325-1

gmarciani commented Mar 30, 2026 •

edited

Loading

Uh oh!

codecov bot commented Mar 30, 2026 •

edited

Loading

Uh oh!

himani2411 commented Apr 3, 2026 •

edited

Loading

Uh oh!

himani2411 Apr 3, 2026

Uh oh!

himani2411 Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gmarciani commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Tests

Uh oh!

codecov bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

himani2411 commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

himani2411 Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

himani2411 Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gmarciani commented Mar 30, 2026 •

edited

Loading

codecov bot commented Mar 30, 2026 •

edited

Loading

himani2411 commented Apr 3, 2026 •

edited

Loading