[Test] Improve debuggability, stability and coverage of test_dcv_configuration and test_dcv_remote_access #7322
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #7322 +/- ##
========================================
Coverage 90.08% 90.08%
========================================
Files 182 182
Lines 16730 16730
========================================
Hits 15071 15071
Misses 1659 1659
test_dcv_configuration
test_dcv_configuration and test_dcv_remote_access
…figuration` and `test_dcv_remote_access`:
* debuggability: retrieve, print, and analyze a comprehensive report of crashes (not only the crash filename, but also the stack trace of the crash). Also moved from hard assertions to soft assertions, so a final report of all observed failures is produced.
* stability: prevent false-positive failures by ignoring harmless gnome-related crashes unrelated to nvidia or dcv. Also fixed a gap that caused failures when multiple instances of this test run in parallel, by serializing the modifications to the ssh known_hosts file.
* coverage: the test is now able to detect crashes on all supported OSs, not only Ubuntu.
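The move from hard assertions to soft assertions described above can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual implementation; the `SoftAssertions` class name and messages are hypothetical:

```python
class SoftAssertions:
    """Collect assertion failures instead of raising on the first one."""

    def __init__(self):
        self._failures = []

    def check(self, condition, message):
        # Record the failure but keep the test running.
        if not condition:
            self._failures.append(message)

    def assert_all(self):
        # Raise a single error carrying the full report of observed failures.
        if self._failures:
            raise AssertionError(
                "Observed failures:\n" + "\n".join(f"- {m}" for m in self._failures)
            )
```

With this pattern, a test calls `check(...)` at each verification point and `assert_all()` once at the end, so a single run reports every failure instead of stopping at the first.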
[Non-blocking] Is it possible to keep this crash report irrespective of the DCV test, i.e. can we run this crash diagnosis on all the integ test clusters regardless of which test we are running? It can generate false positives (some of which you have already identified) during our test runs, but the advantage I see is that we would become aware of other components that could cause crashes in our product. The same script could be extended to detect Slurm-related core dumps (/var/log, /var/spool/slurmd, /var/tmp/ or /var/spool/abrt) and generate a backtrace [1]. This can be de-scoped to another PR.
[1] https://slurm.schedmd.com/faq.html#backtrace
…on on known_hosts file by keeping the lock file.
```python
    Returns an empty dict if no crashes found.
    """
    result = remote_command_executor.run_remote_script(str(DIAGNOSIS_SCRIPT_DIR / "get_crash_report.sh"), pty=False)
    return json.loads(result.stdout)
```
Can we wrap json.loads in a try/except and provide a meaningful error message including the raw stdout so that if there is a failure due to the script we know what the error is?
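The suggested wrapping could look like the sketch below. The helper name `parse_crash_report` is hypothetical; the idea is just to surface the raw stdout when the script output is not valid JSON:

```python
import json


def parse_crash_report(stdout):
    """Parse the crash-report script output, surfacing raw stdout on failure."""
    try:
        return json.loads(stdout)
    except json.JSONDecodeError as e:
        # Include the raw output so a script failure is easy to diagnose.
        raise RuntimeError(
            f"Failed to parse crash report as JSON: {e}. Raw stdout was:\n{stdout}"
        ) from e
```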
```python
    """Add SSH host keys for hostname, yield, then remove them. Serialized via file lock across processes."""
    lock_file = host_keys_file + ".lock"
    with open(lock_file, "w") as lf:
        fcntl.flock(lf, fcntl.LOCK_EX)
```
We are adding a lock on a file that could be written by other test processes. Do we know of any other tests that modify this file?
Description of changes
Improve stability, debuggability and coverage of `test_dcv_configuration` and `test_dcv_remote_access`.

Tests
- `test_dcv_configuration` succeeded on all supported OSs on both g5g.2xlarge and c5.xlarge
- `test_dcv_remote_access`

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.