bugfix: make restart deterministic and fix checkpoint state completeness#72
Conversation
8a52b5e to
e635cef
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e635cef51f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR improves restart reproducibility by expanding checkpoint/save state coverage, fixing restart initialization/order dependencies, making marker replenishment deterministic, preventing output overwrites on same-modelname restarts, and adding CI coverage to validate fresh-vs-restart bitwise equivalence.
Changes:
- Extend checkpointed scalar/state coverage and fix restart read/compute ordering so restarted runs match uninterrupted runs.
- Make marker replenishment deterministic via seeded RNG, and persist/read additional physics fields (e.g., viscosity) to avoid recomputation drift.
- Add a restart reproducibility benchmark target and integrate it into GitHub Actions functional tests.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/functional/3d-evp-regular.cfg | Adjust restart-related config values for CI restart comparison workflow. |
| tests/functional/2d-ep-irregular.cfg | Add restart-related config values for CI restart comparison workflow. |
| parameters.hpp | Add reference_frame_time and info_display_next_step to persisted simulation state. |
| output.hpp | Track restart overwrite/rename behavior and start frame in Output. |
| output.cxx | Write persisted viscosity, expand checkpoint scalars, and add rename-on-restart protection in outputs/checkpoints. |
| markerset.hpp | Update interfaces to support deterministic seeded marker placement. |
| markerset.cxx | Implement seeded RNG for marker placement and seed computation for replenishment paths. |
| dynearthsol.cxx | Initialize/read additional restart state, reorder restart initialization, and persist output scheduling state. |
| binaryio.hpp | Add optional rename-on-open behavior for output writers. |
| binaryio.cxx | Implement .old/.old2/... backup renaming and adjust Array2D load behavior for restart allocations. |
| benchmarks-cores/Makefile | Add fresh-restart-cmp target and safer handling of same-modelname restarts. |
| benchmarks-cores/compare.py | Refine comparison reporting and exit codes; enable “BIT-EXACT” status. |
| array2d.hpp | Add should_strip option to load_from_buffer to avoid unnecessary reallocations. |
| .github/workflows/functional-tests.yml | Add Python deps and new CI steps to run restart reproducibility checks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e635cef to
dfd1cdb
Compare
|
@chaseshyu Great job! 👍 I think this is a comprehensive improvement to the restart functionality. I'll follow the suggested testing procedure and submit my review. |
|
@chaseshyu One question about Output file safety on same-modelname restart. Is the purpose to preserve the data that becomes the initial condition for a restart? If the identity of the two is ensured now, is it still going to be necessary? I'm actually fine with making copies, but then my next question is whether it's okay to overwrite the subsequent save files. That's what's happening but an alternative would be to apply the same policy. I mean, if checkpoint 3 was the restart point, save.000003 and save.000003.old are created in the beginning. When save.000004 exists, it's also renamed save.000004.old and the restarted run writes its own save.000004. |
|
@echoi Yes, if the identical outputs of the two is ensured, the file might be the same with same cfg. For the second question, the primary intention behind this .old implementation is to preserve the restart frame, which might be needed for several trial restarts when doing model setup. In my opinion, while all .vtkhdfs stores the data results for frames, the restart frame functions differently and as part of the configuration and input. Therefore, protecting the restart configuration makes sense to me. These are just some of my thoughts. I'd love to hear if anyone else has a different perspective on this.
|
8160b30 to
59801de
Compare
|
@chaseshyu Thanks for the clarifications. Now I better understand the safety measure for output files. If anyone chose to restart from an earlier frame than the last one and set
Isn't it a Lastly, all in the current set of the suggested tests have been checked off. |
|
@echoi Thanks for your question and suggestion! A restart actually needs both |
59801de to
8952b59
Compare
… in benchmarks-cores/Makefile and enhance comparison logic in compare.py bugfix
…ckpoint processes
…NG_FROM_MODELNAME
…te checks when restart
…now is not exactly at quality_check_step_interval
Previously, mass was computed before the checkpoint values were fully read, leading to incorrect results. This change delays the computation until the state is properly restored.
…e in restart function
… heating calculations
…erministic result
8952b59 to
5caa943
Compare
|
Docs for developer tools |
Background
A restarted simulation should produce output identical to a run that never stopped. In practice, several classes of state were either not checkpointed, read back in the wrong order, or recomputed from scratch in ways that introduced divergence. This branch fixes all of them, adds a tooling target to verify restart reproducibility in CI, and cleans up a handful of related correctness issues found along the way.
Changes
1. Checkpoint state completeness
Several scalar fields were computed fresh on restart instead of being restored from the checkpoint, causing the restarted run to diverge immediately:
dthas_initial_checkpoint)compute_dtmoved before thehas_initial_checkpointwrite so the checkpoint always contains a validdt; on restartdtis loaded from checkpointdt_PT!is_restartingbranch; never set on restartdtunconditionally after the checkpoint/restart block for both fresh and restarted runsmax_global_vel_magcompute_massbefore velocity was loadedreference_frame_timeVariables; persisted in checkpointinfo_display_next_stepmain(); lost on restartVariables; persisted in checkpointdhaccstrain_rate,viscosity,volume_oldThe checkpoint scalar bundle was extended from 3 fields (
time,compensation_pressure,bottom_temperature) to 7 (addinginfo_display_next_step,dt,max_global_vel_mag,reference_frame_time). Both HDF5 and binary paths updated.2. Restart field-read ordering fix for
compute_masscompute_masswas called before the checkpoint fields they depend on were loaded. The new order is loading all save/checkpoint arrays and scalars first and callingcompute_mass(needsmax_global_vel_mag).3. Deterministic marker seeding
append_random_marker_in_elemwas calling an unseededrandom_eta()— drawing from a global PRNG whose state depended on execution history. The function now accepts aseed = element + num_marker_in_elem + stepand usesstd::mt19937with that seed, making marker placement reproducible across fresh runs and restarts.The internal
random_eta_seedhelper was templated to accept both raw pointer and span-like accessors (ShapefnAccessor).4. Viscosity field lifecycle
Previously,
output.cxxcomputed viscosity on-the-fly at output time withmat->visc(e)into a scratch buffer. This meant the saved viscosity was always the current-step recomputed value, not the viscosity that was actually used in that step's stress update. Changed to:init()populatesvar.viscosityfrommat->visc(e)at startup.output.cxxwrites*var.viscositydirectly (already the step's actual viscosity).viscosityfrom the save file, so tidal/shear heating uses the correct value.5. Output file safety on same-modelname restart
When
modelname == restarting_from_modelname, the first output frame on restart would silently overwrite the corresponding save/checkpoint files from the original run. Now:rename_to_old_backup()helper renames the existing file to.old,.old2, etc. (finding the next available slot) before opening for write.Makefileavoids the dangerousrm -f ${MODELNAME}.*cleanup in the same-modelname case.6. Memory leak fix on restart
Array2D::load_from_bufferallocated a new backing array unconditionally even when the array was already allocated. On restart, marker arrays that were pre-sized from the checkpoint had their existing allocation leaked. Fixed by checkinga_ == nullptrbefore allocating, and resizing in place otherwise. Theshould_stripparameter (defaulttrue) preserves the original behaviour for callers that want to shrink the array.7. Restart reproducibility CI
benchmarks-cores/Makefile: newfresh-restart-cmptarget — runs a fresh simulation toNframes, then immediately restarts from framerestarting_from_frameand compares output at frameFRAME. Both the fresh run and the restarted run must produce identical field values..github/workflows/functional-tests.yml: two new CI steps invokefresh-restart-cmpfor the existing2d-ep-irregularand3d-evp-regulartest configs.benchmarks-cores/compare.py: refactored to return(n_fail, n_nonzero)so the caller can distinguish bit-exact / round-off / seriously wrong; exits with code 1 on failure so CI catches divergence. Status line now printsBIT-EXACTwhen all fields are identical. NaN/Inf fields are now caught explicitly (return(1, 1)and print"NaN/Inf — field corrupt") instead of crashing or silently passing. CLI signature simplified to<path/to/old-modelname> <path/to/new-modelname> <frame>; error exits cleanly when either path does not exist.8. Developer documentation
DEVELOPING.md: new "Benchmarks and regression testing" section documenting themake set / cmp / fresh-restart-cmpworkflow andcompare.pyexit codes.benchmarks-cores/Makefile: added a header comment block documenting all variables (CASE,FRAME,OMP,NDIMS,ACC,DEBUG,NSYS,VTK) and all targets.benchmarks-cores/compare.py: added module docstring with full usage, exit-code reference, and a restart-troubleshooting walkthrough (how to isolate which field first diverges using backed-up.oldsave files).Test plan
make ndims=2andmake(3D) build cleanlymake fresh-restart-cmp NDIMS=2 CASE=../tests/functional/2d-ep-irregular.cfgexits 0 with Status: BIT-EXACTmake fresh-restart-cmp CASE=../tests/functional/3d-evp-regular.cfgexits 0 with Status: BIT-EXACTmodelname == restarting_from_modelnameproduces.oldbackup files and does not corrupt the original output🤖 Generated with Claude Code