bugfix: make restart deterministic and fix checkpoint state completeness#72

Merged
echoi merged 19 commits into master from bugfix/restart-deterministic on Jun 8, 2026

@chaseshyu chaseshyu commented Jun 5, 2026

Member

Background

A restarted simulation should produce output identical to a run that never stopped. In practice, several classes of state were either not checkpointed, read back in the wrong order, or recomputed from scratch in ways that introduced divergence. This branch fixes all of them, adds a tooling target to verify restart reproducibility in CI, and cleans up a handful of related correctness issues found along the way.


Changes

1. Checkpoint state completeness

Several scalar fields were computed fresh on restart instead of being restored from the checkpoint, causing the restarted run to diverge immediately:

| Field | Problem | Fix |
|---|---|---|
| dt | Recomputed from scratch; skipped the saved timestep; also zero-initialized when written to the first checkpoint (has_initial_checkpoint) | compute_dt moved before the has_initial_checkpoint write so the checkpoint always contains a valid dt; on restart dt is loaded from checkpoint |
| dt_PT | Only assigned in the !is_restarting branch; never set on restart | Assigned from dt unconditionally after the checkpoint/restart block for both fresh and restarted runs |
| max_global_vel_mag | Zero-initialized; fed into compute_mass before velocity was loaded | Persisted in checkpoint |
| reference_frame_time | Always reset to 0; caused next output frame to be scheduled incorrectly | Added to Variables; persisted in checkpoint |
| info_display_next_step | Was a local variable in main(); lost on restart | Promoted to Variables; persisted in checkpoint |
| dhacc | Not written to checkpoint; surface marker correction state was lost | Added to checkpoint write/read |
| strain_rate, viscosity, volume_old | Not read on restart, leaving the restored variable set incomplete | Now read from the save file on restart |

The checkpoint scalar bundle was extended from 3 fields (time, compensation_pressure, bottom_temperature) to 7 (adding info_display_next_step, dt, max_global_vel_mag, reference_frame_time). Both HDF5 and binary paths updated.

2. Restart field-read ordering fix for compute_mass

compute_mass was called before the checkpoint fields it depends on were loaded. The new order loads all save/checkpoint arrays and scalars first, then calls compute_mass (which needs max_global_vel_mag).

3. Deterministic marker seeding

append_random_marker_in_elem was calling an unseeded random_eta() — drawing from a global PRNG whose state depended on execution history. The function now accepts a seed = element + num_marker_in_elem + step and uses std::mt19937 with that seed, making marker placement reproducible across fresh runs and restarts.

The internal random_eta_seed helper was templated to accept both raw pointer and span-like accessors (ShapefnAccessor).
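The idea behind the seeded draw can be sketched as follows, in Python rather than the PR's C++ (Python's `random.Random` is also a Mersenne Twister generator, so it plays the role of `std::mt19937` here). The seed formula is the one quoted above; the helper name `seeded_eta`, the `nodes_per_elem` parameter, and the assumption that eta are non-negative local coordinates normalized to sum to one are illustrative, not taken from the code.

```python
import random


def marker_seed(element, num_marker_in_elem, step):
    """Seed formula quoted in the PR description."""
    return element + num_marker_in_elem + step


def seeded_eta(seed, nodes_per_elem=3):
    """Draw local coordinates from a private generator (hypothetical helper).

    Assumes eta are non-negative weights normalized to sum to one.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    weights = [rng.random() for _ in range(nodes_per_elem)]
    total = sum(weights)
    return [w / total for w in weights]


# The global generator's history no longer affects the result.
random.seed(1)
first = seeded_eta(marker_seed(element=42, num_marker_in_elem=3, step=1000))
random.random()  # disturb the global PRNG, as unrelated code might
second = seeded_eta(marker_seed(element=42, num_marker_in_elem=3, step=1000))
```

Since the generator is constructed from the seed on every call, `first` and `second` are equal regardless of what ran in between, which is what lets a restarted run place a marker where the uninterrupted run did.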

4. Viscosity field lifecycle

Previously, output.cxx computed viscosity on-the-fly at output time with mat->visc(e) into a scratch buffer. This meant the saved viscosity was always the current-step recomputed value, not the viscosity that was actually used in that step's stress update. Changed to:

  • init() populates var.viscosity from mat->visc(e) at startup.
  • output.cxx writes *var.viscosity directly (already the step's actual viscosity).
  • Restart reads viscosity from the save file, so tidal/shear heating uses the correct value.

5. Output file safety on same-modelname restart

When modelname == restarting_from_modelname, the first output frame on restart would silently overwrite the corresponding save/checkpoint files from the original run. Now:

  • A rename_to_old_backup() helper renames the existing file to .old, .old2, etc. (finding the next available slot) before opening for write.
  • The rename is applied only to the first frame on restart, and only when the modelnames match.
  • Makefile avoids the dangerous rm -f ${MODELNAME}.* cleanup in the same-modelname case.
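The backup policy in the bullets above might look roughly like this. It is a Python sketch of the behaviour as described, not the C++ helper from binaryio.cxx; the `should_backup` guard and its parameter names are hypothetical.

```python
import os


def should_backup(modelname, restarting_from_modelname, is_first_frame):
    """Back up only the first restart frame, and only for matching modelnames."""
    return is_first_frame and modelname == restarting_from_modelname


def rename_to_old_backup(path):
    """Rename `path` to the first free name among path.old, path.old2, ...

    Returns the backup name, or None if there was nothing to rename.
    """
    if not os.path.exists(path):
        return None
    backup = path + ".old"
    n = 2
    while os.path.exists(backup):
        backup = f"{path}.old{n}"
        n += 1
    os.rename(path, backup)
    return backup
```

Probing for the next free slot means repeated trial restarts from the same frame accumulate `.old`, `.old2`, and so on, instead of each restart overwriting the previous backup.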

6. Memory leak fix on restart

Array2D::load_from_buffer allocated a new backing array unconditionally even when the array was already allocated. On restart, marker arrays that were pre-sized from the checkpoint had their existing allocation leaked. Fixed by checking a_ == nullptr before allocating, and resizing in place otherwise. The should_strip parameter (default true) preserves the original behaviour for callers that want to shrink the array.

7. Restart reproducibility CI

  • benchmarks-cores/Makefile: new fresh-restart-cmp target — runs a fresh simulation to N frames, then immediately restarts from frame restarting_from_frame and compares output at frame FRAME. Both the fresh run and the restarted run must produce identical field values.
  • .github/workflows/functional-tests.yml: two new CI steps invoke fresh-restart-cmp for the existing 2d-ep-irregular and 3d-evp-regular test configs.
  • benchmarks-cores/compare.py: refactored to return (n_fail, n_nonzero) so the caller can distinguish bit-exact / round-off / seriously wrong; exits with code 1 on failure so CI catches divergence. Status line now prints BIT-EXACT when all fields are identical. NaN/Inf fields are now caught explicitly (return (1, 1) and print "NaN/Inf — field corrupt") instead of crashing or silently passing. CLI signature simplified to <path/to/old-modelname> <path/to/new-modelname> <frame>; error exits cleanly when either path does not exist.
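One way the `(n_fail, n_nonzero)` convention can separate the three outcomes is sketched below. This is not the code from compare.py: the relative tolerance, the treatment of length mismatches, the function names, and the `FAILED` / `ROUND-OFF` status strings are assumptions; only the tuple shape, the `(1, 1)` result for NaN/Inf, the `BIT-EXACT` label, and exit code 1 on failure come from the description above.

```python
import math


def compare_field(old, new, rel_tol=1e-8):
    """Return (n_fail, n_nonzero) for one field given as flat sequences."""
    values = list(old) + list(new)
    if len(old) != len(new) or not all(math.isfinite(v) for v in values):
        return 1, 1  # corrupt or mismatched field counts as a failure
    max_diff = max((abs(a - b) for a, b in zip(old, new)), default=0.0)
    scale = max((abs(a) for a in old), default=0.0)
    n_nonzero = int(max_diff > 0.0)           # any difference at all
    n_fail = int(max_diff > rel_tol * scale)  # difference beyond round-off
    return n_fail, n_nonzero


def summarize(results):
    """Combine per-field results into (status, exit_code)."""
    n_fail = sum(r[0] for r in results)
    n_nonzero = sum(r[1] for r in results)
    if n_fail:
        return "FAILED", 1
    if n_nonzero == 0:
        return "BIT-EXACT", 0
    return "ROUND-OFF", 0
```

Keeping the two counters separate lets a caller accept round-off differences in an ordinary regression comparison while still requiring `n_nonzero == 0` for a restart check.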

8. Developer documentation

  • DEVELOPING.md: new "Benchmarks and regression testing" section documenting the make set / cmp / fresh-restart-cmp workflow and compare.py exit codes.
  • benchmarks-cores/Makefile: added a header comment block documenting all variables (CASE, FRAME, OMP, NDIMS, ACC, DEBUG, NSYS, VTK) and all targets.
  • benchmarks-cores/compare.py: added module docstring with full usage, exit-code reference, and a restart-troubleshooting walkthrough (how to isolate which field first diverges using backed-up .old save files).

Test plan

  • make ndims=2 and make (3D) build cleanly
  • In benchmarks-cores, make fresh-restart-cmp NDIMS=2 CASE=../tests/functional/2d-ep-irregular.cfg exits 0 with Status: BIT-EXACT
  • In benchmarks-cores, make fresh-restart-cmp CASE=../tests/functional/3d-evp-regular.cfg exits 0 with Status: BIT-EXACT
  • Restarting with modelname == restarting_from_modelname produces .old backup files and does not corrupt the original output
  • CI functional-tests workflow passes (2D + 3D restart steps green)

🤖 Generated with Claude Code

@chaseshyu chaseshyu force-pushed the bugfix/restart-deterministic branch 4 times, most recently from 8a52b5e to e635cef, on June 5, 2026 14:01
@chaseshyu chaseshyu requested review from echoi, sungho91 and tan2 June 5, 2026 14:11
@chaseshyu chaseshyu added the enhancement (New feature or request) and bugfixes (Fix bugs) labels Jun 5, 2026
@chaseshyu chaseshyu marked this pull request as ready for review June 5, 2026 14:16
Copilot AI review requested due to automatic review settings June 5, 2026 14:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e635cef51f

Comment thread output.cxx
Comment thread markerset.cxx

Copilot AI left a comment

Contributor

Pull request overview

This PR improves restart reproducibility by expanding checkpoint/save state coverage, fixing restart initialization/order dependencies, making marker replenishment deterministic, preventing output overwrites on same-modelname restarts, and adding CI coverage to validate fresh-vs-restart bitwise equivalence.

Changes:

  • Extend checkpointed scalar/state coverage and fix restart read/compute ordering so restarted runs match uninterrupted runs.
  • Make marker replenishment deterministic via seeded RNG, and persist/read additional physics fields (e.g., viscosity) to avoid recomputation drift.
  • Add a restart reproducibility benchmark target and integrate it into GitHub Actions functional tests.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| tests/functional/3d-evp-regular.cfg | Adjust restart-related config values for CI restart comparison workflow. |
| tests/functional/2d-ep-irregular.cfg | Add restart-related config values for CI restart comparison workflow. |
| parameters.hpp | Add reference_frame_time and info_display_next_step to persisted simulation state. |
| output.hpp | Track restart overwrite/rename behavior and start frame in Output. |
| output.cxx | Write persisted viscosity, expand checkpoint scalars, and add rename-on-restart protection in outputs/checkpoints. |
| markerset.hpp | Update interfaces to support deterministic seeded marker placement. |
| markerset.cxx | Implement seeded RNG for marker placement and seed computation for replenishment paths. |
| dynearthsol.cxx | Initialize/read additional restart state, reorder restart initialization, and persist output scheduling state. |
| binaryio.hpp | Add optional rename-on-open behavior for output writers. |
| binaryio.cxx | Implement .old/.old2/... backup renaming and adjust Array2D load behavior for restart allocations. |
| benchmarks-cores/Makefile | Add fresh-restart-cmp target and safer handling of same-modelname restarts. |
| benchmarks-cores/compare.py | Refine comparison reporting and exit codes; enable “BIT-EXACT” status. |
| array2d.hpp | Add should_strip option to load_from_buffer to avoid unnecessary reallocations. |
| .github/workflows/functional-tests.yml | Add Python deps and new CI steps to run restart reproducibility checks. |

Comment thread markerset.cxx
Comment thread markerset.cxx
Comment thread benchmarks-cores/Makefile Outdated
Comment thread dynearthsol.cxx Outdated
Comment thread benchmarks-cores/compare.py
Comment thread dynearthsol.cxx
@chaseshyu chaseshyu force-pushed the bugfix/restart-deterministic branch from e635cef to dfd1cdb on June 5, 2026 19:22
@echoi

echoi commented Jun 5, 2026

Contributor

@chaseshyu Great job! 👍 I think this is a comprehensive improvement to the restart functionality. I'll follow the suggested testing procedure and submit my review.

@echoi

echoi commented Jun 6, 2026

Contributor

@chaseshyu One question about Output file safety on same-modelname restart. Is the purpose to preserve the data that becomes the initial condition for a restart? If the identity of the two is ensured now, is it still going to be necessary?

I'm actually fine with making copies, but then my next question is whether it's okay to overwrite the subsequent save files. That's what's happening but an alternative would be to apply the same policy. I mean, if checkpoint 3 was the restart point, save.000003 and save.000003.old are created in the beginning. When save.000004 exists, it's also renamed save.000004.old and the restarted run writes its own save.000004.

@chaseshyu

Member Author

@echoi Yes, if the two runs are guaranteed to produce identical output, the file would be the same given the same cfg.
However, there is another scenario where the user modifies a parameter in the restart configuration file, such as the boundary velocity, thermal perturbation, or plstrain for the weak zone. That modification could alter the model variables within the same frame's .vtkhdf, leading to a contaminated restart frame and a different visualization.

For the second question, the primary intention behind the .old implementation is to preserve the restart frame, which might be needed for several trial restarts during model setup. In my opinion, while every .vtkhdf stores the results for a frame, the restart frame functions differently: it is part of the configuration and input. Therefore, protecting it makes sense to me.

These are just some of my thoughts. I'd love to hear if anyone else has a different perspective on this.

@chaseshyu chaseshyu force-pushed the bugfix/restart-deterministic branch 2 times, most recently from 8160b30 to 59801de, on June 6, 2026 17:55
@echoi

echoi commented Jun 8, 2026

Contributor

@chaseshyu Thanks for the clarifications. Now I better understand the safety measure for output files. If anyone chose to restart from an earlier frame than the last one and set modelname to be equal to restarting_from_modelname, the intention would be understood as literally restarting the original model, not comparing new results for changed parameters or BCs with the original ones (steering in this case). If the latter was intended, the user would (and should) choose a different modelname. It is a reasonable assumption and we cannot be responsible for every user mistake. :)

to preserve the restart frame, which might be needed for several trial restarts when doing model setup.

Isn't it a .chkpt file, not .save that is used for restarting? And this PR ensures the identity of the first .save files of restarted runs as long as they start from the same frame. I guess it's still useful to have a .save.old: Users can rule out at least one possibility when anything goes wrong with restarting. For that purpose, a short documentation on how to use compare.py would be helpful. What do you think, @chaseshyu? Or is it already there?

Lastly, all of the currently suggested tests have been checked off.

@chaseshyu

chaseshyu commented Jun 8, 2026

Member Author

@echoi Thanks for your question and suggestion! A restart actually needs both the .chkpt and .save files (e.g., velocity, temperature, strain, stress, plastic strain, radiogenic source, pore pressure, and marker data). Because only the .save file is generated in the first output, we only create a .old backup for .save.vtkhdf. As for compare.py, it is currently just a developer tool in benchmarks-cores used to compare model differences during development (started by @tan2). I am not sure whether general users need it, but I'd be happy to add some documentation for the benchmarks-cores tools to both compare.py and DEVELOPING.md.

@chaseshyu chaseshyu force-pushed the bugfix/restart-deterministic branch from 59801de to 8952b59 on June 8, 2026 16:31
@chaseshyu chaseshyu force-pushed the bugfix/restart-deterministic branch from 8952b59 to 5caa943 on June 8, 2026 17:03
@chaseshyu

Member Author

Docs for the developer tools in benchmarks-cores (Makefile and compare.py) have been updated, both in the files themselves and in DEVELOPING.md. @echoi

@echoi echoi merged commit 92a4a04 into master Jun 8, 2026
13 checks passed
@echoi echoi deleted the bugfix/restart-deterministic branch June 8, 2026 19:26
Labels

bugfixes (Fix bugs), enhancement (New feature or request)

3 participants