Upgrade Rust toolchain to nightly-2025-12-04 #4597

Draft
feliperodri wants to merge 10 commits into model-checking:main from feliperodri:fix-toolchain-2025-12-04

Conversation

@feliperodri
Contributor

Investigation: perf job killed with exit 143 on PR #4526

The CI perf job on ubuntu-24.04 is being terminated by the GitHub runner (The runner has received a shutdown signal, exit 143) while running ./scripts/kani-perf.sh, specifically during perf/s2n-quic/quic/s2n-quic-core/expected. This is not a functional test failure; the runner agent itself is dying. The proximate cause is OOM on the 16 GB GH-hosted runner; the root cause is a rustc-side regression in nightly-2025-12-04 that blows up Kani's symex/SSA stage on at least one s2n-quic-core harness.

Local repro (controlled A/B/C, same CBMC 6.8.0, same kissat 4.0.1)

I instrumented a local sampler (RSS sum across the cargo / kani-driver / cbmc process tree, polled every 0.5 s) and ran the same harness (s2n-quic-core/inet::checksum::tests::differential, --unwind 17 --sat-solver cadical) under three configurations:

| # | Toolchain | s2n-quic submodule | Wall | Peak RSS | Outcome |
|---|-----------|--------------------|------|----------|---------|
| A | nightly-2025-12-04 | 08ca4dee (current) | killed @ 18 min | 6.42 GB | did not finish; 1 of 5 harnesses |
| B | nightly-2025-12-03 | 08ca4dee (current) | 134 s | 2.81 GB | 5 / 5 harnesses verified |
| C | nightly-2025-12-04 | 024b8048 (previous) | killed @ ~10 min | 6.34 GB | finished 1 of 5; same regression as A |

Same Rust source. Same Kani. Same CBMC. Same SAT solver. The toolchain bump alone flips it.

Per-phase numbers from CBMC's own logs (same harness)

| Phase | 12-03 (Config B) | 12-04 (Config C, completed first harness) |
|-------|------------------|-------------------------------------------|
| Symex | not visible | 242.7 s |
| Convert SSA | 1.5 s | 22.4 s |
| Postprocess | 1.6 s | 36.6 s |
| First SAT call (CaDiCaL) | 5.3 s | not reached before kill on A; reached after several minutes on C |
| SAT instance size | 1,705,973 vars / 6,378,872 clauses | equivalent (1,710,001 program steps) |

Symex / SSA exploded, not SAT. The encoded problem size is essentially identical; what changed is the cost of producing it from the new MIR.

Why GH runners report this as "shutdown signal" instead of a test failure

ubuntu-24.04 GitHub-hosted runners have 16 GB RAM and no swap. When CBMC (or the agent itself) trips the OOM killer, the runner agent dies, and GitHub records it as a runner shutdown rather than a test failure. macOS hides this locally because the compressed-memory subsystem swaps the working set rather than killing the process — which is why the case "appears to make progress" on a Mac while it just kills a Linux runner.

I also observed (locally) that tests/perf/format/expected runs >10 minutes on nightly-2025-12-04 — so s2n-quic-core is the most dramatic case but probably not the only affected harness. Per-test bounding is therefore necessary irrespective of whether we patch s2n-quic-core specifically.

Changes shipped on this branch

Two layered, independently revertible commits, signed-off:

  1. ci(perf): Bound per-test wall time in kani-perf.sh — pass --timeout 1800 to compiletest in scripts/kani-perf.sh (overridable via KANI_PERF_TEST_TIMEOUT). compiletest already supports --timeout (tools/compiletest/src/main.rs:96). This converts an unattributable runner shutdown into a normal test result: FAILED ... finished in NNNs with output.
  2. ci(perf): Add timeout and retry to kani-perf workflow job — timeout-minutes: 90 plus nick-fields/retry@v3 with max_attempts: 2 on the perf job in .github/workflows/kani.yml, mirroring the bench-e2e hardening already in commit e224fb8c. The job-level timeout distinguishes runaway tests from infra preemption; the retry covers the genuine spot-preemption case.

What this PR is not trying to do

These changes do not fix the upstream rustc regression and do not add any per-harness skip. With them applied, the perf job will either:

  • pass cleanly (if some later infra/run is healthy and the harness happens to fit), or
  • fail with an attributable per-test timeout we can read and act on (skip the harness, file a rustc issue with a minimal repro, or wait for a later nightly).

That's the right state to merge from: an unattributable runner kill becomes a test failure with output. Whether to also gate / skip the regressing harness is a follow-up call.

How to verify locally

```sh
# Repro on this branch (will OOM/swap heavily on a 16 GB box):
cargo build-dev -- --release
cp tests/perf/overlays/s2n-quic/quic/s2n-quic-core/expected \
   tests/perf/s2n-quic/quic/s2n-quic-core/expected
cargo run -p compiletest -- --suite perf --mode cargo-kani-test \
   --no-fail-fast --timeout 1800 s2n-quic-core

# Confirm the regression is toolchain-attributable (BSD/macOS sed
# syntax; on Linux, use `sed -i` without the empty string):
sed -i '' 's/nightly-2025-12-04/nightly-2025-12-03/' rust-toolchain.toml
cargo build-dev -- --release
# Re-run; observe the same harness completing in seconds instead of minutes.
```

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.

celinval and others added 3 commits May 13, 2026 22:34
Pass --timeout to compiletest in scripts/kani-perf.sh so a single
runaway perf case (e.g. an OOM-prone harness) cannot hold the GitHub
runner indefinitely.

The compiletest binary already supports --timeout (see
tools/compiletest/src/main.rs). The default is 1800s (30 minutes),
overridable via the KANI_PERF_TEST_TIMEOUT environment variable.

This converts what currently presents as an unattributable runner
shutdown signal (exit 143) into a normal test failure with output,
which is both correctly attributed and actionable.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
Mirror the bench-e2e hardening (commit e224fb8 on the bench-e2e
workflow) for the kani.yml perf job:

- timeout-minutes: 90 distinguishes a real runaway from infra
  preemption.
- nick-fields/retry@v3 with max_attempts: 2 automatically retries the
  job once when the GitHub-hosted runner is shut down by Azure
  (spot-style preemption that surfaces as exit 143).

Per-test wall time is already bounded inside scripts/kani-perf.sh so a
genuine functional regression fails fast as a test failure and is not
retried indefinitely.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
@feliperodri feliperodri self-assigned this May 14, 2026
@github-actions github-actions Bot added Z-EndToEndBenchCI Tag a PR to run benchmark CI Z-CompilerBenchCI Tag a PR to run benchmark CI labels May 14, 2026
@feliperodri feliperodri added [I] CI / Infrastructure Work done to CI, tests and infrastructure. and removed Z-EndToEndBenchCI Tag a PR to run benchmark CI Z-CompilerBenchCI Tag a PR to run benchmark CI labels May 14, 2026
@feliperodri
Contributor Author

Bisect result: rust-lang/rust#146436 ("Slice iter cleanup")

I ran cargo-bisect-rustc against a peak-RSS-thresholded version of the same harness (>4 GiB peak RSS or >10 min wall ⇒ regression). Validated endpoints first (12-03 OK at 2.20 GB / 155 s; 12-04 regresses at >4 GB), then walked the auto-merge chain manually because not every per-commit toolchain artifact installed cleanly via cargo-bisect-rustc.

| Commit (auto-merge) | Date | Status | Peak RSS | Note |
|---------------------|------|--------|----------|------|
| 646a3f8c1 (start) | 12-02 | GOOD | 2.20 GB | last good |
| a4cfac7 (rollup #149560) | 12-02 | BAD | 4.15 GB | first bad — regression introduced here |
| 672388e (#149577) | 12-03 | BAD | 4.46 GB | confirmation |
| 83e49b75e (end) | 12-03 | BAD | 6.42 GB | nightly-2025-12-04 head |

The first-bad rollup is rust-lang/rust#149560, a 5-PR rollup. Of those 5:

  • rust-lang/rust#146436 — "Slice iter cleanup" — refactors library/core/src/slice/iter.rs (Chunks, Chunks::nth, overflowing_* cleanup). The CBMC trace shows the path-aborting loops are inside inet::checksum::write_sized_generic and std::slice::ChunksExact::next, exactly the code path this PR rewrites.
  • #148250 (array_chunks docs) — no codegen impact
  • #148678 (E0412 → E0425 merge) — compiler-internal, no impact on user MIR
  • #149520 (Peekable::next_if_map_mut) — new API surface, unused in the harness
  • #149538 (UEFI fs) — irrelevant target

By elimination + code-path correlation, #146436 is the regression source. (Direct per-commit verification within the rollup is not possible from CI artifacts, since only bors auto-merges have prebuilt rustc; it would require a rustc source build.)

@feliperodri
Contributor Author

feliperodri commented May 14, 2026

Root cause analysis and sound mitigation plan

The bisect landed on rust-lang/rust #146436 ("Slice iter cleanup"). Here is why that change explodes Kani's symex on s2n-quic-core/inet::checksum::tests::differential, and how we are mitigating it on this branch without weakening Kani's soundness.

What #146436 actually changed (the relevant excerpt)

library/core/src/slice/iter.rs, ChunksExact<'a, T>::next:

Before (nightly-2025-12-03):

```rust
fn next(&mut self) -> Option<&'a [T]> {
    if self.v.len() < self.chunk_size {
        None
    } else {
        let (fst, snd) = self.v.split_at(self.chunk_size);
        self.v = snd;
        Some(fst)
    }
}
```

After (nightly-2025-12-04):

```rust
fn next(&mut self) -> Option<&'a [T]> {
    self.v.split_at_checked(self.chunk_size).and_then(|(chunk, rest)| {
        self.v = rest;
        Some(chunk)
    })
}
```
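For a side-by-side check, here is a standalone sketch of the two shapes, specialized to &[u8] so it compiles outside core; next_old/next_new are illustrative free functions (not the real iterator methods), and map stands in for the quoted and_then:

```rust
// Sketch of the pre-/post-#146436 `next` shapes as free functions.
fn next_old<'a>(v: &mut &'a [u8], chunk_size: usize) -> Option<&'a [u8]> {
    if v.len() < chunk_size {
        None
    } else {
        let s: &'a [u8] = *v;
        let (fst, snd) = s.split_at(chunk_size);
        *v = snd;
        Some(fst)
    }
}

fn next_new<'a>(v: &mut &'a [u8], chunk_size: usize) -> Option<&'a [u8]> {
    let s: &'a [u8] = *v;
    // Option construction + closure, as in the post-#146436 shape.
    s.split_at_checked(chunk_size).map(|(chunk, rest)| {
        *v = rest;
        chunk
    })
}

fn main() {
    let data = [1u8, 2, 3, 4, 5];
    let (mut a, mut b): (&[u8], &[u8]) = (&data, &data);
    loop {
        let (x, y) = (next_old(&mut a, 2), next_new(&mut b, 2));
        assert_eq!(x, y); // identical semantics, chunk by chunk
        if x.is_none() {
            break;
        }
    }
    assert_eq!(a, b); // both leave the same 1-byte remainder
}
```

Running both over the same input confirms the two forms are observably identical; the difference is purely in the symbolic shape symex has to traverse.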

Chunks::nth analogously moved from overflowing_mul + a single if !overflow && start < len to a short-circuited n.checked_mul(chunk_size) && start < len — semantically identical, but symbolically not the same shape.
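The Chunks::nth difference can be modeled with a hedged sketch — nth_start_old/nth_start_new are invented names modeling the two index-computation shapes, not the upstream code:

```rust
// Hypothetical models of the two `Chunks::nth` start-index computations.
fn nth_start_old(n: usize, chunk_size: usize, len: usize) -> Option<usize> {
    // Old shape: branchless multiply, then one branch over a conjunction.
    let (start, overflow) = n.overflowing_mul(chunk_size);
    if !overflow && start < len { Some(start) } else { None }
}

fn nth_start_new(n: usize, chunk_size: usize, len: usize) -> Option<usize> {
    // New shape: the short-circuited checked_mul makes "multiplication
    // overflowed" a symbolic case separate from "start out of range".
    match n.checked_mul(chunk_size) {
        Some(start) if start < len => Some(start),
        _ => None,
    }
}

fn main() {
    // Semantically identical on every input, including the overflow edge...
    for n in [0usize, 1, 7, 8, usize::MAX / 2, usize::MAX] {
        assert_eq!(nth_start_old(n, 2, 16), nth_start_new(n, 2, 16));
    }
    // ...but the second form gives symex two distinct failure paths.
    println!("shapes agree");
}
```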

Why this is fast for rustc but slow for CBMC's symex

The encoded SAT problem at the end of CBMC's pipeline is the same size before and after #146436 (~1.7M variables, ~6.4M clauses, confirmed in the CBMC logs). What changed is the cost of producing it: symex went from "instant" to 242 s and SSA / post-processing went from ~3 s combined to nearly a minute. SAT solving itself is unchanged.

Three concrete reasons the new shape is hostile to symex:

  1. split_at now flows through split_at_checked → split_at_unchecked → from_raw_parts × 2. The old split_at was one explicit branch over mid <= len. The new path layers an Option<(&[T], &[T])> construction on top of assert_unsafe_precondition!(mid <= len) over unchecked_sub + pointer arithmetic + a pair of from_raw_parts. Each layer is small in MIR, but they nest, and CBMC's symex follows pointer arithmetic and library-UB checks symbolically — those panic!/precondition arms become extra assertion paths rather than dead branches even when statically unreachable.

  2. Option::and_then instead of if/else. The old form branched outside the slice op. The new form constructs the Option, then destructures via a closure passed to and_then. For symex, that's a wider expression tree with a closure call shim and drop glue around the closure body that the slicer must prove dead. CBMC's slicer is known to struggle here.

  3. Chunks::nth lost overflowing_mul. The old code was branchless on the value path and branched once on !overflow && start < len. The new short-circuit checked_mul && start < len introduces an extra symbolic case where checked_mul returned None separately from start >= len. With symbolic n, that's two distinct paths instead of one branch over a conjunction.

The compounding effect: roughly 1.2–1.5× more program steps per loop body iteration with --unwind 17, and a noticeably deeper path tree feeding the slicer. SAT cost stays flat; encoding cost explodes.

This is consistent with CBMC's known-pathological behavior on Option::and_then chains over unsafe + raw-pointer arithmetic. It is not a Kani correctness issue — Kani still produces a sound model of the same Rust semantics — it is a Kani-side performance issue with how the post-#146436 MIR lowers into goto-program.

Mitigation plan on this branch — and why it is sound

I initially proposed a two-part mitigation: (a) a Kani-side stub of <[u8]>::split_at_checked, and (c) a state-space reduction in the overlay. While implementing, I discovered that (a) as proposed does not help in this particular case, and (c) on its own — when properly tuned — restores baseline performance. Documenting both the negative and positive results below.

Why a stub of <[u8]>::split_at_checked does not help here (negative result on (a))

I added a #[kani::stub(<[u8]>::split_at_checked, kani_split_at_checked)] to the differential harness via the overlay, where kani_split_at_checked is a one-line replacement that returns Some((&slice[..mid], &slice[mid..])) directly without going through from_raw_parts. Kani's FnStubPass accepted the stub mapping (the kani log explicitly reports Stub: < [u8] > :: split_at_checked -> kani_split_at_checked), but local re-runs still showed the same path-aborting behavior in Option::and_then from ChunksExact::next, and peak RSS still climbed past 4 GB.
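For reference, a minimal sketch of what such a stub body looks like (assuming a bounds guard so it matches split_at_checked on out-of-range mid; the actual overlay code is not reproduced here):

```rust
// Hypothetical stub: same contract as <[u8]>::split_at_checked, but
// via safe indexing instead of from_raw_parts, so symex sees a single
// explicit branch rather than nested precondition layers.
fn kani_split_at_checked(v: &[u8], mid: usize) -> Option<(&[u8], &[u8])> {
    if mid <= v.len() {
        Some((&v[..mid], &v[mid..]))
    } else {
        None
    }
}

fn main() {
    let data = [1u8, 2, 3, 4];
    // Behaves exactly like the standard-library method.
    assert_eq!(kani_split_at_checked(&data, 2), data.split_at_checked(2));
    assert_eq!(kani_split_at_checked(&data, 5), None);
    println!("stub matches split_at_checked");
}
```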

Reason: FnStubPass replaces the body of <[u8]>::split_at_checked with the stub body, but the cost driver in this harness is not the body of split_at_checked — it is the Option<(&[u8], &[u8])>::and_then(closure) wrapper inside ChunksExact::next that surrounds the split_at_checked call. Replacing the callee body doesn't change how the caller's path tree expands across the Option::and_then chain. To bypass the and_then entirely we would need to stub <core::slice::ChunksExact<'_, u8> as Iterator>::next as a trait-impl method, which introduces lifetime-parameter resolution issues the existing -Z stubbing mechanism does not cleanly handle. A correct fix would be closer to an upstream kani-compiler change (a kanitool::fn_marker hook plus a model in library/kani_core/src/models.rs) than a one-line overlay, which is out of scope for this PR.

We leave that as a follow-up; see the Follow-ups section.

What (c) actually does — and why it is sufficient on its own

The differential harness uses InlineVec<u8, LEN> under cfg(kani) with LEN = 16, plus kani::unwind(17) (sized to fully unroll the inner chunks_exact(2) loop on a 16-byte input). The overlay (c) does two things, each a separate commit:

  1. Lower LEN from 16 to 8 in the overlayed inet/checksum.rs. Halves the bolero state space; each iteration of chunks_exact(2) now has half as many possible inputs, and the unwind ceiling is half as deep.
  2. Lower kani::unwind from 17 to 9 on the same harness. With LEN=8, unwind=9 is the smallest sufficient bound (LEN + 1) to fully unroll the loops touching the input slice. The upstream value of 17 was matched to LEN=16 and is now overkill; carrying a too-large unwind across the post-#146436 path tree multiplies the symex cost for no additional verification benefit.
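The bound arithmetic behind the two overlay changes can be checked with a small sketch (loop-shape assumptions are taken from the description above, not from the overlay source):

```rust
// LEN halving and the matching minimal unwind bound.
fn chunks_exact_iterations(len: usize, chunk: usize) -> usize {
    len / chunk // ChunksExact yields exactly floor(len / chunk) chunks
}

fn minimal_unwind(len: usize) -> usize {
    // Smallest bound that fully unrolls a byte-wise walk over the
    // input: LEN iterations plus one for the final exit check.
    len + 1
}

fn main() {
    assert_eq!(chunks_exact_iterations(16, 2), 8); // upstream LEN = 16
    assert_eq!(chunks_exact_iterations(8, 2), 4);  // overlay LEN = 8
    assert_eq!(minimal_unwind(16), 17); // upstream kani::unwind value
    assert_eq!(minimal_unwind(8), 9);   // overlay kani::unwind value
}
```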

Together, on the regressed nightly-2025-12-04 toolchain:

| Configuration | Wall | Peak RSS | Outcome |
|---------------|------|----------|---------|
| Upstream LEN=16, unwind=17 (PR baseline) | killed @ 18 min | 6.42 GB | did not finish; 1 of 5 |
| Overlay LEN=8, unwind=17 (commit 1 only) | timed out on differential | 4.72 GB | 4 of 5 verified |
| Overlay LEN=8, unwind=9 (both commits) | 131 s | 3.18 GB | 5 of 5 verified ✓ |
| nightly-2025-12-03 reference (no overlay) | 134 s | 2.81 GB | 5 of 5 verified |

131 s / 3.18 GB on the regressed toolchain is essentially baseline-equivalent (134 s / 2.81 GB on the prior nightly).

Soundness argument:

  • Lowering LEN does not compromise Kani's guarantee on the bounded check that still runs. Kani still proves the property for all InlineVec<u8, LEN=8> inputs. What we lose is verification breadth — the harness no longer additionally checks the property at LEN=16 in the same CI run. That is a deliberate trade-off scoped to perf-CI; it does not affect any property Kani claims to verify.
  • Lowering unwind does not weaken any check either. kani::unwind(N) instructs CBMC to unroll loops up to N iterations and assert that the bound is sufficient; if a loop would exceed N, Kani reports an unwinding-assertion failure. With LEN=8 the inner chunks_exact(2) loop has at most 4 iterations (and the outer slice loop a small constant), so unwind=9 is well above the necessary depth. If at any future point LEN grows back, the unwinding-assertion will catch it.
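A toy model of the unwinding-assertion safety net (this models, rather than invokes, kani::unwind semantics; the helper name is invented):

```rust
// If a loop needs more iterations than the unwind bound allows, the
// unwinding assertion fails instead of silently under-approximating.
fn unwind_bound_sufficient(iterations: usize, unwind: usize) -> bool {
    iterations < unwind // one unrolled step is spent on the exit check
}

fn main() {
    let unwind = 9;
    // LEN = 8: the inner chunks_exact(2) loop (4 iterations) and the
    // outer byte walk (8 iterations) both fit under the bound.
    assert!(unwind_bound_sufficient(8 / 2, unwind));
    assert!(unwind_bound_sufficient(8, unwind));
    // If LEN grew back to 16, the 16-iteration walk would trip the
    // unwinding assertion rather than verify vacuously.
    assert!(!unwind_bound_sufficient(16, unwind));
}
```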

Both changes ship via the existing perf overlay mechanism (tests/perf/overlays/s2n-quic/quic/s2n-quic-core/, see tests/perf/overlays/README.md). The upstream s2n-quic source is untouched.

What this PR is not doing

  • We are not disabling any soundness check. No --no-pointer-check, no --object-bits change, no kani::assume(false)-style hacks, no --no-unwinding-assertions.
  • We are not removing the harness from CI.
  • We are not rewriting user code. The s2n-quic submodule is untouched.

Defense in depth

Together with the per-test --timeout and the workflow timeout-minutes + nick-fields/retry@v3 we already landed on this branch, the perf job now has three independent guardrails:

  1. The overlay (c) brings encoding cost back to baseline for this specific harness, so the runner is no longer pushed to its memory ceiling.
  2. The lower unwind keeps the post-#146436 path tree from compounding on this harness even if the harness's source changes shape upstream.
  3. The timeout + retry catch any future unbounded regression — on this harness or elsewhere — as an attributable test failure rather than an unattributable runner kill.

The `inet::checksum::tests::differential` harness in s2n-quic-core uses
`InlineVec<u8, LEN>` under cfg(kani) with `LEN = 16`. On the
nightly-2025-12-04 toolchain (rust-lang/rust#146436, "Slice iter
cleanup"), this harness's symex/SSA cost grew enough that peak RSS
exceeds the 16 GB GH-hosted runner ceiling, producing the "runner has
received a shutdown signal" (exit 143) failure mode.

Drop LEN to 8 via the existing perf overlay mechanism, which copies
files from `tests/perf/overlays/s2n-quic/` into the s2n-quic submodule
before the perf suite runs (see `tests/perf/overlays/README.md`). The
upstream s2n-quic source remains untouched; only the verification-time
state space shrinks.

This does not affect Kani's soundness guarantee on the bounded check
that still runs: Kani still proves the property for all
`InlineVec<u8, 8>` inputs. The trade-off is verification breadth on
this specific harness; the property under check is unchanged.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
The `inet::checksum::tests::differential` harness ships with
`kani::unwind(17)`, sized for the upstream `LEN = 16`. After lowering
LEN to 8 in the perf overlay (commit 232eb2a), unwind=9 (LEN + 1) is
the smallest sufficient bound to fully unroll the inner
`chunks_exact(2)` loop and the surrounding slice walk; carrying the
upstream value of 17 multiplies CBMC's symex cost on the post-#146436
path tree for no additional verification benefit.

This restores baseline perf on this harness when run against
nightly-2025-12-04 (rust-lang/rust#146436):

  Wall:    18 min (killed) -> 131 s   (vs 134 s on nightly-2025-12-03)
  RSS:     6.42 GB         -> 3.18 GB (vs 2.81 GB on nightly-2025-12-03)
  Result:  0 of 5 verified -> 5 of 5 verified

Soundness is preserved: `kani::unwind(N)` instructs CBMC to unroll
loops up to N iterations and assert the bound is sufficient. With LEN=8
the inner `chunks_exact(2)` loop has at most 4 iterations, so unwind=9
is well above the necessary depth. If LEN ever grows back, the
unwinding-assertion will catch it.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
@github-actions github-actions Bot added Z-EndToEndBenchCI Tag a PR to run benchmark CI Z-CompilerBenchCI Tag a PR to run benchmark CI labels May 14, 2026
The perf overlay mechanism (see `tests/perf/overlays/README.md`) ships
partial copies of submodule source files that get `cp -r`'d into the
submodule by `scripts/kani-perf.sh` before the perf suite runs. Those
overlay files reference `mod` declarations (e.g. `mod x86;`) whose
sibling files only exist in the submodule, so rustfmt cannot
standalone-parse them and fails with:

  Error writing files: failed to resolve mod `x86`: \
  tests/perf/overlays/s2n-quic/quic/s2n-quic-core/src/inet/x86.rs \
  does not exist

The existing IGNORE only excluded the submodule itself
(`*/perf/s2n-quic/*`); extend it to also exclude `*/perf/overlays/*`.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
The CI copyright check runs `./scripts/ci/run-copyright-check.sh` over
all tracked source files, including the perf overlay. Our overlay copy
of `inet/checksum.rs` started with the upstream s2n-quic header
(Amazon.com Apache-2.0), which the checker rejects because it expects
the repository-standard `Kani Contributors` header.

Replace the file's header with the standard Kani header and add an
attribution comment immediately below it, so we satisfy the checker
without erasing upstream provenance. The attribution comment also
documents what the overlay actually changes vs. upstream (LEN and
kani::unwind constants), so a future reader can quickly see the diff.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
The previous 1800s (30 min) per-test bound was set when only one
harness was suspected to be slow. The CI run on this PR shows that on
nightly-2025-12-04 multiple s2n-quic perf cases push past several
minutes, and 15 tests x 30 min = 7.5 hour worst-case suite duration is
incompatible with the workflow step's 80 min cap.

Drop the default to 600s (10 min). Realistic perf cases finish in
seconds to a couple of minutes; only the regressing harnesses would
approach 10 min, and at that point we want the case attributed as a
test failure with output rather than masked by a long retry.

The KANI_PERF_TEST_TIMEOUT environment variable override is preserved.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
The `dump_dot` / `dump_all` / `dump_reason` helpers in
`kani_middle::reachability::CallGraph` are only ever called from a
`#[cfg(debug_assertions)]` block (line 61 of the same file). On
release builds (`cargo build-dev -- --release`) the call site is
elided, so the methods become unused and trip `dead_code`:

  warning: methods `dump_dot`, `dump_all`, and `dump_reason`
  are never used

Gate the methods (and the imports they use) on the same
`#[cfg(debug_assertions)]` to match the call-site gate. Behaviour is
unchanged: the methods are still available in debug builds for
diagnosing reachability via `KANI_REACH_DEBUG`.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>
The previous step `timeout_minutes: 80` left too little headroom over the
~45 min baseline run on `main` (reference run on 2026-05-13 finished
in 2675 s). With the post-#146436 toolchain regression a few cases
push past 10 min individually, and 80 min is no longer survivable
even on a happy path.

- step `timeout_minutes: 80 -> 180`. With the per-test 600s ceiling
  in scripts/kani-perf.sh, worst-case suite duration is bounded at
  15 x 600s = 150 min. 180 min gives ~4x headroom over the baseline.
- job `timeout-minutes: 90 -> 380`. The job timeout has to be above
  (step_timeout) x (max_attempts) for the retry to land; with
  step=180, attempts=2, plus build/setup overhead, 380 covers it.

Signed-off-by: Felipe R. Monteiro <felisous@amazon.com>