
rsz: drop stale STA search state after repair_timing (#10210) #10638

Draft
minjukim55 wants to merge 1 commit into
The-OpenROAD-Project:master from
The-OpenROAD-Project-staging:secure-rsz-crpr-stale-clear-10210


@minjukim55

Contributor

Summary

On some designs repair_timing makes a later report_metrics abort
deterministically (public issue #10210). Root cause is in OpenSTA: a CRPR
clock path cached in an interned ClkInfo/Tag keeps a raw prev_path_
into the per-vertex Path arena, which repair_timing's incremental updates
free and recycle — leaving a dangling pointer that CheckCrpr::findCrpr later
walks and dereferences out of bounds. PR #3343 fixed four rsz move consumers
via prevArc(); CRPR walks the prev-path chain, so it is the remaining
unfixed consumer (the single-hop prevArc trick does not apply).

This PR is an OpenROAD-side mitigation — the root fix belongs in OpenSTA.
At the end of the repair_timing command, once a repair has run, it drops the
interned search state so nothing stale survives into the next analysis.
TODO(#10210) marks it for removal once OpenSTA stops leaving persisted CRPR
paths dangling.

Mechanism

A cached pointer outlives the memory slot it points at:

flowchart TD
    A["Step 1 — repair_timing caches a clock-path pointer to slot A (valid)"]
    B["Step 2 — repair frees slot A, then reuses it for a different path"]
    C["Step 3 — the cached pointer still points at slot A, now garbage (dangling)"]
    D["Step 4 — report_metrics follows the pointer, reads garbage, crashes"]
    A --> B --> C --> D

In code: the cached path lives in an interned ClkInfo/Tag that persists
across updates. During repair_timing, Search::setVertexArrivals calls
Vertex::deletePaths() + makePaths(), which frees and recycles the slot the
cached prev_path_ points at — but prev_path_ is never fixed up. A later
CheckCrpr::findCrpr walks that chain and Path::vertex() indexes
ObjectTable<Edge> out of bounds.
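
The free-and-recycle hazard can be sketched with a toy arena. This is not OpenSTA code: ToyArena, ToyPath, CachedRef and demoStaleReference are hypothetical names, and the storage is deliberately kept alive so the demo stays well defined (in the real failure the slot is freed, so reading through the pointer is undefined behavior). The sketch also shows one common way to detect staleness, a per-slot generation counter, which is an illustration and not what this PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a per-vertex path arena: fixed storage whose slots are
// reused in place. Every name here is hypothetical.
struct ToyPath {
  std::string label;
  unsigned generation = 0;  // bumped every time the slot is recycled
};

class ToyArena {
 public:
  explicit ToyArena(std::size_t n) : slots_(n) {}

  // Recycle a slot in place: the address is unchanged, the content is not.
  ToyPath* recycle(std::size_t idx, const std::string& label) {
    slots_[idx].label = label;
    ++slots_[idx].generation;
    return &slots_[idx];
  }

  const ToyPath& at(std::size_t idx) const { return slots_[idx]; }

 private:
  std::vector<ToyPath> slots_;  // never resized, so addresses stay stable
};

// A cached reference that remembers which generation it was taken from.
struct CachedRef {
  std::size_t idx;
  unsigned generation;
};

inline bool isStale(const ToyArena& arena, const CachedRef& ref) {
  return arena.at(ref.idx).generation != ref.generation;
}

// What each kind of reference observes after the slot has been recycled.
struct Observation {
  std::string seen_through_raw_pointer;
  bool handle_detects_staleness;
};

inline Observation demoStaleReference() {
  ToyArena arena(4);
  ToyPath* raw = arena.recycle(0, "clk path");  // step 1: cache a pointer
  CachedRef ref{0, raw->generation};
  arena.recycle(0, "unrelated data path");      // step 2: slot is reused
  // Steps 3 and 4: the raw pointer silently reads the new occupant, while
  // the generation-checked handle can tell that it is out of date.
  return {raw->label, isStale(arena, ref)};
}
```

Calling demoStaleReference() returns the label of the new occupant through the raw pointer, which is the toy analogue of CheckCrpr::findCrpr reading a recycled Path.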

Crash stack

flowchart TD
    A["report_metrics / report_checks"] --> B["Search::findPathEnds"]
    B --> C["latch endpoint → Latches::latchBorrowInfo"]
    C --> D["CheckCrpr::checkCrpr → findCrpr"]
    D -->|walk prev_path_ chain| E["Path::vertex()"]
    E --> F["graph edge(prev_edge_id_) — stale id"]
    F --> G["ObjectTable::pointer → blocks_[blk_idx]"]
    G -->|blk_idx out of range| H["abort (SIGABRT / SIGSEGV)"]
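
The last two hops of the stack (an id split into a block index and an offset, with the block index landing out of range) can be illustrated with a small block-structured table. ToyBlockTable is a hypothetical model, not OpenSTA's ObjectTable; the block size, the id encoding and the checked find() are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block-structured table, loosely modelled on an
// "id -> block index + offset" lookup. Not OpenSTA code.
class ToyBlockTable {
 public:
  static constexpr std::uint32_t kIdxBits = 4;  // 16 objects per block
  static constexpr std::uint32_t kBlockSize = 1u << kIdxBits;
  static constexpr std::uint32_t kIdxMask = kBlockSize - 1;

  // Append one object and return its id (ids start at 1; 0 means "null").
  std::uint32_t make(int value) {
    if (count_ % kBlockSize == 0) {
      blocks_.emplace_back(kBlockSize, 0);  // start a new block
    }
    blocks_[count_ >> kIdxBits][count_ & kIdxMask] = value;
    return ++count_;
  }

  // Bounds-checked lookup: returns nullptr instead of indexing past blocks_.
  const int* find(std::uint32_t id) const {
    if (id == 0 || id > count_) {
      return nullptr;
    }
    const std::uint32_t slot = id - 1;
    return &blocks_[slot >> kIdxBits][slot & kIdxMask];
  }

  // Block index that a non-zero id would select in an unchecked lookup.
  static std::size_t blockIndex(std::uint32_t id) {
    return (id - 1) >> kIdxBits;
  }

  std::size_t blockCount() const { return blocks_.size(); }

 private:
  std::vector<std::vector<int>> blocks_;
  std::uint32_t count_ = 0;
};
```

With 20 objects the table holds two blocks, so a garbage id read from a recycled slot selects a block index far beyond blockCount(). An unchecked lookup would index out of range there, whereas find() returns nullptr.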

Fix

Resizer::resetSearchAfterRepair() = Search::clear() (drop interned
tags / paths / path groups) + updateTiming(full), called at the end of the
repair_timing command when a repair actually happened. Graph, parasitics and
arc delays are preserved; only arrivals are recomputed. rsz cannot surgically
purge just the stale clock paths, so it drops all interned search state and lets
the next analysis rebuild from scratch.
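
The shape of the reset (drop all derived search state, keep the graph and delays, recompute arrivals, and only do so when a repair ran) can be sketched on a toy model. ToySearch, ToyEdge and resetSearchAfterRepairSketch are hypothetical names; this is an illustration of the control flow under simplifying assumptions, not the code in Resizer.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy timing graph edge with a fixed delay. Assumption: vertices are numbered
// in topological order (from < to) and edges are sorted by 'from'.
struct ToyEdge {
  std::size_t from;
  std::size_t to;
  double delay;
};

class ToySearch {
 public:
  ToySearch(std::size_t vertex_count, std::vector<ToyEdge> edges)
      : vertex_count_(vertex_count), edges_(std::move(edges)) {}

  // Stand-in for interned state that persists across incremental updates.
  void internTag(const std::string& name) { interned_tags_.push_back(name); }
  std::size_t tagCount() const { return interned_tags_.size(); }

  // Drop derived search state; the graph and its delays are untouched.
  void clear() {
    arrivals_.clear();
    interned_tags_.clear();
  }

  // Recompute every arrival from scratch (longest path from vertex 0).
  void updateTimingFull() {
    arrivals_.assign(vertex_count_, 0.0);
    for (const ToyEdge& e : edges_) {
      const double candidate = arrivals_[e.from] + e.delay;
      if (candidate > arrivals_[e.to]) {
        arrivals_[e.to] = candidate;
      }
    }
  }

  const std::vector<double>& arrivals() const { return arrivals_; }

 private:
  std::size_t vertex_count_;
  std::vector<ToyEdge> edges_;  // preserved across clear()
  std::vector<double> arrivals_;
  std::vector<std::string> interned_tags_;
};

// Reset only when a repair actually happened; returns whether it ran.
inline bool resetSearchAfterRepairSketch(ToySearch& search, bool repaired) {
  if (!repaired) {
    return false;
  }
  search.clear();
  search.updateTimingFull();
  return true;
}
```

Because the edges and delays survive clear(), the recomputed arrivals match the ones computed before the reset, while the interned tags are gone.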

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Impact

  • Prevents a deterministic abort in report_metrics/report_checks after
    repair_timing on affected designs.
  • No change to timing results: the reset rebuilds the same arrivals from the
    current netlist; it only discards stale interned search state.

Runtime impact (measured)

The reset adds one from-scratch arrival recompute per repair_timing
invocation that actually repaired. Search::clear() itself is negligible; the
cost is arrival re-propagation — graph, parasitics and arc delays are preserved,
so delay calculation is not re-run. Measured on a customer design
(188,731 instances) at global route:

| Measurement | Time |
| --- | --- |
| resetSearchAfterRepair(), 2 trials | 1.21 s / 1.21 s |
| ref: cold find_timing (delays + arrivals) | 5 s |
| ref: estimate_parasitics -global_routing | 10 s |
| repair_timing command total (reset included) | 267 s |
| whole global-route step | 17.5 min |

So the reset is ~0.45% of the repair_timing command and ~0.1% of the
global-route step. It fired once here (recover_power was a no-op). Across a full
flow repair_timing runs at a few stages (post-place / CTS / global route), so
the cumulative cost is a handful of seconds — negligible against repair and
routing runtime. A clean with/without A/B is not possible: the unfixed binary
aborts in report_metrics before the step completes.
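
The two percentages follow directly from the measured times. A small check, using only the figures reported above (the constant and function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Share of a total runtime, in percent.
constexpr double percentOf(double part_s, double total_s) {
  return 100.0 * part_s / total_s;
}

constexpr double kResetSeconds = 1.21;               // measured reset time
constexpr double kRepairTimingSeconds = 267.0;       // repair_timing total
constexpr double kGlobalRouteSeconds = 17.5 * 60.0;  // 17.5 min = 1050 s

// 1.21 / 267  is about 0.45 % of the repair_timing command.
// 1.21 / 1050 is about 0.12 % of the global-route step.
```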

Verification

  • Reproduced the abort on a customer post-CTS design at global route
    (report_metrics after repair_timing); with this change the step completes
    cleanly — findPathEnds enumerates all endpoints with no crash.
  • Unit test TestDbSta.StalePrevPath extended: it reproduces the free+recycle
    staleness and then asserts resetSearchAfterRepair() rebuilds a clean, fully
    walkable search (every prev-path node's vertex resolves).
  • rsz regression suite: 85/85 pass.
  • clang-format / tclint / tclfmt clean.
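
The "fully walkable" property checked by the unit test can be pictured as follows. This is a minimal sketch on a hypothetical index-based chain, not the actual TestDbSta.StalePrevPath code; ToyNode, kNoPrev and chainIsWalkable are names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marks the end of a prev chain.
constexpr std::size_t kNoPrev = static_cast<std::size_t>(-1);

// Hypothetical path node: 'vertex_id' indexes a vertex table and 'prev' is
// the index of the previous node in the chain.
struct ToyNode {
  std::size_t vertex_id;
  std::size_t prev;
};

// Returns true when every node reachable from 'start' resolves to a vertex
// inside the table and the chain terminates.
inline bool chainIsWalkable(const std::vector<ToyNode>& nodes,
                            std::size_t start,
                            std::size_t vertex_count) {
  std::size_t steps = 0;
  for (std::size_t cur = start; cur != kNoPrev; cur = nodes[cur].prev) {
    if (cur >= nodes.size()) {
      return false;  // dangling node index
    }
    if (nodes[cur].vertex_id >= vertex_count) {
      return false;  // stale vertex id
    }
    if (++steps > nodes.size()) {
      return false;  // cycle guard
    }
  }
  return true;
}
```

A clean three-node chain passes, while a chain with one out-of-range vertex id or one dangling prev index is rejected before anything is dereferenced out of bounds.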

Related Issues

@minjukim55 minjukim55 self-assigned this Jun 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements a workaround for issue #10210, where incremental repairs can leave a stale CRPR clock path that causes subsequent reports to crash. It introduces the resetSearchAfterRepair() method to clear the interned STA search state and recompute timing, integrating this mitigation into the Tcl repair_timing flow and adding corresponding C++ test coverage. Feedback suggests using the inherited member variable search_ directly instead of calling sta_->search() for consistency with the rest of the Resizer class.

Comment thread src/rsz/src/Resizer.cc Outdated
// analysis rebuilds clean; graph/parasitics/delays are kept, only arrivals
// are recomputed. Proper fix: findCrpr should re-resolve clock-path nodes
// from stable graph ids, not raw prev_path_. TODO(#10210): remove when fixed.
sta_->search()->clear();

medium

For consistency and directness, consider using the inherited member variable search_ directly instead of calling sta_->search(). This aligns with how search_ is accessed elsewhere in the Resizer class (e.g., search_->arrivalsInvalid()).

  search_->clear();

…oject#10210)

repair_timing's incremental netlist/STA updates free and recycle the
per-vertex Path arena. A CRPR clock path cached in an interned ClkInfo/Tag
keeps a raw prev_path_ into a recycled slot, so a later full-path analysis
(report_metrics -> Search::findPathEnds -> CheckCrpr::findCrpr) walks a
dangling prev-path chain and aborts on an out-of-bounds Edge lookup. This is
the CRPR consumer that PR The-OpenROAD-Project#3343 did not cover.

Reset the interned search state and recompute timing at the end of the
repair_timing command so no stale clock path survives into the next analysis.
Graph, parasitics and arc delays are preserved; only arrivals are recomputed.
This is an OpenROAD-side mitigation; the root fix belongs in OpenSTA.

Extend the existing TestDbSta.StalePrevPath unit test to also cover the reset.

Signed-off-by: Minju Kim <mkim@precisioninno.com>
@minjukim55 minjukim55 force-pushed the secure-rsz-crpr-stale-clear-10210 branch from 573d8b6 to 4d9a719 Compare June 11, 2026 06:43