rsz: drop stale STA search state after repair_timing (#10210)#10638
Draft
minjukim55 wants to merge 1 commit into
Draft
rsz: drop stale STA search state after repair_timing (#10210)#10638minjukim55 wants to merge 1 commit into
minjukim55 wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request implements a workaround for issue #10210, where incremental repairs can leave a stale CRPR clock path that causes subsequent reports to crash. It introduces the resetSearchAfterRepair() method to clear the interned STA search state and recompute timing, integrating this mitigation into the Tcl repair_timing flow and adding corresponding C++ test coverage. Feedback suggests using the inherited member variable search_ directly instead of calling sta_->search() for consistency with the rest of the Resizer class.
| // analysis rebuilds clean; graph/parasitics/delays are kept, only arrivals | ||
| // are recomputed. Proper fix: findCrpr should re-resolve clock-path nodes | ||
| // from stable graph ids, not raw prev_path_. TODO(#10210): remove when fixed. | ||
| sta_->search()->clear(); |
Contributor
…oject#10210) repair_timing's incremental netlist/STA updates free and recycle the per-vertex Path arena. A CRPR clock path cached in an interned ClkInfo/Tag keeps a raw prev_path_ into a recycled slot, so a later full-path analysis (report_metrics -> Search::findPathEnds -> CheckCrpr::findCrpr) walks a dangling prev-path chain and aborts on an out-of-bounds Edge lookup. This is the CRPR consumer that PR The-OpenROAD-Project#3343 did not cover. Reset the interned search state and recompute timing at the end of the repair_timing command so no stale clock path survives into the next analysis. Graph, parasitics and arc delays are preserved; only arrivals are recomputed. This is an OpenROAD-side mitigation; the root fix belongs in OpenSTA. Extend the existing TestDbSta.StalePrevPath unit test to also cover the reset. Signed-off-by: Minju Kim <mkim@precisioninno.com>
573d8b6 to
4d9a719
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
On some designs
repair_timingmakes a laterreport_metricsabortdeterministically (public issue #10210). Root cause is in OpenSTA: a CRPR
clock path cached in an interned
ClkInfo/Tagkeeps a rawprev_path_into the per-vertex
Patharena, whichrepair_timing's incremental updatesfree and recycle — leaving a dangling pointer that
CheckCrpr::findCrprlaterwalks and dereferences out of bounds. PR #3343 fixed four rsz move consumers
via
prevArc(); CRPR walks the prev-path chain, so it is the remainingunfixed consumer (the single-hop
prevArctrick does not apply).This PR is an OpenROAD-side mitigation — the root fix belongs in OpenSTA.
At the end of the
repair_timingcommand, once a repair has run, it drops theinterned search state so nothing stale survives into the next analysis.
TODO(#10210)marks it for removal once OpenSTA stops dangling persisted CRPRpaths.
Mechanism
A cached pointer outlives the memory slot it points at:
flowchart TD A["Step 1 — repair_timing caches a clock-path pointer to slot A (valid)"] B["Step 2 — repair frees slot A, then reuses it for a different path"] C["Step 3 — the cached pointer still points at slot A, now garbage (dangling)"] D["Step 4 — report_metrics follows the pointer, reads garbage, crashes"] A --> B --> C --> DIn code: the cached path lives in an interned
ClkInfo/Tagthat persistsacross updates. During
repair_timing,Search::setVertexArrivalscallsVertex::deletePaths()+makePaths(), which frees and recycles the slot thecached
prev_path_points at — butprev_path_is never fixed up. A laterCheckCrpr::findCrprwalks that chain andPath::vertex()indexesObjectTable<Edge>out of bounds.Crash stack
flowchart TD A["report_metrics / report_checks"] --> B["Search::findPathEnds"] B --> C["latch endpoint → Latches::latchBorrowInfo"] C --> D["CheckCrpr::checkCrpr → findCrpr"] D -->|walk prev_path_ chain| E["Path::vertex()"] E --> F["graph edge(prev_edge_id_) — stale id"] F --> G["ObjectTable::pointer → blocks_[blk_idx]"] G -->|blk_idx out of range| H["abort (SIGABRT / SIGSEGV)"]Fix
Resizer::resetSearchAfterRepair()=Search::clear()(drop internedtags / paths / path groups) +
updateTiming(full), called at the end of therepair_timingcommand when a repair actually happened. Graph, parasitics andarc delays are preserved; only arrivals are recomputed. rsz cannot surgically
purge just the stale clock paths, so it drops all interned search state and lets
the next analysis rebuild from scratch.
Type of Change
Impact
report_metrics/report_checksafterrepair_timingon affected designs.current netlist; it only discards stale interned search state.
Runtime impact (measured)
The reset adds one from-scratch arrival recompute per
repair_timinginvocation that actually repaired.
Search::clear()itself is negligible; thecost is arrival re-propagation — graph, parasitics and arc delays are preserved,
so delay calculation is not re-run. Measured on a customer design
(188,731 instances) at global route:
resetSearchAfterRepair(), 2 trialsfind_timing(delays + arrivals)estimate_parasitics -global_routingrepair_timingcommand total (reset included)So the reset is ~0.45% of the
repair_timingcommand and ~0.1% of theglobal-route step. It fired once here (recover_power was a no-op). Across a full
flow
repair_timingruns at a few stages (post-place / CTS / global route), sothe cumulative cost is a handful of seconds — negligible against repair and
routing runtime. A clean with/without A/B is not possible: the unfixed binary
aborts in
report_metricsbefore the step completes.Verification
(
report_metricsafterrepair_timing); with this change the step completescleanly —
findPathEndsenumerates all endpoints with no crash.TestDbSta.StalePrevPathextended: it reproduces the free+recyclestaleness and then asserts
resetSearchAfterRepair()rebuilds a clean, fullywalkable search (every prev-path node's vertex resolves).
rszregression suite: 85/85 pass.clang-format/tclint/tclfmtclean.Related Issues
prevArc()