Summary
Foreman has no detection or backoff for provider rate limits. A transient 429 / session-limit response is treated as an ordinary run error, so it consumes features instead of pausing and resuming. This is an unattended-safety gap.
Evidence (dogfood campaign)
- At ~23:09, runs began returning api_error_status: 429 / "You've hit your session limit · resets 3:40am" at $0 cost.
- Foreman treated them as ordinary run errors and burned F3's grill and the entire F4 feature as "failed".
- It also corrupted the metrics (F4 looks like a grill failure when it was purely environmental).
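The reset hint in the message above ("resets 3:40am") could in principle tell Foreman how long to pause. A minimal sketch, assuming the hint always has the shape seen in this one observed message and that the time is in the same local timezone as the caller's clock; seconds_until_reset and RESET_RE are hypothetical names, not existing Foreman code:

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the hint looks like "resets 3:40am" (minutes optional), as in
# the single message observed during the dogfood campaign.
RESET_RE = re.compile(r"resets\s+(\d{1,2})(?::([0-5]\d))?\s*(am|pm)", re.IGNORECASE)


def seconds_until_reset(message: str, now: datetime) -> Optional[float]:
    """Return seconds from `now` until the next occurrence of the reset time.

    Returns None when the message carries no recognizable reset hint, so the
    caller can fall back to ordinary backoff.
    """
    match = RESET_RE.search(message)
    if match is None:
        return None
    hour = int(match.group(1)) % 12  # 12am -> 0, 12pm -> 0 (+12 below)
    if match.group(3).lower() == "pm":
        hour += 12
    minute = int(match.group(2) or 0)
    reset = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if reset <= now:
        # The wall-clock time has already passed today, so it means tomorrow.
        reset += timedelta(days=1)
    return (reset - now).total_seconds()
```

With a clock reading 23:09 and the message from the campaign, this yields a wait of 4 h 31 min (16260 seconds). Whether the provider's hint is reliable enough to sleep on is an open question; treating it as an upper bound and still probing with backoff may be safer.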
Why it matters
For an unattended run, a transient quota wall should pause and resume, not consume features. Misclassifying quota walls as failures also poisons the campaign metrics and the flywheel's failure clustering.
Proposed fix
- Detect api_error_status == 429 / session-limit in the stream result.
- Pause-and-retry with backoff (or park the feature) rather than counting it as a failed attempt.
- Surface a distinct blocked:rate_limit state, separate from genuine failures.
- (Phase-3 sandbox/chaos would harden this further — noted as a Phase-3 dependency.)
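The first three bullets could be sketched roughly as follows. This is an illustration under stated assumptions, not Foreman's actual code: only api_error_status and the blocked:rate_limit label come from this issue; the is_error and error_message fields, the Outcome, Attempts, classify, backoff_delay, and record names, and the default delays are all hypothetical.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"
    BLOCKED_RATE_LIMIT = "blocked:rate_limit"


def classify(result: dict) -> Outcome:
    """Separate quota walls from genuine run errors.

    Assumption: `result` is the parsed stream result; `is_error` and
    `error_message` are hypothetical field names.
    """
    message = str(result.get("error_message") or "").lower()
    if result.get("api_error_status") == 429 or "session limit" in message:
        return Outcome.BLOCKED_RATE_LIMIT
    if result.get("is_error"):
        return Outcome.FAILED
    return Outcome.OK


def backoff_delay(retry: int, base: float = 30.0, cap: float = 3600.0) -> float:
    """Exponential backoff with full jitter, in seconds (defaults are guesses)."""
    return random.uniform(0, min(cap, base * (2 ** retry)))


@dataclass
class Attempts:
    failed: int = 0   # counts against the feature's attempt budget
    blocked: int = 0  # environmental; tracked separately for metrics


def record(outcome: Outcome, attempts: Attempts) -> None:
    """Count an outcome without letting rate limits consume failed attempts."""
    if outcome is Outcome.FAILED:
        attempts.failed += 1
    elif outcome is Outcome.BLOCKED_RATE_LIMIT:
        attempts.blocked += 1
```

The rate-limit check runs before the generic error check on purpose: the 429 results seen in the campaign were also errors, so testing is_error first would reproduce the current misclassification. A caller would sleep for backoff_delay(attempts.blocked) and retry the same feature, or park it, whenever classify returns BLOCKED_RATE_LIMIT.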
Acceptance criteria
- A 429/session-limit result is recognized distinctly from a genuine run error.
- The blocked:rate_limit state is visible in the TUI and not counted as a feature failure.
Source: dogfood/ITERATION_REPORT.md MAJOR-6; dogfood/METRICS.md F4 note.