Fix select job recovery#390
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes recovery behavior so “Skip” selections are respected when resetting failed jobs and applying resource corrections, and adds regression coverage to prevent workflow-wide resets from affecting unselected jobs.
Changes:
- Update
reset_failed_jobsto reset only the explicitly selected job IDs (instead of workflow-wide failed-job reset). - Add regression tests ensuring skipped jobs remain failed / unmodified during recover flows.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_workflow_manager.rs | Adds regression test verifying reset_failed_jobs only resets selected jobs and does not bump run_id. |
| tests/test_correct_resources.rs | Adds regression test ensuring include_jobs limits resource corrections to selected jobs only. |
| src/client/commands/recover.rs | Reworks reset_failed_jobs implementation to reset per-job status via manage_status_change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
reset_failed_jobs called the workflow-wide reset_job_status(failed_only=true), which ignored its job_ids argument and reset every failed job in the workflow. A user who chose "Skip" for some failures (e.g. unknown-cause jobs) had those jobs reset anyway, and the reported count reflected the selection rather than what the server actually reset. Reset only the requested job IDs via per-job manage_status_change(Uninitialized), mirroring `torc jobs reset-status`. The current run_id is reused (no bump), so the single run_id bump stays with the subsequent reinitialize. The returned count now reflects the jobs actually reset. Add test_reset_failed_jobs_only_resets_selected: fail three jobs, reset two, assert the third stays Failed and run_id is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The TUI multi-reset (Space/'*' to mark jobs, then reset) dropped every selected job that wasn't on the currently-loaded 250-job page: request_selected_jobs_reset intersected the selection with the current page (self.jobs) and silently discarded off-page selections. The reset itself was never the limiter -- it shells out to `jobs reset-status <ids>`, which accepts ids on any page, and selected_job_ids already persists across ]/[ paging. - request_selected_jobs_reset now resets the entire selected_job_ids set, not just the current-page intersection, and drops the obsolete "not in the current view" warning. The completed-jobs warning is best-effort (only the loaded page's statuses are in memory), hedged with "at least N" when the selection extends beyond the loaded page. - Clear selected_job_ids when the Jobs filter changes (apply_filter and clear_filter), keeping the selection scoped to the active filter so a cross-page reset only touches jobs in the current view -- mirroring the existing clear on workflow change. The UI already shows [N selected] from the full set count, so the cross-page total stays visible while paging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_recover_reset_sequence_bumps_run_id_once ran a job to Completed via execute_workflow_with_job, then called reset_failed_jobs on it. With the new recoverable_statuses allow-list, a Completed job is skipped, so reset_count==0 and reset_failed_jobs returns Err -- the test's .expect() panicked in CI. Set the job up to fail directly (run -> Running -> complete_job(Failed)) instead of completing successfully, leaving it in a recoverable state. The test only cares about run_id bookkeeping, not how the job reached a failed state. (manage_status_change rejects completion statuses, and complete_job rejects re-completing an already-Completed job, so the job must fail from the start.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. Non-interactive duplicate retries: a failed job whose only violation was CPU over-utilization was classified as an unknown-cause failure (it is neither OOM nor timeout) while ALSO being corrected and retried via apply_resource_corrections. With --retry-unknown it entered jobs_to_retry twice; the second reset attempt was skipped with a spurious "not recoverable" warning and other_failures/unknown_retried were inflated. "Unknown" now means no correctable violation at all, and jobs_to_retry is deduplicated as defense in depth (the wizard path already deduped). 2. Wizard CPU category: CPU-only violators were displayed under "Unknown Failures" and retried without a CPU correction, inconsistent with the non-interactive path. They now get their own category, table (allocated CPUs vs peak CPU%), retry/skip prompt, and a CPUs line in the recovery plan. 3. Canceled-job messaging: a workflow whose only recoverable jobs are canceled passes check_recovery_preconditions but never yields jobs_to_retry (diagnosis is built from results; canceled jobs usually have none), producing a misleading "No recoverable jobs found. 0 job(s) failed with unknown causes." Both the non-interactive error and the wizard's early return now point at 'torc jobs reset-status --reinit' when canceled jobs exist. 4. EOF infinite loop: prompt_line treated EOF (Ctrl-D) as an empty line, so the "Enter scheduler ID" loop — the only prompt without an empty-input default — re-prompted forever on a closed stdin. prompt_line now errors on EOF, aborting the wizard cleanly from any prompt. Adds test_recovery_cpu_violation_not_counted_as_unknown covering fix 1: a CPU-only violator is retried exactly once and not counted as unknown, while a genuinely unknown failure (no resource metrics) still is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
634763d to
31a7a89
Compare
- Wizard categorization: also route likely_runtime_violation to the timeout bucket. It implies likely_timeout today (exec > 100% of runtime implies exec > 90%), but checking it explicitly keeps the wizard aligned with the non-interactive path if the diagnosis thresholds ever drift apart. - reset_failed_jobs: deduplicate job_ids before the per-job loop so a repeated ID can't burn an extra get_job/manage_status_change round-trip and emit a misleading "not recoverable" skip for its duplicate. - TUI: build a HashSet of loaded job IDs for the all_loaded check instead of an O(selected * loaded_page) nested scan. - tests: use manager.get_run_id() instead of get_workflow(...).run_id .unwrap_or(1), which saved an API call and could mask unexpected run_id states; applied to both call sites for consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| let mut msg = format!( | ||
| "No recoverable jobs found. {} job(s) failed with unknown causes. \ | ||
| Use --retry-unknown to retry jobs with unknown failure causes.", | ||
| result.other_failures | ||
| )); | ||
| ); |
There was a problem hiding this comment.
Good catch — fixed in b0f59cf as "No auto-recoverable jobs found". Also applied the same wording to the dry-run line, the CLI summaries in main.rs, and watch's variant of the message so the phrasing is consistent everywhere this state is reported.
- "No recoverable jobs found" -> "No auto-recoverable jobs found": canceled/ terminated/pending_failed jobs ARE recoverable by reset_failed_jobs; what the message actually reports is that the automatic heuristics selected nothing. Applied to the recover error, the dry-run line, the CLI summaries, and watch's variant of the same message. - "Reset N failed job(s)" -> "Reset N job(s)": the reset allow-list also admits terminated, canceled, and pending_failed jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No description provided.