Fix select job recovery by daniel-thom · Pull Request #390 · NatLabRockies/torc

daniel-thom · 2026-06-11T20:13:31Z

No description provided.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes recovery behavior so “Skip” selections are respected when resetting failed jobs and applying resource corrections, and adds regression coverage to prevent workflow-wide resets from affecting unselected jobs.

Changes:

Update reset_failed_jobs to reset only the explicitly selected job IDs (instead of workflow-wide failed-job reset).
Add regression tests ensuring skipped jobs remain failed / unmodified during recover flows.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
tests/test_workflow_manager.rs	Adds regression test verifying `reset_failed_jobs` only resets selected jobs and does not bump `run_id`.
tests/test_correct_resources.rs	Adds regression test ensuring `include_jobs` limits resource corrections to selected jobs only.
src/client/commands/recover.rs	Reworks `reset_failed_jobs` implementation to reset per-job status via `manage_status_change`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

reset_failed_jobs called the workflow-wide reset_job_status(failed_only=true), which ignored its job_ids argument and reset every failed job in the workflow. A user who chose "Skip" for some failures (e.g. unknown-cause jobs) had those jobs reset anyway, and the reported count reflected the selection rather than what the server actually reset. Reset only the requested job IDs via per-job manage_status_change(Uninitialized), mirroring `torc jobs reset-status`. The current run_id is reused (no bump), so the single run_id bump stays with the subsequent reinitialize. The returned count now reflects the jobs actually reset. Add test_reset_failed_jobs_only_resets_selected: fail three jobs, reset two, assert the third stays Failed and run_id is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The TUI multi-reset (Space/'*' to mark jobs, then reset) dropped every selected job that wasn't on the currently-loaded 250-job page: request_selected_jobs_reset intersected the selection with the current page (self.jobs) and silently discarded off-page selections. The reset itself was never the limiter -- it shells out to `jobs reset-status <ids>`, which accepts ids on any page, and selected_job_ids already persists across ]/[ paging. - request_selected_jobs_reset now resets the entire selected_job_ids set, not just the current-page intersection, and drops the obsolete "not in the current view" warning. The completed-jobs warning is best-effort (only the loaded page's statuses are in memory), hedged with "at least N" when the selection extends beyond the loaded page. - Clear selected_job_ids when the Jobs filter changes (apply_filter and clear_filter), keeping the selection scoped to the active filter so a cross-page reset only touches jobs in the current view -- mirroring the existing clear on workflow change. The UI already shows [N selected] from the full set count, so the cross-page total stays visible while paging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test_recover_reset_sequence_bumps_run_id_once ran a job to Completed via execute_workflow_with_job, then called reset_failed_jobs on it. With the new recoverable_statuses allow-list, a Completed job is skipped, so reset_count==0 and reset_failed_jobs returns Err -- the test's .expect() panicked in CI. Set the job up to fail directly (run -> Running -> complete_job(Failed)) instead of completing successfully, leaving it in a recoverable state. The test only cares about run_id bookkeeping, not how the job reached a failed state. (manage_status_change rejects completion statuses, and complete_job rejects re-completing an already-Completed job, so the job must fail from the start.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1. Non-interactive duplicate retries: a failed job whose only violation was CPU over-utilization was classified as an unknown-cause failure (it is neither OOM nor timeout) while ALSO being corrected and retried via apply_resource_corrections. With --retry-unknown it entered jobs_to_retry twice; the second reset attempt was skipped with a spurious "not recoverable" warning and other_failures/unknown_retried were inflated. "Unknown" now means no correctable violation at all, and jobs_to_retry is deduplicated as defense in depth (the wizard path already deduped). 2. Wizard CPU category: CPU-only violators were displayed under "Unknown Failures" and retried without a CPU correction, inconsistent with the non-interactive path. They now get their own category, table (allocated CPUs vs peak CPU%), retry/skip prompt, and a CPUs line in the recovery plan. 3. Canceled-job messaging: a workflow whose only recoverable jobs are canceled passes check_recovery_preconditions but never yields jobs_to_retry (diagnosis is built from results; canceled jobs usually have none), producing a misleading "No recoverable jobs found. 0 job(s) failed with unknown causes." Both the non-interactive error and the wizard's early return now point at 'torc jobs reset-status --reinit' when canceled jobs exist. 4. EOF infinite loop: prompt_line treated EOF (Ctrl-D) as an empty line, so the "Enter scheduler ID" loop — the only prompt without an empty-input default — re-prompted forever on a closed stdin. prompt_line now errors on EOF, aborting the wizard cleanly from any prompt. Adds test_recovery_cpu_violation_not_counted_as_unknown covering fix 1: a CPU-only violator is retried exactly once and not counted as unknown, while a genuinely unknown failure (no resource metrics) still is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

- Wizard categorization: also route likely_runtime_violation to the timeout bucket. It implies likely_timeout today (exec > 100% of runtime implies exec > 90%), but checking it explicitly keeps the wizard aligned with the non-interactive path if the diagnosis thresholds ever drift apart. - reset_failed_jobs: deduplicate job_ids before the per-job loop so a repeated ID can't burn an extra get_job/manage_status_change round-trip and emit a misleading "not recoverable" skip for its duplicate. - TUI: build a HashSet of loaded job IDs for the all_loaded check instead of an O(selected * loaded_page) nested scan. - tests: use manager.get_run_id() instead of get_workflow(...).run_id .unwrap_or(1), which saved an API call and could mask unexpected run_id states; applied to both call sites for consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

daniel-thom · 2026-06-12T22:18:42Z

+        let mut msg = format!(
            "No recoverable jobs found. {} job(s) failed with unknown causes. \
             Use --retry-unknown to retry jobs with unknown failure causes.",
            result.other_failures
-        ));
+        );


Good catch — fixed in b0f59cf as "No auto-recoverable jobs found". Also applied the same wording to the dry-run line, the CLI summaries in main.rs, and watch's variant of the message so the phrasing is consistent everywhere this state is reported.

- "No recoverable jobs found" -> "No auto-recoverable jobs found": canceled/ terminated/pending_failed jobs ARE recoverable by reset_failed_jobs; what the message actually reports is that the automatic heuristics selected nothing. Applied to the recover error, the dry-run line, the CLI summaries, and watch's variant of the same message. - "Reset N failed job(s)" -> "Reset N job(s)": the reset allow-list also admits terminated, canceled, and pending_failed jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daniel-thom requested a review from Copilot June 11, 2026 20:13

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread src/client/commands/recover.rs

Comment thread src/client/commands/recover.rs

Comment thread src/client/commands/recover.rs

daniel-thom requested a review from Copilot June 11, 2026 21:08

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread src/client/commands/recover.rs

Comment thread src/client/commands/recover.rs Outdated

Comment thread tests/test_workflow_manager.rs Outdated

daniel-thom requested a review from Copilot June 11, 2026 21:28

Copilot AI reviewed Jun 11, 2026

View reviewed changes

daniel-thom and others added 4 commits June 11, 2026 22:33

daniel-thom force-pushed the daniel-thom/fix-recover-selective-reset branch from 634763d to 31a7a89 Compare June 12, 2026 20:51

daniel-thom requested a review from Copilot June 12, 2026 20:52

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread src/client/commands/recover.rs

Comment thread src/client/commands/recover.rs Outdated

Comment thread src/tui/app.rs Outdated

Comment thread tests/test_workflow_manager.rs Outdated

Copilot started reviewing on behalf of daniel-thom June 12, 2026 21:04 View session

daniel-thom requested a review from Copilot June 12, 2026 21:58

Copilot started reviewing on behalf of daniel-thom June 12, 2026 21:59 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

daniel-thom merged commit 5d92a06 into NatLabRockies:main Jun 12, 2026
4 checks passed

daniel-thom deleted the daniel-thom/fix-recover-selective-reset branch June 12, 2026 23:23

Conversation

daniel-thom commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

daniel-thom Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants