Skip to content

Fix select job recovery#390

Merged
daniel-thom merged 6 commits into
NatLabRockies:mainfrom
daniel-thom:daniel-thom/fix-recover-selective-reset
Jun 12, 2026
Merged

Fix select job recovery#390
daniel-thom merged 6 commits into
NatLabRockies:mainfrom
daniel-thom:daniel-thom/fix-recover-selective-reset

Conversation

@daniel-thom

Copy link
Copy Markdown
Collaborator

No description provided.

@daniel-thom daniel-thom requested a review from Copilot June 11, 2026 20:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes recovery behavior so “Skip” selections are respected when resetting failed jobs and applying resource corrections, and adds regression coverage to prevent workflow-wide resets from affecting unselected jobs.

Changes:

  • Update reset_failed_jobs to reset only the explicitly selected job IDs (instead of workflow-wide failed-job reset).
  • Add regression tests ensuring skipped jobs remain failed / unmodified during recover flows.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_workflow_manager.rs Adds regression test verifying reset_failed_jobs only resets selected jobs and does not bump run_id.
tests/test_correct_resources.rs Adds regression test ensuring include_jobs limits resource corrections to selected jobs only.
src/client/commands/recover.rs Reworks reset_failed_jobs implementation to reset per-job status via manage_status_change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs
@daniel-thom daniel-thom requested a review from Copilot June 11, 2026 21:08

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs Outdated
Comment thread tests/test_workflow_manager.rs Outdated
@daniel-thom daniel-thom requested a review from Copilot June 11, 2026 21:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs Outdated
Comment thread src/client/commands/recover.rs Outdated
Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs
Comment thread tests/test_workflow_manager.rs Outdated
daniel-thom and others added 4 commits June 11, 2026 22:33
reset_failed_jobs called the workflow-wide reset_job_status(failed_only=true),
which ignored its job_ids argument and reset every failed job in the workflow.
A user who chose "Skip" for some failures (e.g. unknown-cause jobs) had those
jobs reset anyway, and the reported count reflected the selection rather than
what the server actually reset.

Reset only the requested job IDs via per-job manage_status_change(Uninitialized),
mirroring `torc jobs reset-status`. The current run_id is reused (no bump), so
the single run_id bump stays with the subsequent reinitialize. The returned
count now reflects the jobs actually reset.

Add test_reset_failed_jobs_only_resets_selected: fail three jobs, reset two,
assert the third stays Failed and run_id is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The TUI multi-reset (Space/'*' to mark jobs, then reset) dropped every
selected job that wasn't on the currently-loaded 250-job page:
request_selected_jobs_reset intersected the selection with the current page
(self.jobs) and silently discarded off-page selections. The reset itself was
never the limiter -- it shells out to `jobs reset-status <ids>`, which accepts
ids on any page, and selected_job_ids already persists across ]/[ paging.

- request_selected_jobs_reset now resets the entire selected_job_ids set, not
  just the current-page intersection, and drops the obsolete "not in the
  current view" warning. The completed-jobs warning is best-effort (only the
  loaded page's statuses are in memory), hedged with "at least N" when the
  selection extends beyond the loaded page.
- Clear selected_job_ids when the Jobs filter changes (apply_filter and
  clear_filter), keeping the selection scoped to the active filter so a
  cross-page reset only touches jobs in the current view -- mirroring the
  existing clear on workflow change.

The UI already shows [N selected] from the full set count, so the cross-page
total stays visible while paging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_recover_reset_sequence_bumps_run_id_once ran a job to Completed via
execute_workflow_with_job, then called reset_failed_jobs on it. With the new
recoverable_statuses allow-list, a Completed job is skipped, so reset_count==0
and reset_failed_jobs returns Err -- the test's .expect() panicked in CI.

Set the job up to fail directly (run -> Running -> complete_job(Failed))
instead of completing successfully, leaving it in a recoverable state. The
test only cares about run_id bookkeeping, not how the job reached a failed
state. (manage_status_change rejects completion statuses, and complete_job
rejects re-completing an already-Completed job, so the job must fail from the
start.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. Non-interactive duplicate retries: a failed job whose only violation was
   CPU over-utilization was classified as an unknown-cause failure (it is
   neither OOM nor timeout) while ALSO being corrected and retried via
   apply_resource_corrections. With --retry-unknown it entered jobs_to_retry
   twice; the second reset attempt was skipped with a spurious "not
   recoverable" warning and other_failures/unknown_retried were inflated.
   "Unknown" now means no correctable violation at all, and jobs_to_retry is
   deduplicated as defense in depth (the wizard path already deduped).

2. Wizard CPU category: CPU-only violators were displayed under "Unknown
   Failures" and retried without a CPU correction, inconsistent with the
   non-interactive path. They now get their own category, table (allocated
   CPUs vs peak CPU%), retry/skip prompt, and a CPUs line in the recovery
   plan.

3. Canceled-job messaging: a workflow whose only recoverable jobs are
   canceled passes check_recovery_preconditions but never yields
   jobs_to_retry (diagnosis is built from results; canceled jobs usually
   have none), producing a misleading "No recoverable jobs found. 0 job(s)
   failed with unknown causes." Both the non-interactive error and the
   wizard's early return now point at 'torc jobs reset-status --reinit'
   when canceled jobs exist.

4. EOF infinite loop: prompt_line treated EOF (Ctrl-D) as an empty line, so
   the "Enter scheduler ID" loop — the only prompt without an empty-input
   default — re-prompted forever on a closed stdin. prompt_line now errors
   on EOF, aborting the wizard cleanly from any prompt.

Adds test_recovery_cpu_violation_not_counted_as_unknown covering fix 1: a
CPU-only violator is retried exactly once and not counted as unknown, while
a genuinely unknown failure (no resource metrics) still is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom daniel-thom force-pushed the daniel-thom/fix-recover-selective-reset branch from 634763d to 31a7a89 Compare June 12, 2026 20:51
@daniel-thom daniel-thom requested a review from Copilot June 12, 2026 20:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread src/client/commands/recover.rs
Comment thread src/client/commands/recover.rs Outdated
Comment thread src/tui/app.rs Outdated
Comment thread tests/test_workflow_manager.rs Outdated
- Wizard categorization: also route likely_runtime_violation to the timeout
  bucket. It implies likely_timeout today (exec > 100% of runtime implies
  exec > 90%), but checking it explicitly keeps the wizard aligned with the
  non-interactive path if the diagnosis thresholds ever drift apart.
- reset_failed_jobs: deduplicate job_ids before the per-job loop so a
  repeated ID can't burn an extra get_job/manage_status_change round-trip
  and emit a misleading "not recoverable" skip for its duplicate.
- TUI: build a HashSet of loaded job IDs for the all_loaded check instead
  of an O(selected * loaded_page) nested scan.
- tests: use manager.get_run_id() instead of get_workflow(...).run_id
  .unwrap_or(1), which saved an API call and could mask unexpected run_id
  states; applied to both call sites for consistency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +408 to +412
let mut msg = format!(
"No recoverable jobs found. {} job(s) failed with unknown causes. \
Use --retry-unknown to retry jobs with unknown failure causes.",
result.other_failures
));
);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in b0f59cf as "No auto-recoverable jobs found". Also applied the same wording to the dry-run line, the CLI summaries in main.rs, and watch's variant of the message so the phrasing is consistent everywhere this state is reported.

Comment thread src/client/commands/recover.rs
- "No recoverable jobs found" -> "No auto-recoverable jobs found": canceled/
  terminated/pending_failed jobs ARE recoverable by reset_failed_jobs; what
  the message actually reports is that the automatic heuristics selected
  nothing. Applied to the recover error, the dry-run line, the CLI
  summaries, and watch's variant of the same message.
- "Reset N failed job(s)" -> "Reset N job(s)": the reset allow-list also
  admits terminated, canceled, and pending_failed jobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom daniel-thom merged commit 5d92a06 into NatLabRockies:main Jun 12, 2026
4 checks passed
@daniel-thom daniel-thom deleted the daniel-thom/fix-recover-selective-reset branch June 12, 2026 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants