Skip to content

[scheduler] Add scheduling mode: E-PVM#2398

Open
DiegoTavares wants to merge 5 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:sched_e-pvm
Open

[scheduler] Add scheduling mode: E-PVM#2398
DiegoTavares wants to merge 5 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:sched_e-pvm

Conversation

@DiegoTavares

@DiegoTavares DiegoTavares commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Add new scheduling mode to scheduler module that implements E-PVM, inspired by the work of Aghiles' scheduler.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added per-show scheduler ownership control via cueadmin -scheduler-managed <show> on|off command
    • Added scheduler_managed property to shows via gRPC API and Python wrappers
    • Introduced Redis-backed accounting for scheduler-managed shows
  • Configuration

    • Added Redis accounting configuration options (accounting.redis.enabled, accounting.redis.host, accounting.redis.port)
    • Extended scheduler queue configuration with new timing and sizing parameters
  • Documentation

    • Updated deployment guides and technical references for per-show scheduler ownership model
  • Version

    • Bumped version from 1.24 to 1.25

Summary by CodeRabbit

Release Notes

  • New Features

    • Integrated Redis-based resource accounting system for enhanced booking and limit management
    • Added E-PVM (energy-performance-value-matrix) host selection strategy alongside existing saturation-based selection
    • Enhanced cluster scheduling with round-trip tracking and intelligent sleep management
  • Improvements

    • Expanded host selection with profile-aware placement and configurable scoring weights
    • Added new metrics for resource accounting limits and placement performance monitoring
    • Enhanced scheduler configuration with new queue timing, dispatch caps, and resource reservation parameters

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4bcff7e-6bd2-4b33-aec7-709519ddb18a

📥 Commits

Reviewing files that changed from the base of the PR and between eac7ebf and 3147111.

📒 Files selected for processing (1)
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java
✅ Files skipped from review due to trivial changes (1)
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java

📝 Walkthrough

Walkthrough

The PR integrates Redis-backed resource accounting and placement strategy scoring into the scheduler, replacing prior resource accounting services. Facility identifiers become strings throughout the data model. The host cache checkout API changes to accept placement profiles and scoring gates, supporting both saturation (first-fit) and E-PVM (lowest-score) strategies. Cluster feed streaming is refactored with control-flow resilience and round-trip timing. Matcher logic now performs accounting prechecks and constructs placement contexts for strategy-aware host selection.

Changes

Scheduler-managed accounting and placement integration

Layer / File(s) Summary
Configuration and booking strategy contracts
rust/crates/scheduler/src/config/mod.rs, rust/config/scheduler.yaml, rust/crates/scheduler/Cargo.toml
New AccountingConfig and RedisConfig with cadence/TTL fields; QueueConfig adds cluster_empty_sleep, dispatch caps, and memory reservation knobs; HostBookingStrategy refactored into tagged enum with Saturation and Epvm variants; ScoreWeights struct introduced for E-PVM scoring parameters; YAML docs expanded to describe booking strategies and configuration defaults.
Facility/tag/layer data shape and DAO transitions
rust/crates/scheduler/src/cluster.rs, rust/crates/scheduler/src/cluster_key.rs, rust/crates/scheduler/src/models/layer.rs, rust/crates/scheduler/src/dao/cluster_dao.rs, rust/crates/scheduler/src/dao/layer_dao.rs
Cluster.facility_id, Tag, and ClusterKey switch to string-based facility IDs; Tag adds optional alloc_id field; DispatchLayer adds folder_id, dept_id, and job_max_cores fields; cluster loading logic updates tag aggregation keying and alloc_id population for ALLOC-type tags; DAO queries select/project facility and job-cap fields; get_facility_id returns String directly from DB instead of parsing Uuid.
Redis accounting client and Lua result parsing
rust/crates/scheduler/src/accounting/redis_client.rs
New async Redis client RedisAccounting with ReseedOp, BookOutcome, and BookMode types; implements connect, book, reseed_cas, get_seq, read_sub_counters, read_job_cores_in_use operations; strict parse_book_outcome validator ensures Lua BOOK_OR_FORCE responses match expected shapes (Applied or LimitExceeded).
Placement profile, scoring, and gates
rust/crates/scheduler/src/pipeline/placement.rs, rust/crates/scheduler/src/pipeline/mod.rs
New module defining LayerProfile (layer snapshot with floor/compatibility and E-PVM context); host-matching helpers for OS and thread-mode validation; fits_floor predicate checking idle resources; compute_max_more headroom computation; placement_score using W3 normalization with configurable weights; saturation_gate (first-fit validate-only) and epvm_gate (validate + score); comprehensive unit tests and proptest properties for scoring monotonicity and non-negativity.
Host cache checkout API and strategy execution
rust/crates/scheduler/src/host_cache/cache.rs, rust/crates/scheduler/src/host_cache/actor.rs, rust/crates/scheduler/src/host_cache/messages.rs, rust/crates/scheduler/src/host_cache/mod.rs
CheckOut message becomes non-generic, carrying profile and gate instead of generic validation closure; facility ID becomes String; HostCache::check_out dispatches to Saturation (first-fit) or Epvm (K-best candidate scoring) strategies; new ScoredCandidate struct and remove_host_best implementation for E-PVM; actor adds refresh_in_progress RAII guard for concurrency control; CacheRatioResponse simplified to hit_ratio only; reservation-aware checkout with expired reservation treatment; test suite updated with profile/gate helpers and new Epvm-specific tests.
Cluster feed and matcher runtime orchestration
rust/crates/scheduler/src/cluster.rs, rust/crates/scheduler/src/pipeline/entrypoint.rs, rust/crates/scheduler/src/pipeline/matcher.rs, rust/crates/scheduler/src/metrics/mod.rs
Cluster feed producer/receiver wrapped with AssertUnwindSafe/catch_unwind and ControlFlow for panic resilience; last_sent_map tracks round-trip timestamps with observe_cluster_round_trip metrics; receiver processes FeedMessage::Sleep and FeedMessage::Stop cleanly; entrypoint initializes accounting service and bootstrap reseed; matcher adds feed_sender parameter for cluster sleep signaling; process_layer performs Redis snapshot reads (job cores, subscription counters), derives allocation context, computes burst/cores-in-use, executes pre-checkout burst checks; host checkout constructs LayerProfile, selects strategy gate, tracks frame consumption and updates booked counters; new metrics for accounting-limit rejections, cluster round-trip, placement score, and E-PVM inner-retries.
Supporting scheduler tests and Cuebot doc note
rust/crates/scheduler/tests/util.rs, cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java
Test utilities updated: create_test_config adds cluster_empty_sleep and AccountingConfig::default(); create_test_data passes facility ID as string and explicitly populates tag alloc_id fields (None for manual/hostname/hardware, Some for allocation tags); Cuebot DispatchQuery.FIND_SHOWS javadoc clarifies that scheduler-managed shows are excluded because dispatch is owned by standalone Rust scheduler.

Sequence Diagram(s)

sequenceDiagram
  participant Matcher as Matcher Layer<br/>Processing
  participant Redis as Redis<br/>Accounting
  participant HostCache as Host Cache<br/>Actor
  participant HostDB as Host DB<br/>Lookup
  
  Matcher->>Redis: read job cores_in_use
  Redis-->>Matcher: job_cores_in_use (or 0)
  Matcher->>Redis: read show/alloc subscription
  Redis-->>Matcher: (int_cores, burst)
  Matcher->>Matcher: compute profile (burst,<br/>cores_in_use, weights)
  Matcher->>Matcher: select Saturation<br/>or Epvm strategy
  alt pre-checkout burst limit exceeded
    Matcher->>Matcher: emit FeedMessage::Sleep
    Matcher->>Redis: increment<br/>accounting_limit_exceeded
    Note over Matcher: Early return, no host checkout
  else proceed to checkout
    Matcher->>HostCache: send CheckOut<br/>(profile, gate, facility_id)
    HostCache->>HostDB: fetch candidate hosts<br/>by show/facility/tag
    HostDB-->>HostCache: candidate list
    alt Saturation strategy
      HostCache->>HostCache: first valid host<br/>gate(host, profile)=Some()
    else Epvm strategy
      HostCache->>HostCache: K-best by lowest score<br/>min(gate(host, profile))
      HostCache->>HostCache: CAS removal with retries
      Note over HostCache: Record inner-retries metric
    end
    HostCache-->>Matcher: Host or error
    Matcher->>Matcher: dispatch frames,<br/>track consumption
    Matcher->>Matcher: update booked counters<br/>for next iteration
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • AcademySoftwareFoundation/OpenCue#2323: Both PRs touch cuebot/.../dao/postgres/DispatchQuery.FIND_SHOWS to ensure scheduler-managed shows are excluded from Cuebot booking/dispatch.
  • AcademySoftwareFoundation/OpenCue#2198: This PR implements scheduler-side (Redis-backed) resource accounting and scheduler-managed-show dispatch, continuing the shift away from Cuebot's per-dispatch resource table updates.
  • AcademySoftwareFoundation/OpenCue#2328: Both PRs change scheduler layer-to-host matching logic; this PR introduces placement primitives including host_matches_thread_mode/validate_os_and_thread_mode.

Suggested reviewers

  • lithorus
  • ramonfigueiredo

Poem

🐰 A leap through Redis clouds we go,
Where placement scores and strategies flow,
Saturation first, E-PVM best,
The scheduler's booking put to the test!
Facility strings and profiles bright,
Host cache now chooses with weighted might.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[scheduler] Add scheduling mode: E-PVM' directly and clearly summarizes the main change. The PR adds a new E-PVM scheduling mode to the scheduler, which is the primary objective.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Nitpick comments (4)
rust/crates/scheduler/src/config/mod.rs (1)

101-105: 💤 Low value

Consider adding authentication support to Redis URL construction.

The url() method constructs a basic redis://host:port/ URL without authentication. If production deployments require Redis AUTH, this method may need enhancement to support username/password or connection string overrides.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/config/mod.rs` around lines 101 - 105, The
RedisConfig::url() method currently returns "redis://host:port/" with no auth
support; update RedisConfig (struct) to include optional credentials (e.g.,
username: Option<String>, password: Option<String>) or an optional
raw_connection_string: Option<String>, and modify RedisConfig::url() to: if
raw_connection_string is Some return it; otherwise build the URL including
"redis://username:password@host:port/" when credentials are present (omit
username if only password needed per Redis URI rules) and percent-encode
credentials as necessary to avoid malformed URLs. Ensure you reference the
RedisConfig struct and the url() method so callers keep backwards compatibility
by keeping fields optional.
rust/crates/scheduler/src/pipeline/matcher.rs (1)

306-312: ⚡ Quick win

Handle FeedMessage::Sleep send failures explicitly.

Ignoring the send result hides closed-channel failures; then this over-burst path returns without applying any backoff locally.

Suggested patch
-            let _ = feed_sender
-                .send(FeedMessage::Sleep(
-                    (*cluster).clone(),
-                    CONFIG.queue.cluster_empty_sleep,
-                ))
-                .await;
+            if let Err(err) = feed_sender
+                .send(FeedMessage::Sleep(
+                    (*cluster).clone(),
+                    CONFIG.queue.cluster_empty_sleep,
+                ))
+                .await
+            {
+                debug!(
+                    "Failed to enqueue cluster sleep for {} (over-burst path): {}",
+                    cluster, err
+                );
+                tokio::time::sleep(CONFIG.queue.cluster_empty_sleep).await;
+            }
             return;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/pipeline/matcher.rs` around lines 306 - 312, The
code currently ignores the result of
feed_sender.send(FeedMessage::Sleep((*cluster).clone(),
CONFIG.queue.cluster_empty_sleep)).await; change this to handle the Result
explicitly: if send returns Ok proceed as before, but if Err (channel closed)
log the failure (include context like cluster and that FeedMessage::Sleep
failed) and apply a local backoff before returning (e.g., await
tokio::time::sleep(Duration::from_millis(CONFIG.queue.cluster_empty_sleep)) or
equivalent) so the over-burst path does not immediately return without backoff;
update matcher.rs around the feed_sender send call to match on the Result and
handle Err accordingly.
rust/crates/scheduler/src/dao/resource_accounting_dao.rs (1)

149-152: 💤 Low value

Redundant p.pk_show = ANY($1) filter in point recompute query.

Line 151 applies p.pk_show = ANY($1) in the LEFT JOIN condition, but p.pk_show = pt2.pk_show is already constrained by the join, and pt2.pk_show = ANY($1) at line 152 already filters the point rows. The filter at line 151 is redundant.

Not a correctness issue, but removing it would simplify the query slightly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/dao/resource_accounting_dao.rs` around lines 149 -
152, Remove the redundant p.pk_show = ANY($1) predicate from the LEFT JOIN on
proc (alias p) in the point recompute SQL: the join already enforces p.pk_show =
pt2.pk_show and the WHERE clause pt2.pk_show = ANY($1) filters pt2, so delete
the p.pk_show = ANY($1) condition in the LEFT JOIN to simplify the query; update
the SQL string in resource_accounting_dao.rs where the LEFT JOIN proc p ... AND
p.pk_show = ANY($1) appears and ensure spacing/AND chaining remains valid after
removal.
pycue/tests/wrappers/test_show.py (1)

191-201: ⚡ Quick win

Cover both toggle directions in the RPC assertion.

testSetSchedulerManaged currently validates only enabled=True. Adding enabled=False in the same test tightens contract coverage for the new toggle RPC with minimal effort.

♻️ Suggested test update
     def testSetSchedulerManaged(self, getStubMock):
         stubMock = mock.Mock()
         stubMock.SetSchedulerManaged.return_value = show_pb2.ShowSetSchedulerManagedResponse()
         getStubMock.return_value = stubMock
 
         show = opencue.wrappers.show.Show(show_pb2.Show(name=TEST_SHOW_NAME))
-        show.setSchedulerManaged(True)
-
-        stubMock.SetSchedulerManaged.assert_called_with(
-            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True),
-            timeout=mock.ANY)
+        show.setSchedulerManaged(True)
+        show.setSchedulerManaged(False)
+
+        stubMock.SetSchedulerManaged.assert_any_call(
+            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True),
+            timeout=mock.ANY)
+        stubMock.SetSchedulerManaged.assert_any_call(
+            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=False),
+            timeout=mock.ANY)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pycue/tests/wrappers/test_show.py` around lines 191 - 201, The test only
asserts the RPC for enabling the scheduler; update testSetSchedulerManaged to
exercise both toggle directions by invoking show.setSchedulerManaged(True) and
show.setSchedulerManaged(False) (or call once and then again) and assert the
stubMock.SetSchedulerManaged was called for both cases using either two
assert_called_with checks referencing
show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True) and
...enabled=False) or a single assert_has_calls with both expected requests;
target symbols: testSetSchedulerManaged, show.setSchedulerManaged,
stubMock.SetSchedulerManaged, and ShowSetSchedulerManagedRequest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cuebot/build.gradle`:
- Around line 41-44: Update the Lettuce dependency to a non-vulnerable release
by changing the dependency declaration for io.lettuce:lettuce-core (currently
'6.3.2.RELEASE') to at least '6.5.1.RELEASE' (preferably the latest, e.g.
'6.7.1.RELEASE'); keep the explicit reactor-core override for
io.projectreactor:reactor-core at the 3.6.x line (e.g. '3.6.5') so Lettuce
compatibility is preserved and remove/replace the old version reference wherever
io.lettuce:lettuce-core '6.3.2.RELEASE' is declared.

In
`@cuebot/src/main/resources/conf/ddl/postgres/migrations/V44__scheduler_pending_query_indexes.sql`:
- Line 20: The manual backfill example in
V44__scheduler_pending_query_indexes.sql incorrectly inserts version 40 into
flyway_schema_history; update the VALUES tuple to use 44 to match this migration
(ensure the INSERT INTO flyway_schema_history(... ) VALUES (..., 44, ... )
reflects the V44__scheduler_pending_query_indexes.sql migration) so the
flyway_schema_history row matches the migration version and prevents poisoning
subsequent runs.

In `@docs/_docs/developer-guide/redis-accounting.md`:
- Around line 247-253: The two unlabeled fenced code blocks that contain the
numbered acct booking steps and the cueadmin command examples should be given
explicit languages to satisfy markdownlint MD040: change the fence wrapping the
numbered list (the block starting "1. Read current state of acct:sub /
acct:folder / acct:job / acct:layer / acct:point") to use a text label (e.g.,
```text) and change the fence wrapping the cueadmin examples (the block with
"cueadmin -show <name> -setSchedulerManaged true/false") to use a shell label
(e.g., ```bash); update both occurrences (the one around the acct steps and the
one around the cueadmin commands) accordingly.
- Line 14: The heading "How Cuebot and the Rust scheduler coordinate per-show
resource accounting through Redis" is using ### causing a level jump from # to
### and failing MD001; change that heading token from "###" to "##" so it
follows the top-level "#" and fixes the markdown lint error (update the heading
line where that exact text appears).

In `@docs/_docs/developer-guide/scheduler.md`:
- Around line 648-654: The fenced code block listing the scheduler steps
(starting "1. Read current state of acct:sub / acct:folder / acct:job /
acct:layer / acct:point" through "5. Then transactionally INSERT proc in
Postgres (outside Lua)") is unlabeled and triggers MD040; add a language label
to the opening triple-backtick (e.g., ```text) to satisfy the linter and
preserve formatting so the block is treated as a code/preformatted block.

In `@docs/_docs/getting-started/deploying-scheduler.md`:
- Around line 262-265: The docs contain inconsistent cueadmin usage: keep the
correct form "cueadmin -scheduler-managed <SHOW> ON|OFF" and replace any
incorrect examples using "-show <name> -setSchedulerManaged true|false" (look
for occurrences of the flags -setSchedulerManaged and -show in docs, e.g.,
developer-guide scheduler and redis-accounting sections); update those examples
to use the exact flag and value format "-scheduler-managed <SHOW> ON" or "OFF"
so all docs match the working CLI syntax.

In `@pycue/opencue/wrappers/show.py`:
- Around line 223-235: setSchedulerManaged currently calls
stub.SetSchedulerManaged but doesn't update the wrapper's local state, so
subsequent schedulerManaged() reads can be stale; after calling
SetSchedulerManaged (in setSchedulerManaged), assign the returned value to
self.data.scheduler_managed (or set self.data.scheduler_managed = value if the
RPC returns no body) using the response from stub.SetSchedulerManaged to
synchronize the wrapper, ensuring schedulerManaged() reflects the new state.

In `@rust/crates/scheduler/src/accounting/dao.rs`:
- Around line 146-157: The query_managed_show_ids function currently maps DB
rows using parse_uuid which unwraps and will panic on malformed UUIDs; change
the handling to use a fallible parse instead: update parse_uuid (in
dao/helpers.rs) to return Result<Uuid, Report> or provide a new try_parse_uuid
that calls Uuid::parse_str and returns an Err on failure, then modify
query_managed_show_ids (and other callers) to collect a Result<Vec<Uuid>, _> or
filter/log and skip invalid rows rather than calling unwrap—propagate the error
with wrap_err for diagnostics or log and skip malformed rows so a single corrupt
DB value cannot crash the scheduler.

In `@rust/crates/scheduler/src/accounting/recompute.rs`:
- Around line 46-52: The current recompute task exits permanently if
ResourceAccountingDao::new().await returns Err, causing a long-lived outage;
change the initialization to retry instead of returning: wrap the call to
ResourceAccountingDao::new().await (the code creating pg_dao/Arc::new(d)) in a
retry loop with exponential backoff (or fixed delay) and error logging, continue
retrying until success, and only proceed to create pg_dao and enter the
recompute loop after a successful Result; ensure any temporary failures do not
cause an early return from the recompute task.

In `@rust/crates/scheduler/src/accounting/redis_client.rs`:
- Around line 142-144: Update the unit wording in the Redis client docs: change
"centicores" to "cores" in the doc comment that begins "Reads the subscription
hash's booked cores + burst in one round-trip..." and the other similar doc
comment around the subsequent block (the occurrences currently at the same doc
block and the one at lines ~161-162). Ensure both doc comments and any inline
mention in redis_client.rs consistently state that Redis counters operate in
cores (not centicores) so the Lua contract and callers are not misled.
- Line 155: The call currently uses conn.hget(&key, &["int_cores",
"burst"]).await? which is a single-field read API but is passing multiple
fields; change this to the multi-field API conn.hmget(&key, &["int_cores",
"burst"]).await? (keeping the expected type Vec<Option<i64>> and the surrounding
error handling) so the Redis client sends HMGET for the fields "int_cores" and
"burst" and maps into values as before.

In `@rust/crates/scheduler/src/cluster.rs`:
- Around line 418-423: The code can panic when current_index_atomic contains an
out-of-range value after self.clusters shrinks; before indexing clusters use the
current clusters length to clamp the index (and handle empty clusters).
Specifically, in the block around current_index_atomic.load, get let len =
clusters.len(); if len == 0 return/skip; compute a safe_index = current_index %
len (or safe_index = min(current_index, len-1)), then read item =
clusters[safe_index].clone(); compute next_index = (safe_index + 1) % len and
store that back into current_index_atomic; this prevents indexing with stale,
too-large indices while preserving wrap-around detection and forward progress.

In `@rust/crates/scheduler/src/models/core_size.rs`:
- Around line 68-71: The CoreSize conversion functions
(CoreSize::from_multiplied and CoreSize::from_multiplied_cap) perform unchecked
i64 -> i32 narrowing using `as i32`, which can silently wrap; change them to use
a checked conversion (e.g., try_into / i32::try_from) or explicit saturation
with bounds checking against i32::MIN/i32::MAX and handle failures by returning
a Result or clamping to safe limits. Update the functions that reference
CONFIG.queue.core_multiplier to compute the intermediate i64, check for
overflow/underflow before converting, and either return an error (propagate
Result) or clamp to i32 bounds and document the behavior so no silent wrapping
occurs.

In `@rust/crates/scheduler/src/pipeline/dispatcher/actor.rs`:
- Around line 482-491: BookingDelta is being built with DispatchLayer.folder_id
and dept_id that may be Uuid::nil(), which would produce incorrect
acct:folder:{folder_id} and acct:point:{dept_id}:{show_id} Redis keys; add an
explicit nil-UUID guard before constructing BookingDelta (best place: inside
dispatch_virtual_proc right before creating BookingDelta or at the callsite in
dispatch_inner/consume_host_virtual_resources) that checks
DispatchLayer.folder_id.is_nil() and DispatchLayer.dept_id.is_nil() and
rejects/returns an Err (or skips booking) with a logged error; ensure the check
references the DispatchLayer fields and prevents creating BookingDelta when
either UUID is nil so no acct:folder or acct:point keys are written for the zero
UUID.

---

Nitpick comments:
In `@pycue/tests/wrappers/test_show.py`:
- Around line 191-201: The test only asserts the RPC for enabling the scheduler;
update testSetSchedulerManaged to exercise both toggle directions by invoking
show.setSchedulerManaged(True) and show.setSchedulerManaged(False) (or call once
and then again) and assert the stubMock.SetSchedulerManaged was called for both
cases using either two assert_called_with checks referencing
show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True) and
...enabled=False) or a single assert_has_calls with both expected requests;
target symbols: testSetSchedulerManaged, show.setSchedulerManaged,
stubMock.SetSchedulerManaged, and ShowSetSchedulerManagedRequest.

In `@rust/crates/scheduler/src/config/mod.rs`:
- Around line 101-105: The RedisConfig::url() method currently returns
"redis://host:port/" with no auth support; update RedisConfig (struct) to
include optional credentials (e.g., username: Option<String>, password:
Option<String>) or an optional raw_connection_string: Option<String>, and modify
RedisConfig::url() to: if raw_connection_string is Some return it; otherwise
build the URL including "redis://username:password@host:port/" when credentials
are present (omit username if only password needed per Redis URI rules) and
percent-encode credentials as necessary to avoid malformed URLs. Ensure you
reference the RedisConfig struct and the url() method so callers keep backwards
compatibility by keeping fields optional.

In `@rust/crates/scheduler/src/dao/resource_accounting_dao.rs`:
- Around line 149-152: Remove the redundant p.pk_show = ANY($1) predicate from
the LEFT JOIN on proc (alias p) in the point recompute SQL: the join already
enforces p.pk_show = pt2.pk_show and the WHERE clause pt2.pk_show = ANY($1)
filters pt2, so delete the p.pk_show = ANY($1) condition in the LEFT JOIN to
simplify the query; update the SQL string in resource_accounting_dao.rs where
the LEFT JOIN proc p ... AND p.pk_show = ANY($1) appears and ensure spacing/AND
chaining remains valid after removal.

In `@rust/crates/scheduler/src/pipeline/matcher.rs`:
- Around line 306-312: The code currently ignores the result of
feed_sender.send(FeedMessage::Sleep((*cluster).clone(),
CONFIG.queue.cluster_empty_sleep)).await; change this to handle the Result
explicitly: if send returns Ok proceed as before, but if Err (channel closed)
log the failure (include context like cluster and that FeedMessage::Sleep
failed) and apply a local backoff before returning (e.g., await
tokio::time::sleep(Duration::from_millis(CONFIG.queue.cluster_empty_sleep)) or
equivalent) so the over-burst path does not immediately return without backoff;
update matcher.rs around the feed_sender send call to match on the Result and
handle Err accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a7aa165-b9e9-4987-a82f-97ec0dd9c477

📥 Commits

Reviewing files that changed from the base of the PR and between e504bc9 and eac7ebf.

📒 Files selected for processing (82)
  • .gitignore
  • VERSION.in
  • cueadmin/cueadmin/common.py
  • cueadmin/tests/test_common.py
  • cuebot/build.gradle
  • cuebot/src/main/java/com/imageworks/spcue/ShowEntity.java
  • cuebot/src/main/java/com/imageworks/spcue/VirtualProc.java
  • cuebot/src/main/java/com/imageworks/spcue/config/AppConfig.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/ShowDao.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatcherDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/ProcDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/ShowDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/WhiteboardDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/servant/ManageShow.java
  • cuebot/src/main/java/com/imageworks/spcue/service/AccountingRedisPublisher.java
  • cuebot/src/main/java/com/imageworks/spcue/service/LettuceAccountingRedisPublisher.java
  • cuebot/src/main/resources/conf/ddl/postgres/migrations/V44__scheduler_pending_query_indexes.sql
  • cuebot/src/main/resources/conf/ddl/postgres/migrations/V45__show_scheduler_managed.sql
  • cuebot/src/main/resources/conf/spring/applicationContext-accounting.xml
  • cuebot/src/main/resources/opencue.properties
  • cuebot/src/test/java/com/imageworks/spcue/config/TestAppConfig.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/DispatcherDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ProcDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ShowDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/servant/ManageShowTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/service/LettuceAccountingRedisPublisherTests.java
  • docs/_docs/developer-guide/redis-accounting.md
  • docs/_docs/developer-guide/scheduler.md
  • docs/_docs/getting-started/deploying-scheduler.md
  • docs/news/2025-12-12-distributed-scheduler-release.md
  • proto/src/show.proto
  • pycue/opencue/wrappers/show.py
  • pycue/tests/wrappers/test_show.py
  • rust/config/scheduler.yaml
  • rust/crates/scheduler/Cargo.toml
  • rust/crates/scheduler/src/accounting/booking_delta.rs
  • rust/crates/scheduler/src/accounting/bootstrap.rs
  • rust/crates/scheduler/src/accounting/dao.rs
  • rust/crates/scheduler/src/accounting/error.rs
  • rust/crates/scheduler/src/accounting/limit_reseed.rs
  • rust/crates/scheduler/src/accounting/lua.rs
  • rust/crates/scheduler/src/accounting/managed_shows.rs
  • rust/crates/scheduler/src/accounting/mod.rs
  • rust/crates/scheduler/src/accounting/recompute.rs
  • rust/crates/scheduler/src/accounting/redis_client.rs
  • rust/crates/scheduler/src/cluster.rs
  • rust/crates/scheduler/src/cluster_key.rs
  • rust/crates/scheduler/src/config/mod.rs
  • rust/crates/scheduler/src/dao/cluster_dao.rs
  • rust/crates/scheduler/src/dao/frame_dao.rs
  • rust/crates/scheduler/src/dao/host_dao.rs
  • rust/crates/scheduler/src/dao/job_dao.rs
  • rust/crates/scheduler/src/dao/layer_dao.rs
  • rust/crates/scheduler/src/dao/mod.rs
  • rust/crates/scheduler/src/dao/proc_dao.rs
  • rust/crates/scheduler/src/dao/resource_accounting_dao.rs
  • rust/crates/scheduler/src/host_cache/actor.rs
  • rust/crates/scheduler/src/host_cache/cache.rs
  • rust/crates/scheduler/src/host_cache/messages.rs
  • rust/crates/scheduler/src/host_cache/mod.rs
  • rust/crates/scheduler/src/lib.rs
  • rust/crates/scheduler/src/main.rs
  • rust/crates/scheduler/src/metrics/mod.rs
  • rust/crates/scheduler/src/models/core_size.rs
  • rust/crates/scheduler/src/models/frame.rs
  • rust/crates/scheduler/src/models/layer.rs
  • rust/crates/scheduler/src/models/mod.rs
  • rust/crates/scheduler/src/models/subscription.rs
  • rust/crates/scheduler/src/models/virtual_proc.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/actor.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/error.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/mod.rs
  • rust/crates/scheduler/src/pipeline/entrypoint.rs
  • rust/crates/scheduler/src/pipeline/layer_permit.rs
  • rust/crates/scheduler/src/pipeline/matcher.rs
  • rust/crates/scheduler/src/pipeline/mod.rs
  • rust/crates/scheduler/src/pipeline/placement.rs
  • rust/crates/scheduler/src/resource_accounting.rs
  • rust/crates/scheduler/tests/redis_integration.rs
  • rust/crates/scheduler/tests/smoke_tests.rs
  • rust/crates/scheduler/tests/util.rs
💤 Files with no reviewable changes (5)
  • rust/crates/scheduler/src/resource_accounting.rs
  • rust/crates/scheduler/src/models/mod.rs
  • rust/crates/scheduler/src/models/subscription.rs
  • rust/crates/scheduler/src/dao/mod.rs
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatcherDaoJdbc.java

Comment thread cuebot/build.gradle
Comment on lines +41 to +44
implementation group: 'io.lettuce', name: 'lettuce-core', version: '6.3.2.RELEASE'
// Lettuce 6.3.x requires reactor-core 3.6.x (for reactor.core.publisher.Sinks).
// Spring Boot 2.2.1's dependency management would otherwise pin this to 3.3.x.
implementation group: 'io.projectreactor', name: 'reactor-core', version: '3.6.5'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for security advisories on the added dependencies

echo "=== Checking Lettuce security advisories ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "io.lettuce:lettuce-core") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

echo ""
echo "=== Checking Reactor Core security advisories ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "io.projectreactor:reactor-core") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

echo ""
echo "=== Checking latest Maven Central versions ==="
curl -s "https://search.maven.org/solrsearch/select?q=g:io.lettuce+AND+a:lettuce-core&rows=1&wt=json" | jq '.response.docs[0].latestVersion'
curl -s "https://search.maven.org/solrsearch/select?q=g:io.projectreactor+AND+a:reactor-core&rows=1&wt=json" | jq '.response.docs[0].latestVersion'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 593


Update Lettuce dependency due to known vulnerability

io.lettuce:lettuce-core:6.3.2.RELEASE falls under a known MODERATE advisory range (< 6.5.1.RELEASE: “Withdrawn Advisory: Netty vulnerability included in redis lettuce”); upgrade to io.lettuce:lettuce-core:6.5.1.RELEASE or newer (latest on Maven Central: 6.7.1.RELEASE) and keep reactor-core on the required 3.6.x line for Lettuce compatibility. io.projectreactor:reactor-core:3.6.5 did not return any advisories in the checked results.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cuebot/build.gradle` around lines 41 - 44, Update the Lettuce dependency to a
non-vulnerable release by changing the dependency declaration for
io.lettuce:lettuce-core (currently '6.3.2.RELEASE') to at least '6.5.1.RELEASE'
(preferably the latest, e.g. '6.7.1.RELEASE'); keep the explicit reactor-core
override for io.projectreactor:reactor-core at the 3.6.x line (e.g. '3.6.5') so
Lettuce compatibility is preserved and remove/replace the old version reference
wherever io.lettuce:lettuce-core '6.3.2.RELEASE' is declared.

Comment thread docs/_docs/developer-guide/redis-accounting.md Outdated
Comment thread docs/_docs/developer-guide/redis-accounting.md Outdated
Comment thread docs/_docs/developer-guide/scheduler.md Outdated
Comment thread rust/crates/scheduler/src/accounting/redis_client.rs
Comment thread rust/crates/scheduler/src/accounting/redis_client.rs
Comment thread rust/crates/scheduler/src/cluster.rs Outdated
Comment thread rust/crates/scheduler/src/models/core_size.rs
Comment thread rust/crates/scheduler/src/pipeline/dispatcher/actor.rs
@DiegoTavares DiegoTavares changed the title Add scheduling mode: E-PVM [scheduler] Add scheduling mode: E-PVM Jun 9, 2026
Add an opt-in `Epvm` host-booking strategy alongside the existing `Saturation`
strategy (which stays the default). When `Epvm` is enabled, the host cache
scores candidate hosts using E-PVM stranding and picks the lowest-scoring host
within a configurable scan cap, replacing today's first-fit-by-saturation
iteration for the same call.
@DiegoTavares

Copy link
Copy Markdown
Collaborator Author

Actionable comments posted: 14

🧹 Nitpick comments (4)

rust/crates/scheduler/src/config/mod.rs (1)> 101-105: 💤 Low value

Consider adding authentication support to Redis URL construction.
The url() method constructs a basic redis://host:port/ URL without authentication. If production deployments require Redis AUTH, this method may need enhancement to support username/password or connection string overrides.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/config/mod.rs` around lines 101 - 105, The
RedisConfig::url() method currently returns "redis://host:port/" with no auth
support; update RedisConfig (struct) to include optional credentials (e.g.,
username: Option<String>, password: Option<String>) or an optional
raw_connection_string: Option<String>, and modify RedisConfig::url() to: if
raw_connection_string is Some return it; otherwise build the URL including
"redis://username:password@host:port/" when credentials are present (omit
username if only password needed per Redis URI rules) and percent-encode
credentials as necessary to avoid malformed URLs. Ensure you reference the
RedisConfig struct and the url() method so callers keep backwards compatibility
by keeping fields optional.

rust/crates/scheduler/src/pipeline/matcher.rs (1)> 306-312: ⚡ Quick win

Handle FeedMessage::Sleep send failures explicitly.
Ignoring the send result hides closed-channel failures; then this over-burst path returns without applying any backoff locally.

Suggested patch

-            let _ = feed_sender
-                .send(FeedMessage::Sleep(
-                    (*cluster).clone(),
-                    CONFIG.queue.cluster_empty_sleep,
-                ))
-                .await;
+            if let Err(err) = feed_sender
+                .send(FeedMessage::Sleep(
+                    (*cluster).clone(),
+                    CONFIG.queue.cluster_empty_sleep,
+                ))
+                .await
+            {
+                debug!(
+                    "Failed to enqueue cluster sleep for {} (over-burst path): {}",
+                    cluster, err
+                );
+                tokio::time::sleep(CONFIG.queue.cluster_empty_sleep).await;
+            }
             return;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/pipeline/matcher.rs` around lines 306 - 312, The
code currently ignores the result of
feed_sender.send(FeedMessage::Sleep((*cluster).clone(),
CONFIG.queue.cluster_empty_sleep)).await; change this to handle the Result
explicitly: if send returns Ok proceed as before, but if Err (channel closed)
log the failure (include context like cluster and that FeedMessage::Sleep
failed) and apply a local backoff before returning (e.g., await
tokio::time::sleep(Duration::from_millis(CONFIG.queue.cluster_empty_sleep)) or
equivalent) so the over-burst path does not immediately return without backoff;
update matcher.rs around the feed_sender send call to match on the Result and
handle Err accordingly.

rust/crates/scheduler/src/dao/resource_accounting_dao.rs (1)> 149-152: 💤 Low value

Redundant p.pk_show = ANY($1) filter in point recompute query.
Line 151 applies p.pk_show = ANY($1) in the LEFT JOIN condition, but p.pk_show = pt2.pk_show is already constrained by the join, and pt2.pk_show = ANY($1) at line 152 already filters the point rows. The filter at line 151 is redundant.
Not a correctness issue, but removing it would simplify the query slightly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/dao/resource_accounting_dao.rs` around lines 149 -
152, Remove the redundant p.pk_show = ANY($1) predicate from the LEFT JOIN on
proc (alias p) in the point recompute SQL: the join already enforces p.pk_show =
pt2.pk_show and the WHERE clause pt2.pk_show = ANY($1) filters pt2, so delete
the p.pk_show = ANY($1) condition in the LEFT JOIN to simplify the query; update
the SQL string in resource_accounting_dao.rs where the LEFT JOIN proc p ... AND
p.pk_show = ANY($1) appears and ensure spacing/AND chaining remains valid after
removal.

pycue/tests/wrappers/test_show.py (1)> 191-201: ⚡ Quick win

Cover both toggle directions in the RPC assertion.
testSetSchedulerManaged currently validates only enabled=True. Adding enabled=False in the same test tightens contract coverage for the new toggle RPC with minimal effort.

♻️ Suggested test update

     def testSetSchedulerManaged(self, getStubMock):
         stubMock = mock.Mock()
         stubMock.SetSchedulerManaged.return_value = show_pb2.ShowSetSchedulerManagedResponse()
         getStubMock.return_value = stubMock
 
         show = opencue.wrappers.show.Show(show_pb2.Show(name=TEST_SHOW_NAME))
-        show.setSchedulerManaged(True)
-
-        stubMock.SetSchedulerManaged.assert_called_with(
-            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True),
-            timeout=mock.ANY)
+        show.setSchedulerManaged(True)
+        show.setSchedulerManaged(False)
+
+        stubMock.SetSchedulerManaged.assert_any_call(
+            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True),
+            timeout=mock.ANY)
+        stubMock.SetSchedulerManaged.assert_any_call(
+            show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=False),
+            timeout=mock.ANY)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pycue/tests/wrappers/test_show.py` around lines 191 - 201, The test only
asserts the RPC for enabling the scheduler; update testSetSchedulerManaged to
exercise both toggle directions by invoking show.setSchedulerManaged(True) and
show.setSchedulerManaged(False) (or call once and then again) and assert the
stubMock.SetSchedulerManaged was called for both cases using either two
assert_called_with checks referencing
show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True) and
...enabled=False) or a single assert_has_calls with both expected requests;
target symbols: testSetSchedulerManaged, show.setSchedulerManaged,
stubMock.SetSchedulerManaged, and ShowSetSchedulerManagedRequest.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cuebot/build.gradle`:
- Around line 41-44: Update the Lettuce dependency to a non-vulnerable release
by changing the dependency declaration for io.lettuce:lettuce-core (currently
'6.3.2.RELEASE') to at least '6.5.1.RELEASE' (preferably the latest, e.g.
'6.7.1.RELEASE'); keep the explicit reactor-core override for
io.projectreactor:reactor-core at the 3.6.x line (e.g. '3.6.5') so Lettuce
compatibility is preserved and remove/replace the old version reference wherever
io.lettuce:lettuce-core '6.3.2.RELEASE' is declared.

In
`@cuebot/src/main/resources/conf/ddl/postgres/migrations/V44__scheduler_pending_query_indexes.sql`:
- Line 20: The manual backfill example in
V44__scheduler_pending_query_indexes.sql incorrectly inserts version 40 into
flyway_schema_history; update the VALUES tuple to use 44 to match this migration
(ensure the INSERT INTO flyway_schema_history(... ) VALUES (..., 44, ... )
reflects the V44__scheduler_pending_query_indexes.sql migration) so the
flyway_schema_history row matches the migration version and prevents poisoning
subsequent runs.

In `@docs/_docs/developer-guide/redis-accounting.md`:
- Around line 247-253: The two unlabeled fenced code blocks that contain the
numbered acct booking steps and the cueadmin command examples should be given
explicit languages to satisfy markdownlint MD040: change the fence wrapping the
numbered list (the block starting "1. Read current state of acct:sub /
acct:folder / acct:job / acct:layer / acct:point") to use a text label (e.g.,
```text) and change the fence wrapping the cueadmin examples (the block with
"cueadmin -show <name> -setSchedulerManaged true/false") to use a shell label
(e.g., ```bash); update both occurrences (the one around the acct steps and the
one around the cueadmin commands) accordingly.
- Line 14: The heading "How Cuebot and the Rust scheduler coordinate per-show
resource accounting through Redis" is using ### causing a level jump from # to
### and failing MD001; change that heading token from "###" to "##" so it
follows the top-level "#" and fixes the markdown lint error (update the heading
line where that exact text appears).

In `@docs/_docs/developer-guide/scheduler.md`:
- Around line 648-654: The fenced code block listing the scheduler steps
(starting "1. Read current state of acct:sub / acct:folder / acct:job /
acct:layer / acct:point" through "5. Then transactionally INSERT proc in
Postgres (outside Lua)") is unlabeled and triggers MD040; add a language label
to the opening triple-backtick (e.g., ```text) to satisfy the linter and
preserve formatting so the block is treated as a code/preformatted block.

In `@docs/_docs/getting-started/deploying-scheduler.md`:
- Around line 262-265: The docs contain inconsistent cueadmin usage: keep the
correct form "cueadmin -scheduler-managed <SHOW> ON|OFF" and replace any
incorrect examples using "-show <name> -setSchedulerManaged true|false" (look
for occurrences of the flags -setSchedulerManaged and -show in docs, e.g.,
developer-guide scheduler and redis-accounting sections); update those examples
to use the exact flag and value format "-scheduler-managed <SHOW> ON" or "OFF"
so all docs match the working CLI syntax.

In `@pycue/opencue/wrappers/show.py`:
- Around line 223-235: setSchedulerManaged currently calls
stub.SetSchedulerManaged but doesn't update the wrapper's local state, so
subsequent schedulerManaged() reads can be stale; after calling
SetSchedulerManaged (in setSchedulerManaged), assign the returned value to
self.data.scheduler_managed (or set self.data.scheduler_managed = value if the
RPC returns no body) using the response from stub.SetSchedulerManaged to
synchronize the wrapper, ensuring schedulerManaged() reflects the new state.

In `@rust/crates/scheduler/src/accounting/dao.rs`:
- Around line 146-157: The query_managed_show_ids function currently maps DB
rows using parse_uuid which unwraps and will panic on malformed UUIDs; change
the handling to use a fallible parse instead: update parse_uuid (in
dao/helpers.rs) to return Result<Uuid, Report> or provide a new try_parse_uuid
that calls Uuid::parse_str and returns an Err on failure, then modify
query_managed_show_ids (and other callers) to collect a Result<Vec<Uuid>, _> or
filter/log and skip invalid rows rather than calling unwrap—propagate the error
with wrap_err for diagnostics or log and skip malformed rows so a single corrupt
DB value cannot crash the scheduler.

In `@rust/crates/scheduler/src/accounting/recompute.rs`:
- Around line 46-52: The current recompute task exits permanently if
ResourceAccountingDao::new().await returns Err, causing a long-lived outage;
change the initialization to retry instead of returning: wrap the call to
ResourceAccountingDao::new().await (the code creating pg_dao/Arc::new(d)) in a
retry loop with exponential backoff (or fixed delay) and error logging, continue
retrying until success, and only proceed to create pg_dao and enter the
recompute loop after a successful Result; ensure any temporary failures do not
cause an early return from the recompute task.

In `@rust/crates/scheduler/src/accounting/redis_client.rs`:
- Around line 142-144: Update the unit wording in the Redis client docs: change
"centicores" to "cores" in the doc comment that begins "Reads the subscription
hash's booked cores + burst in one round-trip..." and the other similar doc
comment around the subsequent block (the occurrences currently at the same doc
block and the one at lines ~161-162). Ensure both doc comments and any inline
mention in redis_client.rs consistently state that Redis counters operate in
cores (not centicores) so the Lua contract and callers are not misled.
- Line 155: The call currently uses conn.hget(&key, &["int_cores",
"burst"]).await? which is a single-field read API but is passing multiple
fields; change this to the multi-field API conn.hmget(&key, &["int_cores",
"burst"]).await? (keeping the expected type Vec<Option<i64>> and the surrounding
error handling) so the Redis client sends HMGET for the fields "int_cores" and
"burst" and maps into values as before.

In `@rust/crates/scheduler/src/cluster.rs`:
- Around line 418-423: The code can panic when current_index_atomic contains an
out-of-range value after self.clusters shrinks; before indexing clusters use the
current clusters length to clamp the index (and handle empty clusters).
Specifically, in the block around current_index_atomic.load, get let len =
clusters.len(); if len == 0 return/skip; compute a safe_index = current_index %
len (or safe_index = min(current_index, len-1)), then read item =
clusters[safe_index].clone(); compute next_index = (safe_index + 1) % len and
store that back into current_index_atomic; this prevents indexing with stale,
too-large indices while preserving wrap-around detection and forward progress.

In `@rust/crates/scheduler/src/models/core_size.rs`:
- Around line 68-71: The CoreSize conversion functions
(CoreSize::from_multiplied and CoreSize::from_multiplied_cap) perform unchecked
i64 -> i32 narrowing using `as i32`, which can silently wrap; change them to use
a checked conversion (e.g., try_into / i32::try_from) or explicit saturation
with bounds checking against i32::MIN/i32::MAX and handle failures by returning
a Result or clamping to safe limits. Update the functions that reference
CONFIG.queue.core_multiplier to compute the intermediate i64, check for
overflow/underflow before converting, and either return an error (propagate
Result) or clamp to i32 bounds and document the behavior so no silent wrapping
occurs.

In `@rust/crates/scheduler/src/pipeline/dispatcher/actor.rs`:
- Around line 482-491: BookingDelta is being built with DispatchLayer.folder_id
and dept_id that may be Uuid::nil(), which would produce incorrect
acct:folder:{folder_id} and acct:point:{dept_id}:{show_id} Redis keys; add an
explicit nil-UUID guard before constructing BookingDelta (best place: inside
dispatch_virtual_proc right before creating BookingDelta or at the callsite in
dispatch_inner/consume_host_virtual_resources) that checks
DispatchLayer.folder_id.is_nil() and DispatchLayer.dept_id.is_nil() and
rejects/returns an Err (or skips booking) with a logged error; ensure the check
references the DispatchLayer fields and prevents creating BookingDelta when
either UUID is nil so no acct:folder or acct:point keys are written for the zero
UUID.

---

Nitpick comments:
In `@pycue/tests/wrappers/test_show.py`:
- Around line 191-201: The test only asserts the RPC for enabling the scheduler;
update testSetSchedulerManaged to exercise both toggle directions by invoking
show.setSchedulerManaged(True) and show.setSchedulerManaged(False) (or call once
and then again) and assert the stubMock.SetSchedulerManaged was called for both
cases using either two assert_called_with checks referencing
show_pb2.ShowSetSchedulerManagedRequest(show=show.data, enabled=True) and
...enabled=False) or a single assert_has_calls with both expected requests;
target symbols: testSetSchedulerManaged, show.setSchedulerManaged,
stubMock.SetSchedulerManaged, and ShowSetSchedulerManagedRequest.

In `@rust/crates/scheduler/src/config/mod.rs`:
- Around line 101-105: The RedisConfig::url() method currently returns
"redis://host:port/" with no auth support; update RedisConfig (struct) to
include optional credentials (e.g., username: Option<String>, password:
Option<String>) or an optional raw_connection_string: Option<String>, and modify
RedisConfig::url() to: if raw_connection_string is Some return it; otherwise
build the URL including "redis://username:password@host:port/" when credentials
are present (omit username if only password needed per Redis URI rules) and
percent-encode credentials as necessary to avoid malformed URLs. Ensure you
reference the RedisConfig struct and the url() method so callers keep backwards
compatibility by keeping fields optional.

In `@rust/crates/scheduler/src/dao/resource_accounting_dao.rs`:
- Around line 149-152: Remove the redundant p.pk_show = ANY($1) predicate from
the LEFT JOIN on proc (alias p) in the point recompute SQL: the join already
enforces p.pk_show = pt2.pk_show and the WHERE clause pt2.pk_show = ANY($1)
filters pt2, so delete the p.pk_show = ANY($1) condition in the LEFT JOIN to
simplify the query; update the SQL string in resource_accounting_dao.rs where
the LEFT JOIN proc p ... AND p.pk_show = ANY($1) appears and ensure spacing/AND
chaining remains valid after removal.

In `@rust/crates/scheduler/src/pipeline/matcher.rs`:
- Around line 306-312: The code currently ignores the result of
feed_sender.send(FeedMessage::Sleep((*cluster).clone(),
CONFIG.queue.cluster_empty_sleep)).await; change this to handle the Result
explicitly: if send returns Ok proceed as before, but if Err (channel closed)
log the failure (include context like cluster and that FeedMessage::Sleep
failed) and apply a local backoff before returning (e.g., await
tokio::time::sleep(Duration::from_millis(CONFIG.queue.cluster_empty_sleep)) or
equivalent) so the over-burst path does not immediately return without backoff;
update matcher.rs around the feed_sender send call to match on the Result and
handle Err accordingly.

🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a7aa165-b9e9-4987-a82f-97ec0dd9c477

📥 Commits
Reviewing files that changed from the base of the PR and between e504bc9 and eac7ebf.

📒 Files selected for processing (82)

  • .gitignore
  • VERSION.in
  • cueadmin/cueadmin/common.py
  • cueadmin/tests/test_common.py
  • cuebot/build.gradle
  • cuebot/src/main/java/com/imageworks/spcue/ShowEntity.java
  • cuebot/src/main/java/com/imageworks/spcue/VirtualProc.java
  • cuebot/src/main/java/com/imageworks/spcue/config/AppConfig.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/ShowDao.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatcherDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/ProcDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/ShowDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/dao/postgres/WhiteboardDaoJdbc.java
  • cuebot/src/main/java/com/imageworks/spcue/servant/ManageShow.java
  • cuebot/src/main/java/com/imageworks/spcue/service/AccountingRedisPublisher.java
  • cuebot/src/main/java/com/imageworks/spcue/service/LettuceAccountingRedisPublisher.java
  • cuebot/src/main/resources/conf/ddl/postgres/migrations/V44__scheduler_pending_query_indexes.sql
  • cuebot/src/main/resources/conf/ddl/postgres/migrations/V45__show_scheduler_managed.sql
  • cuebot/src/main/resources/conf/spring/applicationContext-accounting.xml
  • cuebot/src/main/resources/opencue.properties
  • cuebot/src/test/java/com/imageworks/spcue/config/TestAppConfig.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/DispatcherDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ProcDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/dao/postgres/ShowDaoTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/servant/ManageShowTests.java
  • cuebot/src/test/java/com/imageworks/spcue/test/service/LettuceAccountingRedisPublisherTests.java
  • docs/_docs/developer-guide/redis-accounting.md
  • docs/_docs/developer-guide/scheduler.md
  • docs/_docs/getting-started/deploying-scheduler.md
  • docs/news/2025-12-12-distributed-scheduler-release.md
  • proto/src/show.proto
  • pycue/opencue/wrappers/show.py
  • pycue/tests/wrappers/test_show.py
  • rust/config/scheduler.yaml
  • rust/crates/scheduler/Cargo.toml
  • rust/crates/scheduler/src/accounting/booking_delta.rs
  • rust/crates/scheduler/src/accounting/bootstrap.rs
  • rust/crates/scheduler/src/accounting/dao.rs
  • rust/crates/scheduler/src/accounting/error.rs
  • rust/crates/scheduler/src/accounting/limit_reseed.rs
  • rust/crates/scheduler/src/accounting/lua.rs
  • rust/crates/scheduler/src/accounting/managed_shows.rs
  • rust/crates/scheduler/src/accounting/mod.rs
  • rust/crates/scheduler/src/accounting/recompute.rs
  • rust/crates/scheduler/src/accounting/redis_client.rs
  • rust/crates/scheduler/src/cluster.rs
  • rust/crates/scheduler/src/cluster_key.rs
  • rust/crates/scheduler/src/config/mod.rs
  • rust/crates/scheduler/src/dao/cluster_dao.rs
  • rust/crates/scheduler/src/dao/frame_dao.rs
  • rust/crates/scheduler/src/dao/host_dao.rs
  • rust/crates/scheduler/src/dao/job_dao.rs
  • rust/crates/scheduler/src/dao/layer_dao.rs
  • rust/crates/scheduler/src/dao/mod.rs
  • rust/crates/scheduler/src/dao/proc_dao.rs
  • rust/crates/scheduler/src/dao/resource_accounting_dao.rs
  • rust/crates/scheduler/src/host_cache/actor.rs
  • rust/crates/scheduler/src/host_cache/cache.rs
  • rust/crates/scheduler/src/host_cache/messages.rs
  • rust/crates/scheduler/src/host_cache/mod.rs
  • rust/crates/scheduler/src/lib.rs
  • rust/crates/scheduler/src/main.rs
  • rust/crates/scheduler/src/metrics/mod.rs
  • rust/crates/scheduler/src/models/core_size.rs
  • rust/crates/scheduler/src/models/frame.rs
  • rust/crates/scheduler/src/models/layer.rs
  • rust/crates/scheduler/src/models/mod.rs
  • rust/crates/scheduler/src/models/subscription.rs
  • rust/crates/scheduler/src/models/virtual_proc.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/actor.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/error.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/mod.rs
  • rust/crates/scheduler/src/pipeline/entrypoint.rs
  • rust/crates/scheduler/src/pipeline/layer_permit.rs
  • rust/crates/scheduler/src/pipeline/matcher.rs
  • rust/crates/scheduler/src/pipeline/mod.rs
  • rust/crates/scheduler/src/pipeline/placement.rs
  • rust/crates/scheduler/src/resource_accounting.rs
  • rust/crates/scheduler/tests/redis_integration.rs
  • rust/crates/scheduler/tests/smoke_tests.rs
  • rust/crates/scheduler/tests/util.rs

💤 Files with no reviewable changes (5)

This review was done before the branch has been properly rebased with master and contained many comments unrelated to the actual work.

@DiegoTavares DiegoTavares marked this pull request as ready for review June 11, 2026 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant