feat: capability-aware scheduler + AgenticTaskWatcher + stub executor (Foreman v0.1 M2) by Defilan · Pull Request #504 · defilantech/LLMKube

Defilan · 2026-05-20T08:07:06Z

What

Closes the Pending -> Scheduled -> Running -> Succeeded dispatch loop for the Foreman v0.1 add-on. Adds the capability-aware scheduler to the foreman-operator; the AgenticTaskWatcher, Executor abstraction, and StubExecutor to foreman-agent; and the foreman.v1 Result envelope they share. Demoed end to end against kind-llmkube-local.

Why

Refs #500.

M2 is the architectural validation milestone for Foreman: it proves the CRDs from M0 and the FleetNode heartbeat from M1 actually compose into a working dispatch loop, without yet depending on the M3 native agent loop. The StubExecutor stands in for M3's real native agent loop so the plumbing is observable end to end today.

M2 also locked down the function-calling substrate decision the v0.1 plan rests on. Smoke-tested the live Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP endpoint serving on Apple Silicon Metal: 13/15 multi-turn flows pass cleanly; the 2 failures are a known upstream llama.cpp issue (ggml-org/llama.cpp#22072, comment with our repro added) where the strict tool-call argument parser rejects truncated JSON from the model. M3's native loop will treat this as a recoverable transient with bounded retry, getting effective success to ~99.78%. No change to M2.

How

Scheduler (internal/foreman/controller/agentictask_controller.go) evolved from the M0 logging stub into the real reconciler:

- Normalizes empty phase to Pending so subsequent logic only branches on enum values it owns.
- Cascade-fails (phase=Failed, reason=UpstreamFailed) when any dependsOn target is in phase=Failed.
- Waits with requeue while any dependsOn target is pre-terminal.
- First-fit FleetNode picker: alphabetical-by-name over Ready nodes whose capability satisfies the task's RequiredCapability (accelerator family, MinRAMGB <= AvailableRAMGB, MinContextTokens <= MaxContextTokens, NodeSelector subset of node labels). Skips nodes whose status.currentTask is non-empty (v0.1 worker-concurrency=1).
- Watches FleetNode in addition to AgenticTask; re-enqueues all Pending tasks on each FleetNode event so a node going Ready dispatches immediately rather than waiting for the requeue-after timer.
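The first-fit picker described above can be sketched roughly as follows. This is not the controller's actual code: the structs are simplified, hypothetical stand-ins for the CRD types (field names follow the PR text where it gives them), the node values in `main` are made up, and treating an empty accelerator requirement as "any" is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// RequiredCapability and FleetNode are simplified, hypothetical stand-ins
// for the real CRD types.
type RequiredCapability struct {
	Accelerator      string // assumption: empty means "any family"
	MinRAMGB         int
	MinContextTokens int
	NodeSelector     map[string]string
}

type FleetNode struct {
	Name             string
	Ready            bool
	Accelerator      string
	AvailableRAMGB   int
	MaxContextTokens int
	Labels           map[string]string
	CurrentTask      string // non-empty means the node is busy
}

// satisfies reports whether node is Ready, idle, and meets every requirement.
func satisfies(node FleetNode, req RequiredCapability) bool {
	if !node.Ready || node.CurrentTask != "" {
		return false
	}
	if req.Accelerator != "" && req.Accelerator != node.Accelerator {
		return false
	}
	if req.MinRAMGB > node.AvailableRAMGB || req.MinContextTokens > node.MaxContextTokens {
		return false
	}
	// NodeSelector must be a subset of the node's labels.
	for k, want := range req.NodeSelector {
		if got, ok := node.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// pickNode returns the name of the first fitting node in alphabetical order,
// or "" when no node fits.
func pickNode(nodes []FleetNode, req RequiredCapability) string {
	sorted := append([]FleetNode(nil), nodes...) // copy so the caller's slice is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for _, n := range sorted {
		if satisfies(n, req) {
			return n.Name
		}
	}
	return ""
}

func main() {
	nodes := []FleetNode{
		{Name: "m5-max", Ready: true, Accelerator: "metal", AvailableRAMGB: 96, MaxContextTokens: 131072},
		{Name: "a-busy", Ready: true, Accelerator: "metal", AvailableRAMGB: 96, MaxContextTokens: 131072, CurrentTask: "default/other"},
	}
	req := RequiredCapability{Accelerator: "metal", MinRAMGB: 32, MinContextTokens: 32768}
	fmt.Println(pickNode(nodes, req)) // a-busy sorts first but is skipped because it already has a task
}
```

Sorting by name keeps the choice deterministic, so repeated reconciles of the same Pending task pick the same node as long as the fleet state is unchanged.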

Node-side watcher (pkg/foreman/agent/watcher.go):

- Polls every --task-poll-interval (default 5s) for AgenticTasks where status.assignedNode == myNodeName && phase == Scheduled.
- Claims via status merge-patch with optimistic concurrency; race losers see the new phase on the next poll.
- Runs Executor.Execute in a goroutine; v0.1 keeps one task per node in flight via a mutex-guarded in-flight slot.
- Re-fetches the task before the terminal patch to avoid clobbering concurrent edits. On executor error, patches phase=Failed with verdict=INCOMPLETE and records the error in the Completed condition. A defensive contract-violation path handles an executor that returns both a nil error and a nil Result.
- Returns ErrWatcherStalled after MaxConsecutiveFailures (default 3) List() failures in a row, mirroring pkg/agent/watcher.go's supervisor-restart pattern.

Executor abstraction (pkg/foreman/agent/{executor,executor_stub,result}.go):

- Executor interface: Kind() string + Execute(ctx, task) (*Result, error). Small interface so M3 (native agent loop), M4 (gate-job executor), and any future kind plug in behind the same shape.
- Result is the foreman.v1 envelope shared between executors and any downstream consumer (planner evaluator, future DecisionLog, the human reviewer): SchemaVersion, Kind, Verdict, Summary, Extra, ElapsedSec.
- AgenticTaskVerdict constants added to api/foreman/v1alpha1 (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD schema change; the enum validator was already on the type.
- StubExecutor: sleeps for --stub-sleep (default 10s) and returns a synthetic GO-verdict Result. Used to validate dispatch end-to-end today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (cmd/foreman-agent/main.go):

- New flags: --task-poll-interval, --task-namespace, --stub-sleep.
- Runs the registrar (M1) and the watcher (M2) concurrently via errgroup. On clean shutdown via SIGTERM both return nil and the binary exits cleanly.

Tests: the M0 stub-smoke test in agentictask_controller_test.go is rewritten to exercise the M2 contract (Pending → Scheduled with capability matching, cascade-fail, dependency waits, busy-node skip). Existing M0+M1 tests untouched. Envtest harness from #501 carries forward.

Verification

Live demo on kind-llmkube-local with foreman-operator + foreman-agent (--heartbeat-interval=3s --task-poll-interval=2s --stub-sleep=8s) + apply of examples/foreman/m2-stub-task.yaml. Observed the full conditions trail in order:

Scheduled  True  FleetNodeAssigned  scheduled to FleetNode "m5-max"
Running    True  Claimed            claimed by m5-max
Completed  True  ExecutorSucceeded  stub executor slept 8.001s ...

Final status.result:

{
  "schemaVersion": "foreman.v1",
  "kind":          "stub",
  "verdict":       "GO",
  "elapsedSec":    8.001,
  "extra":         { "taskKind": "freeform", "agentName": "stub", "modelRef": "" },
  "summary":       "stub executor slept 8.001s on task default/m2-stub-demo"
}
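A downstream consumer could decode this envelope along the following lines. The JSON keys are taken from the output above; the Go struct, the `parseResult` helper, and the rejection of unknown schema versions are illustrative assumptions, not the contents of result.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result maps the JSON keys shown in status.result onto Go fields.
type Result struct {
	SchemaVersion string         `json:"schemaVersion"`
	Kind          string         `json:"kind"`
	Verdict       string         `json:"verdict"`
	Summary       string         `json:"summary"`
	Extra         map[string]any `json:"extra,omitempty"`
	ElapsedSec    float64        `json:"elapsedSec"`
}

// sample is the status.result payload from the live demo.
const sample = `{
  "schemaVersion": "foreman.v1",
  "kind":          "stub",
  "verdict":       "GO",
  "elapsedSec":    8.001,
  "extra":         { "taskKind": "freeform", "agentName": "stub", "modelRef": "" },
  "summary":       "stub executor slept 8.001s on task default/m2-stub-demo"
}`

// parseResult decodes an envelope and rejects schema versions it does not know.
// The version check is an assumption about what a cautious consumer would do.
func parseResult(raw []byte) (*Result, error) {
	var r Result
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("decode result: %w", err)
	}
	if r.SchemaVersion != "foreman.v1" {
		return nil, fmt.Errorf("unsupported schemaVersion %q", r.SchemaVersion)
	}
	return &r, nil
}

func main() {
	r, err := parseResult([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Verdict, r.ElapsedSec, r.Extra["taskKind"])
}
```

Keeping Extra as a free-form map lets each executor kind attach its own fields while consumers that only care about the verdict ignore it.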

make test passes on both arches. make lint reports 0 issues on darwin and on GOOS=linux (addresses the cross-arch gap captured in #503). No regressions in the LLMKube core inference flow.

Checklist

Tests added/updated (M0 stub test rewritten as M2 scheduler test)
make test passes locally (both arches)
make lint passes locally (both arches)
Commit messages follow conventional commits
All commits are signed off (git commit -s) per DCO
Documentation updated — README + chart README land alongside M6

What's next

M3 (#500): build the native agent loop in Go using OpenAI function calling, land the Agent CRD, ship the minimax-coder Agent. The retry policy for the upstream llama.cpp tools-strict-parse 500 (#22072) is the only architectural decision M3 takes on top of what M2 already proves works.

…ecated Result.Requeue

CI on PR defilantech#504 caught that the M0 stub-smoke test 'reconciles an existing task without erroring and without mutating status' was still in the suite even though the M2 commit replaced the M0 stub reconciler with the real scheduler (which does mutate status: empty -> Pending). Local make test passed because the offending Ginkgo It picked a fresh DB on each run and the M2 mutation surfaced only on the CI run. The PR body claimed I had rewritten the test; I had described the rewrite but not done it. Fixing both.

This commit rewrites the agentictask_controller envtest suite to exercise the actual M2 contract:

- empty status.phase -> Pending (initial normalization)
- Pending + no fit -> requeue with no status mutation
- Pending + matching FleetNode -> Scheduled with assignedNode set
- Pending + dep Failed -> cascade-fail (phase=Failed, verdict=INCOMPLETE, condition Failed/UpstreamFailed)
- Pending + dep pre-terminal -> requeue with no status mutation
- Scheduled/Running/... -> no-op (FleetAgent's domain)

Plus the existing NotFound-no-requeue test from the M0+M1 envtest harness. 7 specs in the AgenticTaskReconciler Describe; existing Workload and FleetNode stub-smoke specs unchanged. 11/11 pass locally on darwin and linux/amd64.

Also fixes setInitialPending to return ctrl.Result{} after the status patch instead of the deprecated ctrl.Result{Requeue: true}. The For(AgenticTask) watch re-enqueues the object on the status patch we just did, so no explicit requeue is needed. golangci-lint's SA1019 didn't flag the field write in a struct literal, but the field is deprecated upstream all the same.

Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov · 2026-05-20T15:39:51Z

Codecov Report

❌ Patch coverage is 21.65605% with 246 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pkg/foreman/agent/watcher.go	0.00%	130 Missing ⚠️
...ernal/foreman/controller/agentictask_controller.go	53.96%	40 Missing and 18 partials ⚠️
cmd/foreman-agent/main.go	0.00%	27 Missing ⚠️
pkg/foreman/agent/executor_stub.go	0.00%	24 Missing ⚠️
pkg/foreman/agent/result.go	0.00%	7 Missing ⚠️


… (v0.1 M2)

M2 proves the Foreman dispatch loop end-to-end without depending on the native agent loop (M3): a Pending AgenticTask is scheduled to a Ready FleetNode by capability match, claimed by that node's foreman-agent, handed to the configured Executor, and patched to terminal status with the structured foreman.v1 Result envelope serialized into status.result.

Refs defilantech#500.

Scheduler (internal/foreman/controller/agentictask_controller.go):

- Normalizes empty phase to Pending so the rest of the logic only branches on enum values it knows about.
- Cascade-fails the task (phase=Failed, reason=UpstreamFailed) when any dependsOn target is itself Failed.
- Waits with requeue while any dependsOn target is pre-terminal.
- First-fit FleetNode picker: alphabetical-by-name over Ready nodes whose advertised capability satisfies the task's RequiredCapability (accelerator family, MinRAMGB <= AvailableRAMGB, MinContextTokens <= MaxContextTokens, NodeSelector subset of node labels). v0.2 may add least-loaded or LRU.
- Watches FleetNode and re-enqueues every Pending AgenticTask on each FleetNode event so a node going Ready dispatches immediately rather than waiting for the requeue-after timer.

Node-side watcher (pkg/foreman/agent/watcher.go):

- Polls AgenticTasks every --task-poll-interval (default 5s) for the set assigned to this node in phase=Scheduled.
- Claims via status merge-patch with optimistic concurrency (race losers see the new phase on the next poll).
- Runs Executor.Execute in a goroutine; v0.1 keeps one task per node in flight via a mutex'd inflight slot.
- Re-fetches the task before the terminal patch to avoid clobbering concurrent edits. On executor error, patches phase=Failed with verdict=INCOMPLETE and the error in the Completed condition.
- Returns ErrWatcherStalled after three consecutive List() failures, mirroring pkg/agent/watcher.go's supervisor-restart pattern.

Executor abstraction (pkg/foreman/agent/{executor,executor_stub,result}.go):

- Executor interface: Kind() string + Execute(ctx, task) (*Result, error).
- Result is the foreman.v1 envelope shared between executors and consumers downstream: SchemaVersion, Kind, Verdict, Summary, Extra (kind-discriminated), ElapsedSec.
- AgenticTaskVerdict constants added to api/foreman/v1alpha1 (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD schema change; the enum validator was already on the type.
- StubExecutor: sleeps for --stub-sleep (default 10s) and returns a GO-verdict synthetic Result. Used to validate the dispatch loop today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (cmd/foreman-agent/main.go):

- Adds --task-poll-interval, --task-namespace, --stub-sleep flags.
- Runs the registrar and the watcher concurrently via errgroup.
- Updated startup log to reflect the M2 stub executor.

Verification on kind-llmkube-local:

- Build, vet, lint clean. make test passes.
- foreman-operator + foreman-agent + apply of examples/foreman/m2-stub-task.yaml.
- Lifecycle observed: empty -> Pending -> Scheduled (assignedNode=m5-max, condition Scheduled=True/FleetNodeAssigned) -> Running (condition Running=True/Claimed) -> Succeeded after the stub's 8s sleep (condition Completed=True/ExecutorSucceeded; verdict=GO; status.result populated with foreman.v1 envelope).
- No regressions in the LLMKube core inference path (envtest suite unchanged).

Signed-off-by: Christopher Maher <chris@mahercode.io>


Defilan added the enhancement (New feature or request) and area/foreman (Foreman: the agentic fleet orchestrator add-on) labels May 20, 2026

Defilan added 2 commits May 20, 2026 09:05

Defilan force-pushed the feat/foreman-scheduler branch from 6b90234 to cda54e0 Compare May 20, 2026 16:06

Defilan merged commit 74b3d6e into defilantech:main May 20, 2026
21 checks passed

github-actions Bot mentioned this pull request May 20, 2026: chore: release 0.7.10 #497 (Open)

Defilan mentioned this pull request May 20, 2026: [BUG] metal-agent does not reconcile pre-existing zero-replica InferenceServices on startup #506 (Open, 2 tasks)

Defilan merged 2 commits into defilantech:main from Defilan:feat/foreman-scheduler
