feat: capability-aware scheduler + AgenticTaskWatcher + stub executor (Foreman v0.1 M2) #504
Merged
Defilan added a commit to Defilan/LLMKube that referenced this pull request on May 20, 2026
…ecated Result.Requeue

CI on PR defilantech#504 caught that the M0 stub-smoke test 'reconciles an existing task without erroring and without mutating status' was still in the suite even though the M2 commit replaced the M0 stub reconciler with the real scheduler (which does mutate status: empty -> Pending). Local make test passed because the offending Ginkgo It picked a fresh DB on each run and the M2 mutation surfaced only on the CI run. The PR body claimed I had rewritten the test; I had described the rewrite but not done it. Fixing both.

This commit rewrites the agentictask_controller envtest suite to exercise the actual M2 contract:

- empty status.phase -> Pending (initial normalization)
- Pending + no fit -> requeue with no status mutation
- Pending + matching FleetNode -> Scheduled with assignedNode set
- Pending + dep Failed -> cascade-fail (phase=Failed, verdict=INCOMPLETE, condition Failed/UpstreamFailed)
- Pending + dep pre-terminal -> requeue with no status mutation
- Scheduled/Running/... -> no-op (FleetAgent's domain)

Plus the existing NotFound-no-requeue test from the M0+M1 envtest harness. 7 specs in the AgenticTaskReconciler Describe; existing Workload and FleetNode stub-smoke specs unchanged. 11/11 pass locally on darwin and linux/amd64.

Also fixes setInitialPending to return ctrl.Result{} after the status patch instead of the deprecated ctrl.Result{Requeue: true}. The For(AgenticTask) watch re-enqueues the object on the status patch we just did, so no explicit requeue is needed. golangci-lint's SA1019 didn't flag the field write in a struct literal, but the field is deprecated upstream all the same.

Signed-off-by: Christopher Maher <chris@mahercode.io>
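The contract listed in this commit message reads as a small decision table. The following is a minimal Go sketch of that table, not the reconciler's actual code. The `Action` labels and the `decide` function are hypothetical names, and treating only Succeeded and Failed as terminal phases is an assumption made for illustration.

```go
package main

import "fmt"

// Phase mirrors the AgenticTask phases named in the commit message.
type Phase string

const (
	PhaseEmpty     Phase = ""
	PhasePending   Phase = "Pending"
	PhaseScheduled Phase = "Scheduled"
	PhaseRunning   Phase = "Running"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// Action is a hypothetical label for what a single reconcile pass decides.
type Action string

const (
	ActionSetPending  Action = "set-pending"
	ActionRequeue     Action = "requeue"
	ActionSchedule    Action = "schedule"
	ActionCascadeFail Action = "cascade-fail"
	ActionNoop        Action = "noop"
)

// decide maps the task phase, its dependencies' phases, and whether any
// node fits to one action. Assumption: a dependency is "pre-terminal"
// when it is neither Succeeded nor Failed.
func decide(phase Phase, deps []Phase, nodeFits bool) Action {
	switch phase {
	case PhaseEmpty:
		return ActionSetPending // initial normalization
	case PhasePending:
		// handled below
	default:
		return ActionNoop // Scheduled/Running/... belong to the node agent
	}

	waiting := false
	for _, d := range deps {
		if d == PhaseFailed {
			return ActionCascadeFail // a failed upstream wins over everything else
		}
		if d != PhaseSucceeded {
			waiting = true
		}
	}
	if waiting || !nodeFits {
		return ActionRequeue // no status mutation in either case
	}
	return ActionSchedule
}

func main() {
	fmt.Println(decide(PhaseEmpty, nil, false))
	fmt.Println(decide(PhasePending, []Phase{PhaseRunning}, true))
	fmt.Println(decide(PhasePending, []Phase{PhaseSucceeded}, true))
}
```

Checking for a Failed dependency before checking for a pre-terminal one means a task with one failed and one still-running dependency fails immediately instead of waiting. Whether the real reconciler orders the checks this way is not stated in the commit message.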
… (v0.1 M2)

M2 proves the Foreman dispatch loop end-to-end without depending on the native agent loop (M3): a Pending AgenticTask is scheduled to a Ready FleetNode by capability match, claimed by that node's foreman-agent, handed to the configured Executor, and patched to terminal status with the structured foreman.v1 Result envelope serialized into status.result. Refs defilantech#500.

Scheduler (internal/foreman/controller/agentictask_controller.go):

- Normalizes empty phase to Pending so the rest of the logic only branches on enum values it knows about.
- Cascade-fails the task (phase=Failed, reason=UpstreamFailed) when any dependsOn target is itself Failed.
- Waits with requeue while any dependsOn target is pre-terminal.
- First-fit FleetNode picker: alphabetical-by-name over Ready nodes whose advertised capability satisfies the task's RequiredCapability (accelerator family, MinRAMGB <= AvailableRAMGB, MinContextTokens <= MaxContextTokens, NodeSelector subset of node labels). v0.2 may add least-loaded or LRU.
- Watches FleetNode and re-enqueues every Pending AgenticTask on each FleetNode event so a node going Ready dispatches immediately rather than waiting for the requeue-after timer.

Node-side watcher (pkg/foreman/agent/watcher.go):

- Polls AgenticTasks every --task-poll-interval (default 5s) for the set assigned to this node in phase=Scheduled.
- Claims via status merge-patch with optimistic concurrency (race losers see the new phase on the next poll).
- Runs Executor.Execute in a goroutine; v0.1 keeps one task per node in flight via a mutex'd inflight slot.
- Re-fetches the task before the terminal patch to avoid clobbering concurrent edits. On executor error, patches phase=Failed with verdict=INCOMPLETE and the error in the Completed condition.
- Returns ErrWatcherStalled after three consecutive List() failures, mirroring pkg/agent/watcher.go's supervisor-restart pattern.

Executor abstraction (pkg/foreman/agent/{executor,executor_stub,result}.go):

- Executor interface: Kind() string + Execute(ctx, task) (*Result, error).
- Result is the foreman.v1 envelope shared between executors and consumers downstream: SchemaVersion, Kind, Verdict, Summary, Extra (kind-discriminated), ElapsedSec.
- AgenticTaskVerdict constants added to api/foreman/v1alpha1 (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD schema change; the enum validator was already on the type.
- StubExecutor: sleeps for --stub-sleep (default 10s) and returns a GO-verdict synthetic Result. Used to validate the dispatch loop today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (cmd/foreman-agent/main.go):

- Adds --task-poll-interval, --task-namespace, --stub-sleep flags.
- Runs the registrar and the watcher concurrently via errgroup.
- Updated startup log to reflect the M2 stub executor.

Verification on kind-llmkube-local:

- Build, vet, lint clean. make test passes.
- foreman-operator + foreman-agent + apply of examples/foreman/m2-stub-task.yaml.
- Lifecycle observed: empty -> Pending -> Scheduled (assignedNode=m5-max, condition Scheduled=True/FleetNodeAssigned) -> Running (condition Running=True/Claimed) -> Succeeded after the stub's 8s sleep (condition Completed=True/ExecutorSucceeded; verdict=GO; status.result populated with foreman.v1 envelope).
- No regressions in the LLMKube core inference path (envtest suite unchanged).

Signed-off-by: Christopher Maher <chris@mahercode.io>
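The first-fit picker described in this commit message can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the code in `agentictask_controller.go`: the `Node` struct, its field names, the `fits` and `pickNode` helpers, and the accelerator strings are all hypothetical. The busy-node check reflects the PR description's note that nodes with a non-empty `status.currentTask` are skipped.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical flattened view of a FleetNode's advertised state.
type Node struct {
	Name             string
	Ready            bool
	Accelerator      string
	AvailableRAMGB   int
	MaxContextTokens int
	Labels           map[string]string
	CurrentTask      string // non-empty means the node is busy
}

// RequiredCapability is a hypothetical mirror of the task-side requirement.
type RequiredCapability struct {
	Accelerator      string // assumption: empty means "any family"
	MinRAMGB         int
	MinContextTokens int
	NodeSelector     map[string]string
}

// fits reports whether a node is Ready, idle, and satisfies every requirement.
func fits(n Node, r RequiredCapability) bool {
	if !n.Ready || n.CurrentTask != "" {
		return false
	}
	if r.Accelerator != "" && r.Accelerator != n.Accelerator {
		return false
	}
	if r.MinRAMGB > n.AvailableRAMGB || r.MinContextTokens > n.MaxContextTokens {
		return false
	}
	// NodeSelector must be a subset of the node's labels.
	for k, want := range r.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// pickNode returns the alphabetically first node that fits, if any.
func pickNode(nodes []Node, r RequiredCapability) (string, bool) {
	sorted := append([]Node(nil), nodes...) // copy so the caller's order is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for _, n := range sorted {
		if fits(n, r) {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "m5-max", Ready: true, Accelerator: "metal", AvailableRAMGB: 128, MaxContextTokens: 131072},
		{Name: "a100-box", Ready: true, Accelerator: "cuda", AvailableRAMGB: 80, MaxContextTokens: 32768},
	}
	req := RequiredCapability{Accelerator: "metal", MinRAMGB: 64, MinContextTokens: 32768}
	name, ok := pickNode(nodes, req)
	fmt.Println(name, ok)
}
```

Sorting by name before scanning makes the choice deterministic across reconciles, which keeps test expectations stable. It does nothing to balance load, which matches the commit message's note that least-loaded or LRU selection is left for v0.2.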
6b90234 to
cda54e0
What
Closes the Pending -> Scheduled -> Running -> Succeeded dispatch loop for the Foreman v0.1 add-on. Adds the capability-aware scheduler to the foreman-operator; the AgenticTaskWatcher, Executor abstraction, and StubExecutor to foreman-agent; and the foreman.v1 Result envelope they share. End-to-end demoed against kind-llmkube-local.
Why
Refs #500.
M2 is the architectural validation milestone for Foreman: it proves the CRDs from M0 and the FleetNode heartbeat from M1 actually compose into a working dispatch loop, without yet depending on the M3 native agent loop. The StubExecutor stands in for M3's real native agent loop so the plumbing is observable end to end today.
M2 also locked down the function-calling substrate decision the v0.1 plan rests on. Smoke-tested the live Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP endpoint serving on Apple Silicon Metal: 13/15 multi-turn flows pass cleanly; the 2 failures are a known upstream llama.cpp issue (ggml-org/llama.cpp#22072, comment with our repro added) where the strict tool-call argument parser rejects truncated JSON from the model. M3's native loop will treat this as a recoverable transient with bounded retry, getting effective success to ~99.78%. No change to M2.
How
Scheduler (`internal/foreman/controller/agentictask_controller.go`) evolved from the M0 logging stub into the real reconciler:

- Normalizes an empty phase to `Pending` so subsequent logic only branches on enum values it owns.
- Cascade-fails the task (phase=Failed, reason=UpstreamFailed) when any `dependsOn` target is in phase=Failed.
- Waits with requeue while any `dependsOn` target is pre-terminal.
- First-fit FleetNode picker: alphabetical-by-name over Ready nodes whose advertised capability satisfies the task's `RequiredCapability` (accelerator family, `MinRAMGB <= AvailableRAMGB`, `MinContextTokens <= MaxContextTokens`, `NodeSelector` subset of node labels). Skips nodes whose `status.currentTask` is non-empty (v0.1 worker-concurrency=1).
- Watches `FleetNode` in addition to `AgenticTask`; re-enqueues all Pending tasks on each FleetNode event so a node going Ready dispatches immediately rather than waiting for the requeue-after timer.

Node-side watcher (`pkg/foreman/agent/watcher.go`):

- Polls every `--task-poll-interval` (default 5s) for AgenticTasks where `status.assignedNode == myNodeName && phase == Scheduled`.
- Claims via status merge-patch with optimistic concurrency (race losers see the new phase on the next poll).
- Runs `Executor.Execute` in a goroutine; v0.1 keeps one task per node in flight via a mutex'd `inflight` slot.
- Re-fetches the task before the terminal patch to avoid clobbering concurrent edits. On executor error, patches `phase=Failed` with `verdict=INCOMPLETE` and the error in the `Completed` condition. Defensive contract-violation path on nil error + nil Result.
- Returns `ErrWatcherStalled` after `MaxConsecutiveFailures` (default 3) `List()` failures in a row, mirroring `pkg/agent/watcher.go`'s supervisor-restart pattern.

Executor abstraction (`pkg/foreman/agent/{executor,executor_stub,result}.go`):

- `Executor` interface: `Kind() string` + `Execute(ctx, task) (*Result, error)`. Small interface so M3 (native agent loop), M4 (gate-job executor), and any future kind plug in behind the same shape.
- `Result` is the `foreman.v1` envelope shared between executors and any downstream consumer (planner evaluator, future DecisionLog, the human reviewer): SchemaVersion, Kind, Verdict, Summary, Extra, ElapsedSec.
- `AgenticTaskVerdict` constants added to `api/foreman/v1alpha1` (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD schema change; the enum validator was already on the type.
- `StubExecutor`: sleeps for `--stub-sleep` (default 10s) and returns a synthetic GO-verdict Result. Used to validate dispatch end-to-end today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (`cmd/foreman-agent/main.go`):

- Adds the `--task-poll-interval`, `--task-namespace`, and `--stub-sleep` flags.
- Runs the registrar and the watcher concurrently via `errgroup`. On clean shutdown via SIGTERM both return nil; binary exits cleanly.

Tests: the M0 stub-smoke test in `agentictask_controller_test.go` is rewritten to exercise the M2 contract (Pending → Scheduled with capability matching, cascade-fail, dependency waits, busy-node skip). Existing M0+M1 tests untouched. Envtest harness from #501 carries forward.

Verification
Live demo on `kind-llmkube-local` with `foreman-operator` + `foreman-agent` (`--heartbeat-interval=3s --task-poll-interval=2s --stub-sleep=8s`) + apply of `examples/foreman/m2-stub-task.yaml`. Observed the full conditions trail in order: Scheduled=True/FleetNodeAssigned, Running=True/Claimed, Completed=True/ExecutorSucceeded.

Final `status.result`:

    {
      "schemaVersion": "foreman.v1",
      "kind": "stub",
      "verdict": "GO",
      "elapsedSec": 8.001,
      "extra": { "taskKind": "freeform", "agentName": "stub", "modelRef": "" },
      "summary": "stub executor slept 8.001s on task default/m2-stub-demo"
    }

`make test` passes on both arches. `make lint` reports 0 issues on darwin and on `GOOS=linux` (addresses the cross-arch gap captured in #503). No regressions in the LLMKube core inference flow.

Checklist
- `make test` passes locally (both arches)
- `make lint` passes locally (both arches)
- Commits signed off (`git commit -s`) per DCO

What's next
M3 (#500): build the native agent loop in Go using OpenAI function calling, land the Agent CRD, ship the `minimax-coder` Agent. The retry policy for the upstream llama.cpp tools-strict-parse 500 (#22072) is the only architectural decision M3 takes on top of what M2 already proves works.