delegate-to-ai: role-alias route table + local-offload criteria
Context
The local MLX serving stack (llama-swap on 127.0.0.1:11434, Bifrost gateway at
localhost:30080) now serves a fast, crash-free MoE workhorse (~80 tok/s decode,
stable under 4-way concurrency, measured 2026-06-09 on the M4 Max). The
delegate-to-ai skill's route table still lists stale physical model ids
(Qwen3-235B-A22B, Qwen3.5-122B, Qwen3.5-27B) that are no longer registered.
Changes
-
Route table → role aliases. Replace physical model ids with the llama-swap role
aliases, which survive serving-side model swaps:
mlx-local/default — general summarize/extract/classify workhorse
mlx-local/quickest — shortest-latency small model
mlx-local/coding — code-leaning tasks
- Paths: Bifrost
http://localhost:30080/v1/chat/completions, or PAL chat with
mlx-local/default.
-
Add a "Local offload criteria" section, e.g.:
Route to the local model when ALL hold:
- task is summarization / extraction / classification / boilerplate / bulk
mechanical transforms,
- input fits ~30K chars,
- no multi-step reasoning or correctness-critical judgment required.
Everything else stays on cloud models.
-
Keep cloud routing rules unchanged.
Verification
curl -s http://localhost:30080/v1/chat/completions -d '{"model":"mlx-local/default",...}'
returns 200 with a completion in a few seconds.
delegate-to-ai: role-alias route table + local-offload criteria
Context
The local MLX serving stack (llama-swap on
127.0.0.1:11434, Bifrost gateway atlocalhost:30080) now serves a fast, crash-free MoE workhorse (~80 tok/s decode,stable under 4-way concurrency, measured 2026-06-09 on the M4 Max). The
delegate-to-aiskill's route table still lists stale physical model ids(Qwen3-235B-A22B, Qwen3.5-122B, Qwen3.5-27B) that are no longer registered.
Changes
Route table → role aliases. Replace physical model ids with the llama-swap role
aliases, which survive serving-side model swaps:
mlx-local/default— general summarize/extract/classify workhorsemlx-local/quickest— shortest-latency small modelmlx-local/coding— code-leaning taskshttp://localhost:30080/v1/chat/completions, or PALchatwithmlx-local/default.Add a "Local offload criteria" section, e.g.:
Keep cloud routing rules unchanged.
Verification
curl -s http://localhost:30080/v1/chat/completions -d '{"model":"mlx-local/default",...}'returns 200 with a completion in a few seconds.