Feature Description
Add timeout (*metav1.Duration) to ModelRouter.spec.rules[] and
ModelRouter.spec.backends[]. At dispatch time the proxy applies
the timeout via a per-request context.WithTimeout with the
resolution order rule.timeout || backend.timeout || proxy default.
Problem Statement
As a platform operator with rules of mixed latency profiles in a
single ModelRouter, I want each rule and backend to have its own
timeout budget, so PII rules can fail-fast while long reasoning
rules can be patient.
Today there is exactly one proxy-side timeout
(ResponseHeaderTimeout, currently 30s). This collapses three
operationally distinct profiles into one:
- Strict / fail-fast rules (sensitive data, health, sync agent
steps): want sub-10s caps. Hung backend should fail fast so the
caller falls over to its own retry.
- Normal interactive chat (default route): 30-60s.
- Deliberately slow reasoning (
taskComplexity: complex,
Anthropic Claude thinking, DeepSeek-V4 reasoning): 120s+.
Per-backend overrides matter too: in-cluster Gemma 12B at ~30tok/s
and a remote Anthropic API with global LB in front have different
P99 envelopes for the same prompt.
Proposed Solution
Per-rule
rules:
- name: pii-stays-local
match: { dataClassification: [pii] }
route: { backends: [local-chat] }
failClosed: true
timeout: 8s # NEW
- name: complex-to-chat
match: { taskComplexity: complex }
route:
backends: [local-chat, local-coder]
strategy: primary-fallback
timeout: 120s # NEW
spec.rules[].timeout (*metav1.Duration, optional). Overrides
the proxy default for any dispatch matched by this rule. Applied
via per-request context.WithTimeout, not by mutating the shared
transport.
Per-backend
backends:
- name: cloud-anthropic
external:
provider: anthropic
model: claude-opus-4-7
tier: cloud
timeout: 180s # NEW
spec.backends[].timeout (*metav1.Duration, optional).
Resolution order
rule.timeout wins over backend.timeout wins over the proxy
default (set in #457).
Status surface
status.rules[].observedTimeout reflects what the proxy is
actually using after resolution. Helps catch CRD-to-runtime drift
via kubectl describe.
Alternatives Considered
Additional Context
Priority
Willingness to Contribute
Feature Description
Add
timeout(*metav1.Duration) toModelRouter.spec.rules[]andModelRouter.spec.backends[]. At dispatch time the proxy appliesthe timeout via a per-request
context.WithTimeoutwith theresolution order
rule.timeout || backend.timeout || proxy default.Problem Statement
Today there is exactly one proxy-side timeout
(
ResponseHeaderTimeout, currently 30s). This collapses threeoperationally distinct profiles into one:
steps): want sub-10s caps. Hung backend should fail fast so the
caller falls over to its own retry.
taskComplexity: complex,Anthropic Claude thinking, DeepSeek-V4 reasoning): 120s+.
Per-backend overrides matter too: in-cluster Gemma 12B at ~30tok/s
and a remote Anthropic API with global LB in front have different
P99 envelopes for the same prompt.
Proposed Solution
Per-rule
spec.rules[].timeout(*metav1.Duration, optional). Overridesthe proxy default for any dispatch matched by this rule. Applied
via per-request
context.WithTimeout, not by mutating the sharedtransport.
Per-backend
spec.backends[].timeout(*metav1.Duration, optional).Resolution order
rule.timeoutwins overbackend.timeoutwins over the proxydefault (set in #457).
Status surface
status.rules[].observedTimeoutreflects what the proxy isactually using after resolution. Helps catch CRD-to-runtime drift
via
kubectl describe.Alternatives Considered
--response-header-timeout), no per-rule:shipped in [FEATURE] router-proxy: configurable ResponseHeaderTimeout (default 30s too short for long-form LLM) #457. Sufficient for single-purpose routers; falls
apart on mixed-workload routers where PII and long reasoning
share a config.
ModelRouter mixes routes.
x-llmkube-timeout): leaks operatorpolicy to application teams. Wrong layer.
Additional Context
[BUG] router-proxy: stale HTTP keep-alive conns cause 30s silent stalls instead of fast failover #459 (keep-alive pool health — addresses the same symptom from a
different angle).
hack/demo-modelrouter.sh: the PII rule and thecomplex-reasoning rule both target Gemma but want very different
timeouts. PII shouldn't wait 30s for a hung backend before fail-
closing; complex reasoning shouldn't 503 on a healthy 60s call.
Priority
single-timeout is awkward.
Willingness to Contribute