[FEATURE] ModelRouter: per-rule and per-backend latency timeout fields #458

@Defilan

Feature Description

Add an optional timeout field (*metav1.Duration) to
ModelRouter.spec.rules[] and ModelRouter.spec.backends[]. At
dispatch time the proxy applies the resolved timeout through a
per-request context.WithTimeout. Resolution order: rule.timeout,
then backend.timeout, then the proxy default.

Problem Statement

As a platform operator with rules of mixed latency profiles in a
single ModelRouter, I want each rule and backend to have its own
timeout budget, so PII rules can fail-fast while long reasoning
rules can be patient.

Today there is exactly one proxy-side timeout
(ResponseHeaderTimeout, currently 30s). This collapses three
operationally distinct profiles into one:

  • Strict / fail-fast rules (sensitive data, health, sync agent
    steps): want sub-10s caps. A hung backend should fail fast so
    the caller can fall back to its own retry.
  • Normal interactive chat (default route): 30-60s.
  • Deliberately slow reasoning (taskComplexity: complex,
    Anthropic Claude thinking, DeepSeek-V4 reasoning): 120s+.

Per-backend overrides matter too: an in-cluster Gemma 12B at
~30 tok/s and a remote Anthropic API behind a global load balancer
have different P99 envelopes for the same prompt.

Proposed Solution

Per-rule

rules:
  - name: pii-stays-local
    match: { dataClassification: [pii] }
    route: { backends: [local-chat] }
    failClosed: true
    timeout: 8s            # NEW

  - name: complex-to-chat
    match: { taskComplexity: complex }
    route:
      backends: [local-chat, local-coder]
      strategy: primary-fallback
    timeout: 120s          # NEW

spec.rules[].timeout (*metav1.Duration, optional). Overrides
the proxy default for any dispatch matched by this rule. Applied
via per-request context.WithTimeout, not by mutating the shared
transport.

Per-backend

backends:
  - name: cloud-anthropic
    external:
      provider: anthropic
      model: claude-opus-4-7
    tier: cloud
    timeout: 180s          # NEW

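spec.backends[].timeout (*metav1.Duration, optional). Overrides
the proxy default for any dispatch to this backend, unless the
matching rule sets its own timeout.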

Resolution order

rule.timeout takes precedence over backend.timeout, which takes
precedence over the proxy default (set in #457).

Status surface

status.rules[].observedTimeout reflects the timeout the proxy is
actually using after resolution. This helps catch CRD-to-runtime
drift via kubectl describe.

Alternatives Considered

Additional Context

Priority

  • High - Would significantly improve my workflow
  • Mixed-workload ModelRouters are the common enterprise case, and
    a single timeout is awkward for them.

Willingness to Contribute

  • Yes, I can submit a PR

Metadata

Labels

  • enhancement
