model routing: cost/latency ranking with ranked fallback list #849

Merged
adilhafeez merged 13 commits into main from adil/top-level-routing-preferences
Mar 30, 2026


@adilhafeez adilhafeez commented Mar 27, 2026

Summary

  • Top-level routing_preferences (v0.4.0+) with candidate model list and selection_policy
  • /routing/v1/* returns ranked models[] array; client uses models[0], falls back on 429/5xx
  • selection_policy.prefer: cheapest, fastest, random, none
  • model_metrics_sources: cost_metrics, prometheus_metrics, digitalocean_pricing (public DO catalog with model_aliases)
  • Startup errors for missing metric sources; startup + request-time warnings for unmatched models
  • Dropped legacy per-provider routing format
  • Demo updated to v0.4.0 with docker-compose (Prometheus + mock latency server)
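
A minimal client-side sketch of the fallback behaviour described in the summary: try `models[0]` first and move down the ranked list on 429 or 5xx. The `send` callable and the function names are hypothetical stand-ins for whatever HTTP client the caller uses; they are not part of Plano's API.

```python
from typing import Callable, Optional, Sequence, Tuple


def is_retryable(status: int) -> bool:
    """429 and any 5xx trigger a fallback to the next ranked model."""
    return status == 429 or 500 <= status <= 599


def call_with_fallback(
    models: Sequence[str],
    send: Callable[[str], Tuple[int, str]],
) -> Tuple[str, int, str]:
    """Try each ranked model in order.

    `send(model)` is assumed to return (status_code, body). Returns the
    (model, status, body) of the first non-retryable response, or of the
    last attempt if every model was rate limited or failed.
    """
    if not models:
        raise ValueError("routing response contained no models")
    last: Optional[Tuple[str, int, str]] = None
    for model in models:
        status, body = send(model)
        if not is_retryable(status):
            return model, status, body
        last = (model, status, body)
    assert last is not None  # models is non-empty, so the loop ran
    return last
```

Non-retryable client errors such as 400 are returned immediately rather than retried, since a different model is unlikely to fix a malformed request; that choice is an assumption of this sketch.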

fixes #848

- MetricsSource::DigitalOceanPricing variant: fetch public DO Gen-AI pricing, normalize as lowercase(creator)/model_id, cost = input + output per million
- cost_metrics endpoint format updated to { "model": { "input_per_million": X, "output_per_million": Y } }
- Startup errors: prefer:cheapest requires cost source, prefer:fastest requires prometheus
- Startup warning: models with no pricing/latency data ranked last
- One-per-type enforcement for digitalocean_pricing; error if cost_metrics and digitalocean_pricing are both configured
- cost_snapshot() / latency_snapshot() on ModelMetricsService for startup checks
- Demo config updated to v0.4.0 top-level routing_preferences with cheapest + fastest policies
- docker-compose.yaml + prometheus.yaml + metrics_server.py for demo latency metrics
- Schema and docs updated
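
The cost rules in the list above can be sketched roughly as follows. This is a Python illustration, not the Rust implementation from this PR: all function names are hypothetical, and the only details taken from the description are the `lowercase(creator)/model_id` key, the `input_per_million` / `output_per_million` fields, cost as input plus output, the startup requirements for `cheapest` and `fastest`, and ranking models without data last.

```python
from typing import Dict, List


def normalize_key(creator: str, model_id: str) -> str:
    """Catalog key as described above: lowercase(creator)/model_id."""
    return f"{creator.lower()}/{model_id}"


def combined_cost(entry: Dict[str, float]) -> float:
    """Cost used for ranking: input + output price per million tokens."""
    return entry["input_per_million"] + entry["output_per_million"]


def validate_policy(prefer: str, has_cost_source: bool, has_prometheus: bool) -> None:
    """Startup check: each preference needs its matching metrics source."""
    if prefer == "cheapest" and not has_cost_source:
        raise ValueError("prefer: cheapest requires a cost metrics source")
    if prefer == "fastest" and not has_prometheus:
        raise ValueError("prefer: fastest requires prometheus_metrics")


def rank_cheapest(
    candidates: List[str],
    cost_metrics: Dict[str, Dict[str, float]],
) -> List[str]:
    """Sort candidates by combined cost; models with no pricing go last.

    sorted() is stable, so unmatched models keep their configured order
    (an assumption of this sketch, not something the PR states).
    """
    def sort_key(model: str):
        entry = cost_metrics.get(model)
        if entry is None:
            return (True, 0.0)
        return (False, combined_cost(entry))

    return sorted(candidates, key=sort_key)
```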
@adilhafeez adilhafeez changed the title from "add top-level routing_preferences with selection_policy and model metrics fetch" to "model routing: cost/latency ranking with ranked fallback list" Mar 28, 2026

@alex-paperspace alex-paperspace left a comment


lgtm

Plano should only handle ranking that requires server-side data
(cost metrics, latency). Random shuffling is trivial for callers.

PR #851 introduced duplicate openai/gpt-4o entries and set
use_agent_orchestrator: true with multiple endpoints. Fixed by using
groq/llama-3.3-70b-versatile for the routing_preferences example
and setting use_agent_orchestrator: false.

Falls back to bare python when uv is not available (CI).

PR #851 added ratelimit examples using unit: day but the Rust
TimeUnit enum only had second/minute/hour. Adds Day variant and
maps it to per-hour quota (tokens/24).
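
The `unit: day` mapping in the ratelimit commit message above amounts to simple arithmetic. A rough Python illustration follows; the actual change is in the Rust `TimeUnit` enum, and the function name and the choice of floor division here are assumptions.

```python
from typing import Tuple

# Units the commit message says TimeUnit already supported.
NATIVE_UNITS = ("second", "minute", "hour")


def effective_quota(tokens: int, unit: str) -> Tuple[int, str]:
    """Return the (tokens, unit) pair actually enforced.

    A daily limit is spread evenly over 24 hourly windows (tokens/24);
    other supported units pass through unchanged. Rounding down is an
    assumption, so a daily limit below 24 becomes 0 per hour here.
    """
    if unit == "day":
        return tokens // 24, "hour"
    if unit in NATIVE_UNITS:
        return tokens, unit
    raise ValueError(f"unsupported ratelimit unit: {unit}")
```

For example, a limit of 2400 tokens per day becomes 100 tokens per hour. A per-hour quota is stricter than a true daily quota for bursty traffic, since a caller cannot spend the whole daily allowance in a single hour.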
@adilhafeez adilhafeez merged commit e5751d6 into main Mar 30, 2026
36 of 37 checks passed

Development

Successfully merging this pull request may close these issues.

feat: top-level routing_preferences with selection_policy and metrics fetch (v0.4.0)

2 participants