feat(middleware): Model routing, PII filtering, Cloud model proxies #9802

Open
richiejp wants to merge 7 commits into mudler:master from richiejp:feat/routing-stats-backend

@richiejp (Collaborator) commented May 13, 2026

Adds middleware that analyzes requests and then routes, filters, or transforms them.

Chat requests can be classified and labelled with the capabilities they require,
then routed to a model that satisfies all of those capabilities. Requests that require fewer capabilities can be handled by smaller, specialized models. In addition, the classifier selects more capabilities the more uncertain it is, so difficult requests are routed to larger general-purpose models.
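
As a rough illustration of the capability matching described above, here is a minimal sketch in Python (the PR itself is Go). The `Candidate` type, the `size_b` ordering field, and `pick_model` are hypothetical names invented for this example, and choosing the smallest eligible model is an assumption about the selection policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    capabilities: frozenset
    size_b: float  # hypothetical: parameter count in billions, used only for ordering


def pick_model(required, candidates, fallback):
    """Return the smallest candidate whose capabilities cover `required`."""
    eligible = [c for c in candidates if required <= c.capabilities]
    if not eligible:
        # No candidate covers every required capability.
        return fallback
    return min(eligible, key=lambda c: c.size_b).name


CANDIDATES = [
    Candidate("small-coder", frozenset({"code"}), 1.5),
    Candidate("small-chat", frozenset({"chat"}), 3.0),
    Candidate("large-general", frozenset({"code", "chat", "math", "reasoning"}), 70.0),
]
```

With these made-up candidates, a request labelled only `code` would go to `small-coder`, while one labelled `code` and `math` can only be served by `large-general`.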

Classification is very fast, but once requests have been classified, their embeddings can be used to avoid classifying similar requests again. This works by labelling the embeddings of past requests and then running a cosine similarity search against them with the embeddings of new requests.
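
The cache lookup can be pictured with a small Python sketch (not the PR's Go code). `LabelCache`, the linear scan, and the 0.9 similarity threshold are assumptions made for illustration; the PR keeps embeddings in a vector store instead.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class LabelCache:
    def __init__(self, threshold=0.9):  # hypothetical threshold
        self.threshold = threshold
        self.entries = []  # list of (embedding, labels) from past classifications

    def add(self, embedding, labels):
        self.entries.append((list(embedding), frozenset(labels)))

    def lookup(self, embedding):
        """Return the labels of the most similar past request, or None on a miss."""
        best_sim, best_labels = 0.0, None
        for stored, labels in self.entries:
            sim = cosine(embedding, stored)
            if sim > best_sim:
                best_sim, best_labels = sim, labels
        return best_labels if best_sim >= self.threshold else None
```

A `None` result means the request is not close enough to anything seen before, so the classifier would still have to run.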

Private information can be detected. When it is found in a request, the request can be modified to redact it,
routed differently, or blocked.
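
To make the three outcomes concrete, here is a minimal Python sketch of a regex-based policy (the PR's filter is written in Go and also has an NER tier). The two patterns, the placeholder format, and `apply_pii_policy` are illustrative assumptions, not the PR's actual pattern set or API.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def apply_pii_policy(text, action="redact"):
    """Return (decision, text, matched_pattern_names) for one request body."""
    found = sorted(name for name, pat in PATTERNS.items() if pat.search(text))
    if not found:
        return "allow", text, found
    if action == "block":
        # Drop the request body entirely.
        return "block", "", found
    if action == "route":
        # Leave the text intact; the caller picks a different model.
        return "route", text, found
    redacted = text
    for name in found:
        redacted = PATTERNS[name].sub(f"[{name.upper()}]", redacted)
    return "redact", redacted, found
```

For example, `apply_pii_policy("mail bob@example.com from 10.0.0.1")` replaces both matches with placeholders and reports which patterns fired.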

Cloud models and a MITM proxy can be configured to take part in filtering and routing.
This allows sending easy requests to smaller local models and hard ones to cloud models.
The MITM proxy allows you to use Claude Code or Codex subscriptions (OAuth) with the PII
filter and potentially even with routing (although this is limited by the cloud providers' ToS).

Routing classifies requests using a model such as Arch-Router, which labels each request.
We score each request on the capabilities it may require, then pick a model that
has all of the capabilities whose scores fall towards the top of the distribution.
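
One way to read "towards the top of the distribution" is sketched below in Python (the PR's classifier is Go). The softmax normalisation, the `ratio` cutoff, and the name `active_labels` are assumptions for illustration; the PR may select labels differently.

```python
import math


def active_labels(scores, ratio=0.5):
    """scores maps label -> log-score. Keep every label whose normalised
    probability is at least `ratio` times the top label's probability."""
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}  # shift for numerical stability
    total = sum(exp.values())
    probs = {k: e / total for k, e in exp.items()}
    top = max(probs.values())
    return {k for k, p in probs.items() if p >= ratio * top}
```

Under this rule a peaked distribution activates a single label, while a flat (uncertain) one activates several, which then requires a model with more capabilities.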

The ability to score multiple choices is an interesting feature in its own right.
It allows you to very quickly check with what probability an LLM would produce a particular
answer.
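
As a sketch of what such scoring could compute, the Python below (not the PR's Go or backend code) turns per-token log-probabilities for each candidate answer into a normalised probability per choice. The input shape and `choice_probabilities` are hypothetical; how the Score RPC actually derives its numbers is not stated here.

```python
import math


def choice_probabilities(token_logprobs):
    """token_logprobs maps choice -> list of per-token log-probabilities.
    Returns choice -> probability, normalised across the given choices."""
    totals = {c: sum(lps) for c, lps in token_logprobs.items()}
    m = max(totals.values())
    weights = {c: math.exp(t - m) for c, t in totals.items()}  # stable softmax
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}
```

Because only the listed continuations are evaluated and nothing is generated, this kind of check needs just one scoring pass per choice.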

  • feat(routing): add billing recorder and stats backend foundation
  • feat(routing): expose usage stats in REST, UI, and MCP
  • feat(routing): add regex PII filter with REST and MCP surfaces
  • feat(routing): record usage end-to-end in no-auth mode
  • feat(routing): per-model PII gating + middleware admin page
  • feat(routing): rule-based intelligent router (subsystem 2 MVP)
  • feat(routing): streaming PII filter with buffered-emit invariant
  • feat(routing): PII pattern editor in model config UI
  • feat(routing): streaming PII filter on Anthropic /v1/messages and /v1/completions
  • feat(routing): cloud passthrough proxy (subsystem 4 MVP)
  • docs(routing): cloud passthrough proxy feature page
  • feat(routing): MITM proxy for subscription-auth Claude Code / Codex
  • feat(mitm): negotiate HTTP/2 with h1.1 fallback
  • refactor(cloudproxy): extract shared SSE wire helpers, trim dead state and comments
  • feat(import-model): add cloud-proxy templates to YAML editor
  • Revert "feat(import-model): add cloud-proxy templates to YAML editor"
  • feat(model-editor): add cloud-proxy templates to Add Model picker
  • feat(mitm): runtime control of listener and intercept allowlist
  • feat(middleware-ui): MITM proxy admin tab
  • refactor(mitm): simplify-pass cleanup
  • feat(mitm): emit proxy_connect + proxy_traffic audit events
  • test(mitm): cover tunneled-host event + Events tab kind filter
  • fix(mitm): restore listener from runtime_settings.json on restart
  • fix(routing): address code-review findings across pii/mitm/router
  • feat(middleware): per-pattern PII toggle, model-config-owned MITM hosts
  • refactor(store/local): extract in-process vector store library
  • feat(routing): KNN + LLM classifiers and per-model admission control
  • refactor(store): keep the vector store out of the main process
  • feat(backend): TokenClassify RPC + transformers NER pipeline
  • fix(openai): add missing auth import to chat.go
  • feat(pii): NER tier in the redactor
  • feat(middleware-ui): router template + Create routing model link
  • fix(model-editor): code-editor crash on structured template values
  • feat(model-editor): structured router-candidates editor + proxy chat usecase
  • fix(router-candidates): one textarea per exemplar, multi-line-safe
  • feat(router): KNN consumes a benchmarker-produced routing dataset
  • docs(router): recommend nomic-embed-text-v1.5 over Longformer
  • feat(routing): Score gRPC primitive, score classifier, L2 embedding cache

@richiejp richiejp force-pushed the feat/routing-stats-backend branch 4 times, most recently from aff5af4 to 8389d96 Compare May 13, 2026 14:54
@richiejp richiejp force-pushed the feat/routing-stats-backend branch 3 times, most recently from d8b32b7 to d82ad5c Compare May 19, 2026 09:49
richiejp and others added 3 commits May 21, 2026 11:57
Big-bang squash-friendly commit covering the work since master:
phases 1-7 of the cloud-proxy migration, tool-call support, plus
the surrounding routing / middleware / PII / billing scaffolding
this branch had been carrying.

Cloud-proxy backend (backend/go/cloud-proxy/):
  * New gRPC backend with two modes.
  * Passthrough: Forward RPC shovels raw HTTP between client and
    upstream so the wire format is preserved byte-for-byte.
  * Translate: PredictRich / PredictStreamRich convert internal
    proto to OpenAI Chat Completions or Anthropic Messages,
    preserving tool calls + usage tokens through pb.Reply.
  * API keys resolved from api_key_env or api_key_file (mutually
    exclusive), never stored in YAML.
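
A minimal Python sketch of the key-resolution rule above (the backend itself is Go). Only the `api_key_env` and `api_key_file` option names come from this commit; `resolve_api_key`, the error messages, and returning `None` when neither option is set are assumptions.

```python
import os
from pathlib import Path


def resolve_api_key(cfg, environ=os.environ):
    """Resolve the upstream API key from exactly one configured source."""
    env_name = cfg.get("api_key_env")
    file_path = cfg.get("api_key_file")
    if env_name and file_path:
        raise ValueError("api_key_env and api_key_file are mutually exclusive")
    if env_name:
        key = environ.get(env_name, "")
        if not key:
            raise ValueError(f"environment variable {env_name} is unset or empty")
        return key
    if file_path:
        key = Path(file_path).read_text().strip()
        if not key:
            raise ValueError(f"key file {file_path} is empty")
        return key
    return None  # assumption: no key source configured means no key is sent
```

The config only ever names where the key lives, so the secret itself never appears in the YAML.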

gRPC interface (pkg/grpc/):
  * Forward bidi RPC added to Backend proto.
  * AIModelRich optional extension interface returning *pb.Reply
    so backends can surface tool_calls and usage tokens.
  * Fixed forwardClient.CloseSend prematurely closing the gRPC
    connection — caught by e2e tests. Cleanup now fires on stream
    end (Recv error/EOF) instead.

Core integration:
  * IsCloudProxyBackendPassthrough hook in chat + Anthropic
    endpoints; legacy "proxy-*" backend prefix removed (hard
    cutover — nothing released).
  * cloudproxy.ForwardViaBackend + cloudproxy.BuildStreamFilter
    shared by both endpoint families.
  * PII filter applies to translate mode via the standard
    streaming pipeline; verified by e2e.

Routing + middleware (carried from earlier on the branch):
  * Score / Rerank / Embedder / VectorStore interfaces in
    core/backend with Application factory methods.
  * Router with score classifier, depth-1 invariant, embedding
    cache, PII config, billing recorder.
  * Admission middleware, route-model dispatch, usage stamping.
  * MITM proxy + CA management for intercepting cloud traffic.
  * Middleware admin page in the React UI.

Local-store backend rewrite + tests covering Set / Get /
Delete / Find invariants.

Llama-cpp Score concurrency guard: conflict_guard tripwire
plus FLAG_SCORE/{CHAT,COMPLETION,EMBEDDINGS} validation rule
in core/config.

Tests: 60+ new unit tests across cloud-proxy backend, cloudproxy
core glue, gRPC server + AIModelRich dispatch, config validation,
and 6 e2e specs that stand up a real two-process gRPC link with
fake upstreams (gaps mudler#1/mudler#2/mudler#3 from review).

Docs: cloud-proxy.md, middleware.md, mitm-proxy.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>

`go build ./...` (and other multi-package builds that include
backend/go/cloud-proxy or backend/go/local-store) writes a binary
named after the package directory into the working directory. Add
both names to the existing root-binary ignore block so the working
tree stays clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds katanemo/Arch-Router-1.5B (mradermacher GGUFs) as a compact routing
LLM that pairs with the router classifier as a preference-aligned
alternative to embedding/ColBERT-based routing.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch]
@richiejp richiejp force-pushed the feat/routing-stats-backend branch from 5bef3d3 to dfa8619 Compare May 21, 2026 10:57
richiejp added 4 commits May 21, 2026 11:57
Wire cloud-proxy into the BACKEND_* generator so docker-build-cloud-proxy,
docker-save-cloud-proxy, and backends/cloud-proxy work like every other
golang backend. Same profile as local-store (golang|.|false|true).

Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]

The router middleware silently swallowed classifier build errors and
fell through to cfg.Router.Fallback, so misconfigured routers looked
"working" while the classifier model was never invoked — operators saw
zero classifier latency and no backend traces with no clue why. Drop
the silent fallback for build-time errors and return 503 with the
underlying reason. Classify-time errors and label-coverage misses still
use the fallback.
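
The resulting error handling can be sketched as follows, in Python rather than the middleware's Go. The exception types and the `route` function are hypothetical; only the 503-on-build-error and fallback-on-classify-error behaviour is taken from the text above.

```python
class ClassifierBuildError(Exception):
    """Hypothetical: the classifier could not be constructed (misconfiguration)."""


class ClassifyError(Exception):
    """Hypothetical: the classifier was built but failed on this request."""


def route(request, build_classifier, label_to_model, fallback):
    """Return (http_status, model_name_or_error_message)."""
    try:
        classifier = build_classifier()
    except ClassifierBuildError as err:
        # Build-time failure: surface it instead of silently falling back.
        return 503, f"router classifier unavailable: {err}"
    try:
        label = classifier(request)
    except ClassifyError:
        # Classify-time failure: still use the fallback model.
        return 200, fallback
    # Label-coverage miss: no candidate for this label, use the fallback.
    return 200, label_to_model.get(label, fallback)
```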

The score usecase (required on the classifier_model via
known_usecases: [score]) was also missing from the model-editor
dropdown and from the Arch-Router gallery entries, which inherited
known_usecases: [chat] from chatml.yaml. UsecaseOptions now includes
score and the six other flags that were similarly missing (vision,
face_recognition, speaker_recognition, audio_transform, diarization,
realtime_audio); the arch-router gallery entries override
known_usecases to [score] so they install in a usable state.

Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]

HasUsecases was purely additive — declared `known_usecases` plus
whatever GuessUsecases inferred from backend/templates. A model with
`known_usecases: [score]` that inherited a chat template (e.g.
arch-router off chatml.yaml) would still report HasUsecases(FLAG_CHAT) =
true via the heuristic, surfacing as a chat model in pickers it was
deliberately reserved out of.

Score is already special — the heuristic refuses to guess it because
score-intent is a deliberate reservation. Extend that special-casing so
that *if* score is declared, the entire known_usecases list becomes
authoritative: the heuristic must not paint chat/completion/embeddings
on top. Other declarations remain additive (existing test `i` still
passes).
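
The rule reads roughly like this Python sketch (the real HasUsecases is Go and works on flag bitmasks). `has_usecase` and its string-based arguments are hypothetical simplifications.

```python
def has_usecase(flag, known, guessed):
    """known: usecases declared in known_usecases.
    guessed: usecases inferred heuristically from backend/templates."""
    known = set(known)
    if "score" in known:
        # Declaring score makes the declared list authoritative.
        return flag in known
    # Otherwise declarations stay additive; score itself is never guessed.
    return flag in known or flag in (set(guessed) - {"score"})
```

So a model declaring only `score` that inherits a chat template no longer reports `chat`, while a model declaring `embeddings` still picks up a guessed `chat`.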

Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]

Two paths previously ran silent — the router's Score RPC and the
cloud-proxy passthrough Forward RPC — so operators debugging a misrouted
or 4xx'd request had no visibility:

  - Score: ScoreClassifier called modelScorer.Score → backend gRPC with
    no xlog and no trace.RecordBackendTrace. The Traces UI showed only
    the downstream LLM call, never the classifier that picked it.

  - Cloud-proxy Forward: passthrough mode bypasses core/backend/llm.go
    and its trace site, so 4xx responses left no log line and no trace
    row. Adding to the confusion, the cloud-proxy backend itself emitted
    no per-request output and ANSI-coloured its stdout (xlog's default
    handler), so the captured Backend Logs were near-empty and what was
    there had escape sequences.

Changes:

  - backend/go/cloud-proxy/main.go: default LOCALAI_LOG_FORMAT to text
    when stdout isn't a TTY (it never is for a spawned backend).

  - backend/go/cloud-proxy/proxy.go: xlog.Info on Predict / PredictStream
    / Forward entry with upstream + provider + upstream_model; xlog.Warn
    on failure or 4xx response.

  - core/services/cloudproxy/backend_forward.go: promote the existing
    Debug log to Info/Warn with upstream_model; record a BackendTraceLLM
    trace via defer covering the full forward duration.

  - core/services/routing/router/score.go: xlog.Info after each
    classification with active labels + top label + latency.

  - core/trace/backend_trace.go + core/backend/score.go: new
    BackendTraceScore type; modelScorer.Score records a trace row with
    prompt summary + candidate scores so router classifications show up
    in the Traces UI alongside the LLM calls they gated.
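
The log-format default from the first change can be sketched like this in Python (the backend is Go). Only the `LOCALAI_LOG_FORMAT` variable and the `text` value come from the commit message; `default_log_format` and the injected `stdout_is_tty` flag are illustrative.

```python
import os
import sys


def default_log_format(environ, stdout_is_tty):
    """Set LOCALAI_LOG_FORMAT to "text" when it is unset and stdout is not a
    terminal, so captured backend logs carry no ANSI escape sequences."""
    if "LOCALAI_LOG_FORMAT" not in environ and not stdout_is_tty:
        environ["LOCALAI_LOG_FORMAT"] = "text"
    return environ.get("LOCALAI_LOG_FORMAT")


if __name__ == "__main__":
    # A spawned backend would call this once at startup.
    default_log_format(os.environ, sys.stdout.isatty())
```

An explicit operator setting is left untouched; only the unset case is defaulted.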

Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>