Add experimental thv audit-trifecta group auditor by JAORMX · Pull Request #5403 · stacklok/toolhive

JAORMX · 2026-05-29T12:42:29Z

Summary

When a single agent context combines access to private data, exposure to
untrusted content, and a way to send data out — Simon Willison's "lethal
trifecta" — a prompt injection can read secrets and exfiltrate them. A ToolHive
group already defines the set of MCP servers that share one agent's context,
so it is the natural unit to detect this risk.

This adds a hidden, advisory-only thv audit-trifecta command (and a
pkg/toxicflow engine) that classifies each server in a group into
source / data / sink roles and reports whether the group forms a
source → data → sink toxic flow.

Roles are derived, not stored: data and sink come from the permission
profile (filesystem reads / secrets → data; egress / remote / host-mode →
sink) with real confidence; the source leg is deliberately raise-only —
a server can never self-declare safe, so it is never a confident "none" except
via an explicit operator override.
Sink is open-by-default-aware: a confident "none" requires positive
evidence of closed egress (a nil profile/network/outbound reads as open,
matching the runtime), so the detector cannot emit a false "safe".
Verdict is gated by the weakest leg: present / possible /
indeterminate / none.
Pluggable inference for the weak source leg: keyword matching by default
(offline, deterministic), or an LLM when configured — explicit flags
(--llm-model/--llm-base-url, key optional) or reuse of a running
thv llm proxy (--llm-proxy). Inference hints are raise-only and capped
at "possible", so a wrong or hallucinated hint can only over-warn. The LLM
receives only public catalog metadata, never profiles/secrets/config.
--live probes running (actively-proxied) servers for the authoritative
openWorldHint signal through the ToolHive proxy.
Overrides correct mis-classified servers; they are validated (role /
confidence / required reason), authoritative, and surfaced in output.

Type of change

New feature

Test plan

Linting (task lint-fix)
Manual testing (describe below)

Unit tests for pkg/toxicflow and pkg/llm/client pass (table-driven, covering
the classifier branches, weakest-leg verdict, raise-only/capped hint folding,
override validation, and the LLM client). Manually ran thv audit-trifecta
against a real local group in static, --live, and LLM (OpenRouter
gpt-4o-mini) modes; verified verdicts, evidence, self-contained-flow
remediation, graceful degradation when inference/proxy is unavailable, and the
unknown-group error path. --llm-proxy against a live thv llm gateway was not
e2e-verified (no gateway configured in the test environment).

API Compatibility

This PR does not break the v1beta1 API.

Does this introduce a user-facing change?

No — the command is Hidden and experimental (not shown in thv --help),
gated while the classification heuristics are calibrated.

Special notes for reviewers

Size: this is one cohesive feature but exceeds the 400-line / 10-file
guideline (mostly new code + tests behind a hidden flag). Happy to split into
(a) pkg/llm/client + ProxyBaseURL, (b) the pkg/toxicflow engine +
command + skill, or static-only first then the LLM/--live layer — let me
know your preference.
Safety model: the load-bearing invariant is that the detector never emits
a false "safe". It is enforced structurally — source is raise-only, hints are
clamped to "possible" at the single applyHints choke point, and every
best-effort failure degrades toward indeterminate, never none.
Design choices: hand-rolled OpenAI-compatible client (chat/completions for
maximum portability; the Responses API and multi-provider libraries were
researched and rejected as narrowing/heavyweight for one stateless call). The
thv llm gateway must accept OpenAI-shaped requests for --llm-proxy.
Developed and reviewed iteratively with multi-agent panels (spec / standards /
security / architecture / library-reuse).

🤖 Generated with Claude Code

Large PR Justification

This is one cohesive, self-contained feature and the size is dominated by new
code plus its tests, not by churn in existing code:

Low blast radius. The command is Hidden and advisory-only (it reports,
never blocks), and almost everything new lives in a brand-new isolated package
(pkg/toxicflow) plus a small generic client (pkg/llm/client). Changes to
existing files are tiny: one AddCommand line, and a one-method
ProxyBaseURL() helper reused by thv llm.
Tests are a large share of the diff. The pure classifier/analyzer and the
LLM client are thoroughly unit-tested; the test files account for a big chunk
of the line count.
The natural seams don't stand alone well. The engine is meaningless
without the command that drives it, and the command is a thin wrapper over the
engine, so splitting them produces PRs that can't be evaluated independently.

Happy to split it on request into (1) pkg/llm/client + ProxyBaseURL,
(2) the pkg/toxicflow core engine + tests, and (3) collect/probe/inference +
command + skill, if reviewers would prefer that over reviewing it as one unit.

Co-located private-data access, exposure to untrusted content, and an exfiltration path within one agent context (the "lethal trifecta") let a prompt injection steal data and send it out. ToolHive groups already define a set of MCP servers that share an agent's context, so they are the natural place to detect this. Add a hidden, advisory-only `thv audit-trifecta` command that classifies each server in a group into source/data/sink roles and reports whether the group forms a source -> data -> sink toxic flow. Roles are derived from permission profiles and registry metadata; the data and sink legs are inferred with confidence from the profile, while the source leg is deliberately raise-only (a server can never self-declare safe). An optional inference layer refines the weak source signal: keyword matching by default, or an LLM when configured (explicit flags or the running `thv llm` proxy). Inference hints are raise-only and capped, so a wrong or hallucinated hint can only over-warn, never produce a false "safe". `--live` probes running servers for openWorldHint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.

This review will be automatically dismissed once you add the justification section.

codecov · 2026-05-29T12:48:44Z

Codecov Report

❌ Patch coverage is 73.19588% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.79%. Comparing base (6f63ac0) to head (c13c6b1).

Files with missing lines	Patch %	Lines
pkg/toxicflow/collect.go	12.30%	57 Missing ⚠️
pkg/toxicflow/probe.go	43.47%	13 Missing ⚠️
pkg/toxicflow/inference.go	80.70%	6 Missing and 5 partials ⚠️
pkg/llm/client/client.go	77.77%	5 Missing and 5 partials ⚠️
pkg/toxicflow/classify.go	92.36%	6 Missing and 4 partials ⚠️
pkg/toxicflow/types.go	83.33%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5403      +/-   ##
==========================================
+ Coverage   68.77%   68.79%   +0.01%     
==========================================
  Files         629      636       +7     
  Lines       63914    64301     +387     
==========================================
+ Hits        43955    44233     +278     
- Misses      16703    16797      +94     
- Partials     3256     3271      +15

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codespell reads the camelCase identifier "atLeast" as the typo "at least". Rename the Confidence helper to a synonym that carries the same meaning without tripping the dictionary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Large PR justification has been provided. Thank you!

github-actions · 2026-05-29T12:56:26Z

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

JAORMX requested review from ChrisJBurns, amirejaz, jhrozek, lujunsan, rdimitrov and yrobla as code owners May 29, 2026 12:42

github-actions Bot previously requested changes May 29, 2026

View reviewed changes

github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 29, 2026

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental thv audit-trifecta group auditor#5403

Add experimental thv audit-trifecta group auditor#5403
JAORMX wants to merge 2 commits into
mainfrom
toxic-flow-audit

JAORMX commented May 29, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

codecov Bot commented May 29, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JAORMX commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of change

Test plan

API Compatibility

Does this introduce a user-facing change?

Special notes for reviewers

Large PR Justification

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Large PR Detected

How to unblock this PR:

Alternative:

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JAORMX commented May 29, 2026 •

edited

Loading

codecov Bot commented May 29, 2026 •

edited

Loading