Add experimental `thv audit-trifecta` group auditor #5403
Co-located private-data access, exposure to untrusted content, and an exfiltration path within one agent context (the "lethal trifecta") let a prompt injection steal data and send it out. ToolHive groups already define a set of MCP servers that share an agent's context, so they are the natural place to detect this. Add a hidden, advisory-only `thv audit-trifecta` command that classifies each server in a group into source/data/sink roles and reports whether the group forms a source -> data -> sink toxic flow. Roles are derived from permission profiles and registry metadata; the data and sink legs are inferred with confidence from the profile, while the source leg is deliberately raise-only (a server can never self-declare safe). An optional inference layer refines the weak source signal: keyword matching by default, or an LLM when configured (explicit flags or the running `thv llm` proxy). Inference hints are raise-only and capped, so a wrong or hallucinated hint can only over-warn, never produce a false "safe". `--live` probes running servers for openWorldHint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
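Because the three legs can come from different servers in the same group, the check is a group-level aggregation. A minimal Go sketch of one way to compute it, assuming a "weakest leg" rule: each leg takes the strongest confidence any server shows for that role, and the group verdict is the weakest of the three legs. The `Confidence`, `Roles`, and `groupVerdict` names and the ordering none < indeterminate < possible < present are hypothetical, not the `pkg/toxicflow` API.

```go
package main

import "fmt"

// Confidence is an ordered scale of evidence that a role is present.
// Assumed (hypothetical) ordering: none < indeterminate < possible < present.
type Confidence int

const (
	None Confidence = iota
	Indeterminate
	Possible
	Present
)

func (c Confidence) String() string {
	return [...]string{"none", "indeterminate", "possible", "present"}[c]
}

// Roles is one server's confidence for each leg (hypothetical type).
type Roles struct {
	Source, Data, Sink Confidence
}

func maxConf(a, b Confidence) Confidence {
	if a > b {
		return a
	}
	return b
}

func minConf(a, b Confidence) Confidence {
	if a < b {
		return a
	}
	return b
}

// groupVerdict takes, for each leg, the strongest confidence shown by any
// server in the group, then returns the weakest of the three legs: the
// source -> data -> sink flow is only as certain as its least certain leg.
func groupVerdict(servers map[string]Roles) Confidence {
	var source, data, sink Confidence // zero value is None
	for _, r := range servers {
		source = maxConf(source, r.Source)
		data = maxConf(data, r.Data)
		sink = maxConf(sink, r.Sink)
	}
	return minConf(source, minConf(data, sink))
}

func main() {
	// Neither server forms the full flow alone, but together they do.
	group := map[string]Roles{
		"fetch":  {Source: Possible, Data: None, Sink: Present},
		"github": {Source: Indeterminate, Data: Present, Sink: Present},
	}
	fmt.Println(groupVerdict(group)) // possible
}
```

The example shows why the group, not the individual server, is the unit of analysis: the "fetch" server supplies the strongest source signal and the "github" server supplies the data leg.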
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report
❌ Patch coverage is …

Additional details and impacted files

| Coverage Diff | main | #5403 | +/- |
|---|---|---|---|
| Coverage | 68.77% | 68.79% | +0.01% |
| Files | 629 | 636 | +7 |
| Lines | 63914 | 64301 | +387 |
| Hits | 43955 | 44233 | +278 |
| Misses | 16703 | 16797 | +94 |
| Partials | 3256 | 3271 | +15 |
codespell reads the camelCase identifier `atLeast` as the misspelling "atleast" (of "at least"). Rename the Confidence helper to a synonym that carries the same meaning without tripping the dictionary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Large PR justification has been provided. Thank you!
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.
Summary
When a single agent context combines access to private data, exposure to
untrusted content, and a way to send data out — Simon Willison's "lethal
trifecta" — a prompt injection can read secrets and exfiltrate them. A ToolHive
group already defines the set of MCP servers that share one agent's context,
so it is the natural unit to detect this risk.
This adds a hidden, advisory-only `thv audit-trifecta` command (and a `pkg/toxicflow` engine) that classifies each server in a group into source / data / sink roles and reports whether the group forms a source → data → sink toxic flow.

- The data and sink legs are inferred from the permission profile (filesystem reads / secrets → data; egress / remote / host-mode → sink) with real confidence. The source leg is deliberately raise-only: a server can never self-declare safe, so the source leg is never a confident "none" except via an explicit operator override.
- A sink is ruled out only on positive evidence of closed egress (a nil profile/network/outbound reads as open, matching the runtime), so the detector cannot emit a false "safe".
- The verdict is driven by the weakest leg and is one of present / possible / indeterminate / none.
- An optional inference layer refines the weak source signal: keyword matching by default (offline, deterministic), or an LLM when configured, via explicit flags (`--llm-model`/`--llm-base-url`, key optional) or by reusing a running `thv llm` proxy (`--llm-proxy`). Inference hints are raise-only and capped at "possible", so a wrong or hallucinated hint can only over-warn. The LLM receives only public catalog metadata, never profiles, secrets, or config.
- `--live` probes running (actively proxied) servers for the authoritative `openWorldHint` signal through the ToolHive proxy.
- Operator overrides are validated (… confidence / required reason), authoritative, and surfaced in output.
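The fail-closed rule for the sink leg is easiest to see in code. A minimal Go sketch, assuming simplified, hypothetical `Profile` / `Network` / `Outbound` types (not ToolHive's real permission-profile types) and modeling only the rules listed above: filesystem reads or secrets raise the data leg, while remote, host-mode, or open egress raises the sink leg. The `dataLeg` and `sinkLeg` helpers, the nil-profile handling for the data leg, and the "possible" rating for allow-listed egress are all illustrative assumptions.

```go
package main

import "fmt"

type Confidence int

const (
	None Confidence = iota
	Indeterminate
	Possible
	Present
)

func (c Confidence) String() string {
	return [...]string{"none", "indeterminate", "possible", "present"}[c]
}

// Outbound, Network and Profile are simplified, hypothetical stand-ins for a
// permission profile. They are not ToolHive's real types.
type Outbound struct {
	AllowAll  bool     // any destination may be reached
	AllowHost []string // egress restricted to these hosts
}

type Network struct {
	Mode     string    // "host" shares the host's network namespace
	Outbound *Outbound // nil means no outbound rules were declared
}

type Profile struct {
	Read    []string // filesystem paths mounted for reading
	Secrets []string // secrets injected into the server
	Network *Network
}

// dataLeg: any filesystem read or secret gives the server private data.
// Treating a nil profile as indeterminate is an assumption of this sketch.
func dataLeg(p *Profile) Confidence {
	switch {
	case p == nil:
		return Indeterminate
	case len(p.Read) > 0 || len(p.Secrets) > 0:
		return Present
	default:
		return None
	}
}

// sinkLeg fails closed: a nil profile, network or outbound section reads as
// open egress, so only explicit evidence of closed egress yields None.
func sinkLeg(p *Profile, remote bool) Confidence {
	if remote || p == nil || p.Network == nil || p.Network.Outbound == nil {
		return Present
	}
	out := p.Network.Outbound
	switch {
	case p.Network.Mode == "host" || out.AllowAll:
		return Present
	case len(out.AllowHost) > 0:
		return Possible // allow-listed egress; this rating is an assumption
	default:
		return None
	}
}

func main() {
	docs := &Profile{Read: []string{"/home/user/docs"}} // no network section
	sealed := &Profile{Network: &Network{Outbound: &Outbound{}}}
	fmt.Println(dataLeg(docs), sinkLeg(docs, false))     // present present
	fmt.Println(dataLeg(sealed), sinkLeg(sealed, false)) // none none
}
```

The asymmetry is the point: leaving a section out can only make the sink leg stronger, so a sparse or missing profile never reads as "safe".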
Type of change
Test plan
- Ran `task lint-fix`.
- Unit tests for `pkg/toxicflow` and `pkg/llm/client` pass (table-driven, covering the classifier branches, weakest-leg verdict, raise-only/capped hint folding, override validation, and the LLM client).
- Manually ran `thv audit-trifecta` against a real local group in static, `--live`, and LLM (OpenRouter `gpt-4o-mini`) modes; verified verdicts, evidence, self-contained-flow remediation, graceful degradation when inference or the proxy is unavailable, and the unknown-group error path.
- `--llm-proxy` against a live `thv llm` gateway was not e2e-verified (no gateway configured in the test environment).
API Compatibility
… `v1beta1` API.

Does this introduce a user-facing change?
No. The command is `Hidden` and experimental (not shown in `thv --help`), gated while the classification heuristics are calibrated.
Special notes for reviewers
- This PR exceeds the 1000-line guideline (mostly new code + tests behind a hidden flag). Happy to split into (a) `pkg/llm/client` + `ProxyBaseURL`, (b) the `pkg/toxicflow` engine + command + skill, or static-only first, then the LLM/`--live` layer; let me know your preference.
- The key invariant is that the detector never emits a false "safe". It is enforced structurally: source is raise-only, hints are clamped to "possible" at the single `applyHints` choke point, and every best-effort failure degrades toward indeterminate, never none.
- The LLM client sends OpenAI-shaped requests (chosen for maximum portability; the Responses API and multi-provider libraries were researched and rejected as narrowing/heavyweight for one stateless call). The `thv llm` gateway must accept OpenAI-shaped requests for `--llm-proxy`.
- … (security / architecture / library-reuse).
🤖 Generated with Claude Code
Large PR Justification
This is one cohesive, self-contained feature and the size is dominated by new
code plus its tests, not by churn in existing code:
- The command is `Hidden` and advisory-only (it reports, never blocks), and almost everything new lives in a brand-new isolated package (`pkg/toxicflow`) plus a small generic client (`pkg/llm/client`). Changes to existing files are tiny: one `AddCommand` line, and a one-method `ProxyBaseURL()` helper reused by `thv llm`.
- The engine and the LLM client are thoroughly unit-tested; the test files account for a big chunk of the line count.
- The engine is hard to evaluate without the command that drives it, and the command is a thin wrapper over the engine, so splitting them produces PRs that can't be evaluated independently.
Happy to split it on request into (1) `pkg/llm/client` + `ProxyBaseURL`, (2) the `pkg/toxicflow` core engine + tests, and (3) collect/probe/inference + command + skill, if reviewers would prefer that over reviewing it as one unit.