Skip to content

Add experimental thv audit-trifecta group auditor#5403

Open
JAORMX wants to merge 2 commits into
mainfrom
toxic-flow-audit
Open

Add experimental thv audit-trifecta group auditor#5403
JAORMX wants to merge 2 commits into
mainfrom
toxic-flow-audit

Conversation

@JAORMX
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX commented May 29, 2026

Summary

When a single agent context combines access to private data, exposure to
untrusted content, and a way to send data out — Simon Willison's "lethal
trifecta" — a prompt injection can read secrets and exfiltrate them. A ToolHive
group already defines the set of MCP servers that share one agent's context,
so it is the natural unit to detect this risk.

This adds a hidden, advisory-only thv audit-trifecta command (and a
pkg/toxicflow engine) that classifies each server in a group into
source / data / sink roles and reports whether the group forms a
source → data → sink toxic flow.

  • Roles are derived, not stored: data and sink come from the permission
    profile (filesystem reads / secrets → data; egress / remote / host-mode →
    sink) with real confidence; the source leg is deliberately raise-only
    a server can never self-declare safe, so it is never a confident "none" except
    via an explicit operator override.
  • Sink is open-by-default-aware: a confident "none" requires positive
    evidence of closed egress (a nil profile/network/outbound reads as open,
    matching the runtime), so the detector cannot emit a false "safe".
  • Verdict is gated by the weakest leg: present / possible /
    indeterminate / none.
  • Pluggable inference for the weak source leg: keyword matching by default
    (offline, deterministic), or an LLM when configured — explicit flags
    (--llm-model/--llm-base-url, key optional) or reuse of a running
    thv llm proxy (--llm-proxy). Inference hints are raise-only and capped
    at "possible"
    , so a wrong or hallucinated hint can only over-warn. The LLM
    receives only public catalog metadata, never profiles/secrets/config.
  • --live probes running (actively-proxied) servers for the authoritative
    openWorldHint signal through the ToolHive proxy.
  • Overrides correct mis-classified servers; they are validated (role /
    confidence / required reason), authoritative, and surfaced in output.

Type of change

  • New feature

Test plan

  • Linting (task lint-fix)
  • Manual testing (describe below)

Unit tests for pkg/toxicflow and pkg/llm/client pass (table-driven, covering
the classifier branches, weakest-leg verdict, raise-only/capped hint folding,
override validation, and the LLM client). Manually ran thv audit-trifecta
against a real local group in static, --live, and LLM (OpenRouter
gpt-4o-mini)
modes; verified verdicts, evidence, self-contained-flow
remediation, graceful degradation when inference/proxy is unavailable, and the
unknown-group error path. --llm-proxy against a live thv llm gateway was not
e2e-verified (no gateway configured in the test environment).

API Compatibility

  • This PR does not break the v1beta1 API.

Does this introduce a user-facing change?

No — the command is Hidden and experimental (not shown in thv --help),
gated while the classification heuristics are calibrated.

Special notes for reviewers

  • Size: this is one cohesive feature but exceeds the 400-line / 10-file
    guideline (mostly new code + tests behind a hidden flag). Happy to split into
    (a) pkg/llm/client + ProxyBaseURL, (b) the pkg/toxicflow engine +
    command + skill, or static-only first then the LLM/--live layer — let me
    know your preference.
  • Safety model: the load-bearing invariant is that the detector never emits
    a false "safe". It is enforced structurally — source is raise-only, hints are
    clamped to "possible" at the single applyHints choke point, and every
    best-effort failure degrades toward indeterminate, never none.
  • Design choices: hand-rolled OpenAI-compatible client (chat/completions for
    maximum portability; the Responses API and multi-provider libraries were
    researched and rejected as narrowing/heavyweight for one stateless call). The
    thv llm gateway must accept OpenAI-shaped requests for --llm-proxy.
  • Developed and reviewed iteratively with multi-agent panels (spec / standards /
    security / architecture / library-reuse).

🤖 Generated with Claude Code

Large PR Justification

This is one cohesive, self-contained feature and the size is dominated by new
code plus its tests, not by churn in existing code:

  • Low blast radius. The command is Hidden and advisory-only (it reports,
    never blocks), and almost everything new lives in a brand-new isolated package
    (pkg/toxicflow) plus a small generic client (pkg/llm/client). Changes to
    existing files are tiny: one AddCommand line, and a one-method
    ProxyBaseURL() helper reused by thv llm.
  • Tests are a large share of the diff. The pure classifier/analyzer and the
    LLM client are thoroughly unit-tested; the test files account for a big chunk
    of the line count.
  • The natural seams don't stand alone well. The engine is meaningless
    without the command that drives it, and the command is a thin wrapper over the
    engine, so splitting them produces PRs that can't be evaluated independently.

Happy to split it on request into (1) pkg/llm/client + ProxyBaseURL,
(2) the pkg/toxicflow core engine + tests, and (3) collect/probe/inference +
command + skill, if reviewers would prefer that over reviewing it as one unit.

Co-located private-data access, exposure to untrusted content, and an
exfiltration path within one agent context (the "lethal trifecta") let a
prompt injection steal data and send it out. ToolHive groups already
define a set of MCP servers that share an agent's context, so they are
the natural place to detect this.

Add a hidden, advisory-only `thv audit-trifecta` command that classifies
each server in a group into source/data/sink roles and reports whether
the group forms a source -> data -> sink toxic flow. Roles are derived
from permission profiles and registry metadata; the data and sink legs
are inferred with confidence from the profile, while the source leg is
deliberately raise-only (a server can never self-declare safe).

An optional inference layer refines the weak source signal: keyword
matching by default, or an LLM when configured (explicit flags or the
running `thv llm` proxy). Inference hints are raise-only and capped, so a
wrong or hallucinated hint can only over-warn, never produce a false
"safe". `--live` probes running servers for openWorldHint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 73.19588% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.79%. Comparing base (6f63ac0) to head (c13c6b1).

Files with missing lines Patch % Lines
pkg/toxicflow/collect.go 12.30% 57 Missing ⚠️
pkg/toxicflow/probe.go 43.47% 13 Missing ⚠️
pkg/toxicflow/inference.go 80.70% 6 Missing and 5 partials ⚠️
pkg/llm/client/client.go 77.77% 5 Missing and 5 partials ⚠️
pkg/toxicflow/classify.go 92.36% 6 Missing and 4 partials ⚠️
pkg/toxicflow/types.go 83.33% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5403      +/-   ##
==========================================
+ Coverage   68.77%   68.79%   +0.01%     
==========================================
  Files         629      636       +7     
  Lines       63914    64301     +387     
==========================================
+ Hits        43955    44233     +278     
- Misses      16703    16797      +94     
- Partials     3256     3271      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codespell reads the camelCase identifier "atLeast" as the typo "at least". Rename the Confidence helper to a synonym that carries the same meaning without tripping the dictionary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026
@github-actions github-actions Bot dismissed their stale review May 29, 2026 12:56

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant