2 changes: 1 addition & 1 deletion README.md
@@ -111,7 +111,7 @@ Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduct
- [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring
- [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers
- [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection
- [Comparison with other frameworks](https://agentv.dev/docs/reference/comparison/) — vs Braintrust, Langfuse, LangSmith, LangWatch
- [Ecosystem](https://agentv.dev/docs/reference/comparison/) — how AgentV fits with Agent Control and Langfuse

## Development

68 changes: 18 additions & 50 deletions apps/web/src/components/Lander.astro
@@ -187,72 +187,40 @@ tests:
</div>
</section>

<!-- Comparison Section -->
<!-- Ecosystem Section -->
<section class="av-comparison">
<div class="av-container">
<h2 class="av-section-heading">How AgentV Compares</h2>
<h2 class="av-section-heading">Built for the AI Agent Lifecycle</h2>
<div class="av-table-card av-reveal">
<div class="av-table-scroll">
<div class="av-table-fade"></div>
<table>
<thead>
<tr>
<th>Feature</th>
<th class="av-col-highlight">AgentV</th>
<th>LangWatch</th>
<th>LangSmith</th>
<th>Langfuse</th>
<th>Layer</th>
<th class="av-col-highlight">Tool</th>
<th>When</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr>
<td>Setup</td>
<td class="av-col-highlight"><code>npm install</code></td>
<td>Cloud account + API key</td>
<td>Cloud account + API key</td>
<td>Cloud account + API key</td>
<td>Evaluate</td>
<td class="av-col-highlight"><strong>AgentV</strong></td>
<td>Pre-production</td>
<td>Score agents, detect regressions, gate CI/CD</td>
</tr>
<tr>
<td>Server</td>
<td class="av-col-highlight">None (local)</td>
<td>Managed cloud</td>
<td>Managed cloud</td>
<td>Managed cloud</td>
<td>Govern</td>
<td><a href="https://github.com/agentcontrol/agent-control">Agent Control</a></td>
<td>Runtime</td>
<td>Enforce policies on agent actions</td>
</tr>
<tr>
<td>Privacy</td>
<td class="av-col-highlight">All local</td>
<td>Cloud-hosted</td>
<td>Cloud-hosted</td>
<td>Cloud-hosted</td>
</tr>
<tr>
<td>CLI-first</td>
<td class="av-col-highlight"><span class="av-check-badge">&#10003;</span></td>
<td><span class="av-cross">&#10007;</span></td>
<td>Limited</td>
<td>Limited</td>
</tr>
<tr>
<td>CI/CD ready</td>
<td class="av-col-highlight"><span class="av-check-badge">&#10003;</span></td>
<td>Requires API calls</td>
<td>Requires API calls</td>
<td>Requires API calls</td>
</tr>
<tr>
<td>Version control</td>
<td class="av-col-highlight"><span class="av-check-badge">&#10003;</span> YAML in Git</td>
<td><span class="av-cross">&#10007;</span></td>
<td><span class="av-cross">&#10007;</span></td>
<td><span class="av-cross">&#10007;</span></td>
</tr>
<tr>
<td>Evaluators</td>
<td class="av-col-highlight">Code + LLM + Custom</td>
<td>LLM only</td>
<td>LLM + Code</td>
<td>LLM only</td>
<td>Observe</td>
<td><a href="https://github.com/langfuse/langfuse">Langfuse</a></td>
<td>Runtime</td>
<td>Trace execution, monitor production</td>
</tr>
</tbody>
</table>
197 changes: 77 additions & 120 deletions apps/web/src/content/docs/docs/reference/comparison.mdx
@@ -1,126 +1,83 @@
---
title: Comparison
description: How AgentV compares to other evaluation frameworks.
title: Ecosystem
description: How AgentV fits into the AI agent lifecycle alongside complementary tools.
---

## Quick Comparison

| Aspect | **AgentV** | **Braintrust** | **Langfuse** | **LangSmith** | **LangWatch** | **Google ADK** | **Mastra** | **OpenCode Bench** |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| **Primary Focus** | Agent evaluation & testing | Evaluation + logging | Observability + evaluation | Observability + evaluation | LLM ops & evaluation | Agent development | Agent/workflow development | Coding agent benchmarking |
| **Language** | TypeScript/CLI | Python/TypeScript | Python/JavaScript | Python/JavaScript | Python/JavaScript | Python | TypeScript | Python/CLI |
| **Deployment** | Local (CLI-first) | Cloud | Cloud/self-hosted | Cloud only | Cloud/self-hosted/hybrid | Local/Cloud Run | Local/server | Benchmarking service |
| **Self-contained** | Yes | No (cloud) | No (requires server) | No (cloud-only) | No (requires server) | Yes | Yes (optional) | No (requires service) |
| **Evaluation Focus** | Core feature | Core feature | Yes | Yes | Core feature | Minimal | Secondary | Core feature |
| **Judge Types** | Code + LLM (custom prompts) | Code + LLM (custom) | LLM-as-judge only | LLM-based + custom | LLM + real-time | Built-in metrics | Built-in (minimal) | Multi-judge LLM (3 judges) |
| **CLI-First** | Yes | No (SDK-first) | Dashboard-first | Dashboard-first | Dashboard-first | Code-first | Code-first | Service-based |
| **Open Source** | MIT | Closed source | Apache 2.0 | Closed | Closed | Apache 2.0 | MIT | Open source |
| **Setup Time** | &lt; 2 min | 5+ min | 15+ min | 10+ min | 20+ min | 30+ min | 10+ min | 5-10 min |

## AgentV vs. Braintrust

| Feature | AgentV | Braintrust |
|---------|--------|-----------|
| **Evaluation** | Code + LLM (custom prompts) | Code + LLM (Autoevals library) |
| **Deployment** | Local (no server) | Cloud-only (managed) |
| **Open source** | MIT | Closed source |
| **Pricing** | Free | Free tier + paid plans |
| **CLI-first** | Yes | SDK-first (Python/TS) |
| **Custom judge prompts** | Markdown files (Git) | SDK-based |
| **Observability** | No | Yes (logging, tracing) |
| **Datasets** | YAML/JSONL in Git | Managed in platform |
| **CI/CD** | Native (exit codes) | API-based |
| **Collaboration** | Git-based | Web dashboard |

**Choose AgentV if:** You want local-first evaluation, open source, version-controlled evals in Git.
**Choose Braintrust if:** You want a managed platform with built-in logging, datasets, and team collaboration.

## AgentV vs. Langfuse

| Feature | AgentV | Langfuse |
|---------|--------|----------|
| **Evaluation** | Code + LLM (custom prompts) | LLM only |
| **Local execution** | Yes | No (requires server) |
| **Speed** | Fast (no network) | Slower (API round-trips) |
| **Setup** | `npm install` | Docker + database |
| **Cost** | Free | Free + $299+/mo for production |
| **Observability** | No | Full tracing |
| **Custom judge prompts** | Version in Git | API-based |
| **CI/CD ready** | Yes | Requires API calls |

**Choose AgentV if:** You iterate locally on evals, need deterministic + subjective judges together.
**Choose Langfuse if:** You need production observability + team dashboards.

## AgentV vs. LangSmith

| Feature | AgentV | LangSmith |
|---------|--------|-----------|
| **Evaluation** | Code + LLM custom | LLM-based (SDK) |
| **Deployment** | Local (no server) | Cloud only |
| **Framework lock-in** | None | LangChain ecosystem |
| **Open source** | MIT | Closed |
| **Local execution** | Yes | No (requires API calls) |
| **Observability** | No | Full tracing |

**Choose AgentV if:** You want local evaluation, deterministic judges, open source.
**Choose LangSmith if:** You're LangChain-heavy, need production tracing.

## AgentV vs. LangWatch

| Feature | AgentV | LangWatch |
|---------|--------|-----------|
| **Evaluation focus** | Development-first | Team collaboration first |
| **Execution** | Local | Cloud/self-hosted server |
| **Custom judge prompts** | Markdown files (Git) | UI-based |
| **Code judges** | Yes | LLM-focused |
| **Setup** | &lt; 2 min | 20+ min |
| **Team features** | No | Annotation, roles, review |

**Choose AgentV if:** You develop locally, want fast iteration, prefer code judges.
**Choose LangWatch if:** You need team collaboration, managed optimization, on-prem deployment.

## AgentV vs. Google ADK

| Feature | AgentV | Google ADK |
|---------|--------|-----------|
| **Purpose** | Evaluation | Agent development |
| **Evaluation capability** | Comprehensive | Built-in metrics only |
| **Setup** | &lt; 2 min | 30+ min |
| **Code-first** | YAML-first | Python-first |

**Choose AgentV if:** You need to evaluate agents (not build them).
**Choose Google ADK if:** You're building multi-agent systems.

## AgentV vs. Mastra

| Feature | AgentV | Mastra |
|---------|--------|--------|
| **Purpose** | Agent evaluation & testing | Agent/workflow development framework |
| **Evaluation** | Core focus (code + LLM judges) | Secondary, built-in only |
| **Agent Building** | No (tests agents) | Yes (builds agents with tools, workflows) |
| **Open Source** | MIT | MIT |

**Choose AgentV if:** You need to test/evaluate agents.
**Choose Mastra if:** You're building TypeScript AI agents and need orchestration.

## When to Use AgentV

**Best for:** Individual developers and teams that evaluate locally before deploying, and need custom evaluation criteria.

**Use something else for:**
- Production observability → Langfuse or LangWatch
- Team dashboards → LangWatch, Langfuse, or Braintrust
- Building agents → Mastra (TypeScript) or Google ADK (Python)
- Standardized benchmarking → OpenCode Bench

## Ecosystem Recommendation
AgentV is the **evaluation layer** in the AI agent lifecycle. It works alongside runtime governance and observability tools; each layer handles a distinct concern.

## The Three Layers

| Layer | Tool | Question it answers |
|-------|------|-------------------|
| **Evaluate** (pre-production) | [AgentV](https://github.com/EntityProcess/agentv) | "Is this agent good enough to deploy?" |
| **Govern** (runtime) | [Agent Control](https://github.com/agentcontrol/agent-control) | "Should this action be allowed?" |
| **Observe** (runtime) | [Langfuse](https://github.com/langfuse/langfuse) | "What is the agent doing in production?" |

### AgentV — Evaluate

Offline evaluation and testing. Run eval cases against agents, score with deterministic code graders + LLM judges, detect regressions, gate CI/CD pipelines. Everything lives in Git.

```
agentv eval evals/my-agent.yaml
```
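
For reference, an eval file might look roughly like this. It is a hypothetical sketch: the key names below are illustrative assumptions, not AgentV's documented schema.

```yaml
# Hypothetical eval file; key names are illustrative, not AgentV's schema.
description: Support agent answers password questions
tests:
  - input: "How do I reset my password?"
    expect:
      contains: "reset link"            # deterministic code grader
  - input: "Summarize our refund policy"
    judge:
      prompt: judges/helpfulness.md     # LLM judge with a custom prompt, versioned in Git
```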

### Agent Control — Govern

Runtime guardrails. Intercepts agent actions (tool calls, API requests) and evaluates them against configurable policies. Deny, steer, warn, or log — without changing agent code. Pluggable evaluators with confidence scoring.

### Langfuse — Observe

Production observability. Traces agent execution with explicit Tool/LLM/Retrieval observation types, ingests evaluation scores, and provides dashboards for debugging and monitoring. Self-hostable.

## How They Connect

Build agents (Mastra / Google ADK)
Evaluate locally (AgentV)
Block regressions in CI/CD (AgentV)
Monitor in production (Langfuse / LangWatch / Braintrust)

```
Define evals (YAML in Git)
        |
        v
Run evals locally or in CI (AgentV)
        |
        v
Deploy agent to production
        |
        v
Enforce policies on tool calls (Agent Control)
        |                          |
        v                          v
Trace execution (Langfuse)    Log violations (Agent Control)
        |
        v
Feed production traces back into evals (AgentV)
```

The feedback loop is key: Langfuse traces surface real-world failures that become new AgentV eval cases. Agent Control deny/steer events identify safety gaps that become new test scenarios.
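That handoff can be sketched in a few lines. This is a hypothetical TypeScript sketch: the `Trace` and `EvalCase` shapes are simplified assumptions for illustration, not the real Langfuse or AgentV schemas.

```typescript
// Sketch: promote a low-scoring production trace to a new eval case.
// Both shapes are simplified assumptions, not real Langfuse/AgentV schemas.
interface Trace {
  input: string;
  output: string;
  score: number; // 0..1, from a production evaluator
}

interface EvalCase {
  input: string;
  expected: string; // placeholder until a human writes the ideal answer
  source: string;
}

function traceToEvalCase(trace: Trace): EvalCase | null {
  // Only low-scoring traces are worth promoting to eval cases.
  if (trace.score >= 0.5) return null;
  return {
    input: trace.input,
    expected: trace.output, // to be corrected during review
    source: "production-trace",
  };
}
```

A human still reviews each promoted case and replaces the captured output with the expected one before it lands in Git.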

## Traditional Software Analogy

The three layers map onto familiar roles in traditional software:

| Traditional | AI Agent Equivalent |
|------------|-------------------|
| Test suite (Jest, pytest) | **AgentV** |
| WAF / auth middleware | **Agent Control** |
| APM / logging (Datadog) | **Langfuse** |

## When to Use What

**AgentV** handles:
- Eval definition and execution
- Code + LLM graders
- Regression detection and CI/CD gating
- Multi-provider A/B comparison
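
CI/CD gating works off the CLI's exit code: a failing eval returns non-zero, which fails the job. A minimal GitHub Actions sketch (the install step and eval path are illustrative assumptions):

```yaml
# Hypothetical workflow; the install step and eval path are illustrative.
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm install                           # assumes agentv is a dev dependency
      - run: npx agentv eval evals/my-agent.yaml   # non-zero exit fails the job
```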

**Agent Control** handles:
- Runtime policy enforcement (deny/steer/warn/log)
- Pre/post execution evaluation of agent actions
- Pluggable evaluators (regex, JSON, SQL, LLM-based)
- Centralized control plane with dashboard
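
A pluggable evaluator of the regex flavor might look like the following. The `Evaluation` interface and function signature are illustrative assumptions, not Agent Control's actual API.

```typescript
// Sketch of a pluggable runtime evaluator; the interface is an assumption,
// not Agent Control's actual API.
type Verdict = "allow" | "deny" | "warn";

interface Evaluation {
  verdict: Verdict;
  confidence: number; // 0..1
  reason?: string;
}

// Deny tool calls whose serialized arguments match any blocked pattern.
function regexEvaluator(blocked: RegExp[], toolArgs: string): Evaluation {
  for (const pattern of blocked) {
    if (pattern.test(toolArgs)) {
      return { verdict: "deny", confidence: 1, reason: `matched ${pattern}` };
    }
  }
  return { verdict: "allow", confidence: 1 };
}
```

In this sketch the regex evaluator reports full confidence either way; an LLM-based evaluator would return fractional confidence instead.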

**Langfuse** handles:
- Production tracing with agent-native observation types
- Live evaluation automation on trace ingestion
- Score ingestion from external evaluators
- Team dashboards and debugging