
Commit 4ba8992

andrii-harbour, tupizz, and caio-pizzol authored
feat: refactor MCP, fix Codex errors, reorganize AI agents documentation (#2711)
* feat: refactor MCP, fix Codex errors, reorganize AI agents documentation and add new content

  - Removed outdated GDPval benchmark command from evals section in AGENTS.md.
  - Updated the structure of the docs.json file to categorize AI agents under "MCP" and "Agents" groups, adding new pages for MCP and skills.
  - Introduced new documentation files for best practices, debugging, eval results, integrations, and skills, providing comprehensive guidance on using SuperDoc tools with LLMs.
  - Added detailed instructions on how to use the MCP server and its debugging features, enhancing the overall documentation for better user experience.

* feat(evals): Level 3 DOCX agent benchmark suite (#2664)

* feat(evals): add extractDocxText utility for benchmark text extraction

* feat(evals): add benchmarkMetrics assertion for Level 3 benchmark

* feat(evals): add Claude Code benchmark provider for Level 3

* feat(evals): add Codex benchmark provider for Level 3

* feat(evals): add 18 benchmark tasks for Level 3 agent comparison

* feat(evals): add benchmark report generator for Level 3

* feat(evals): add Level 3 benchmark Promptfoo config with 10 conditions

* fix(evals): fix providers and assertions for Level 3 benchmark

  - Fix cwd ENOENT: create stateDir before passing to SDK query()
  - Fix Claude Code provider: clean up, remove pathToClaudeCodeExecutable hacks
  - Fix Codex provider: match real SDK API (command_execution items, approvalPolicy)
  - Fix test assertions: match actual fixture content
    - contract.docx -> report-with-formatting.docx for heading tasks
    - [Employee Name] -> [Candidate Name] for employment-offer.docx
    - Fix $150M collateral check (XML extraction splits as "1 50")
  - Upgrade @anthropic-ai/claude-agent-sdk to ^0.2.87

* fix(evals): fix sandbox writes, add useClaudeSettings, MCP support

  - Copy fixture into stateDir so agents can write within their sandbox
  - Add stateDir fallback for output file detection
  - Add useClaudeSettings option to inherit local Claude Code config (MCP servers, skills, CLAUDE.md) via settingSources
  - Add CC-local condition for testing with user's own Claude Code setup
  - Wire superdocMcp config to attach SuperDoc MCP server via mcpServers
  - Add preeval:benchmark script to build MCP server before runs
  - Add model, maxTurns, systemPrompt config options

* test(evals): add e2e smoke test for Level 3 benchmark providers

  Standalone test script that verifies both providers end-to-end:
  - Claude baseline read/edit (without SuperDoc)
  - Claude superdoc-skill with MCP (superdoc_open → get_content → close)
  - Claude local with useClaudeSettings
  - Codex baseline read/edit (without SuperDoc)
  - Codex with SuperDoc MCP

  Run: node evals/scripts/smoke-test-benchmark.mjs --claude --codex

* feat(evals): enforce SuperDoc MCP usage via system prompt and AGENTS.md

  - Add system prompt for superdoc conditions instructing agents to use SuperDoc MCP tools exclusively, not raw unzip/XML
  - Write AGENTS.md in working directory reinforcing SuperDoc tool usage
  - Restrict CC-superdoc-skill allowedTools to Read/Glob/Grep (no Bash) so agents cannot fall back to raw DOCX manipulation
  - Add prompt reinforcement for Codex superdoc conditions
  - Verified: Claude superdoc-skill read + edit both use MCP exclusively (superdoc_open → search → edit → save → close, zero Bash calls)

* fix(evals): pass OPENAI_API_KEY to Codex SDK, update smoke tests

  - Pass process.env.OPENAI_API_KEY to new Codex({ apiKey }) so the SDK uses API key auth instead of relying on codex login session
  - Add Claude edit + MCP tests to smoke test script
  - Verified: Codex baseline read + edit pass with API key auth
  - Known: Codex MCP calls fail due to rmcp protocol incompatibility in the Codex CLI (serde error on tool calls, Transport closed)

* fix(mcp,evals): fix stdout corruption killing Codex MCP transport

  Root cause: console.debug('[super-editor] Telemetry: enabled') in Editor.ts writes to stdout when superdoc_open initializes the editor. The Codex CLI's Rust MCP client (rmcp) parses stdout as JSON-RPC and dies with "serde error expected value at line 1 column 2" on the non-JSON line, closing the transport.

  Fixes:
  - Redirect all console methods (log/info/debug/warn) to stderr in the MCP server entry point, before any imports run
  - Add mcp_auto_approve config for Codex to auto-approve MCP tool calls (approval_policy=never only covers shell commands, not MCP)
  - Add stdio wrapper script for transport debugging (logs raw bytes)
  - Use runStreamed() in Codex provider to capture full MCP event lifecycle
  - Pass minimal env to prevent other stdout pollution from deps
  - Add preflight check for MCP server build artifact

* refactor(evals): trim benchmark to 6 compact tasks for v1

  Reduce from 18 to 6 tasks (3 reading + 3 editing) for faster iteration. Full suite: 12 runs in 3 minutes, 100% pass rate on Codex baseline + superdoc-skill conditions.

  Tasks: extract headings, extract entities, extract financials, replace entity, insert section, fill placeholders.

* fix(evals): fix report generator to extract metrics from parsed output

* feat(evals): improve benchmark report with full AC metrics

  - Add per-task detail table with every metric per condition
  - Add input/output token breakdown (not just total)
  - Add p95 latency alongside median
  - Add estimated cost per task (based on model token pricing)
  - Add comprehensive recommendation with latency, token, cost, steps, and collateral comparisons between conditions
  - Fix task description extraction from vars.task fallback

* feat(evals): split benchmark metrics into individual Promptfoo columns

  Replace single benchmarkMetrics assertion with separate per-metric assertions (steps, latency, tokens, path), each with its own metric tag. Promptfoo displays these as individual columns with actual numeric values instead of a single "efficiency 1.00" score.

  Columns visible in UI: correctness, collateral, steps, latency, tokens, path

* fix(evals): create superdoc CLI wrapper on PATH for superdoc-cli condition

  The superdocOnPath flag was a no-op because the SuperDoc CLI was never installed as a binary on PATH. Now creates a shell wrapper script in the stateDir's bin/ that delegates to apps/cli/dist/index.js, and prepends it to the agent's PATH.

  Finding: even with superdoc on PATH, Codex doesn't discover or use it without explicit instruction. All superdoc-cli runs fall back to raw unzip/XML. This is valid benchmark data.

* feat(evals): enforce SuperDoc usage and fail when agents don't use it

  - benchmarkPath assertion now FAILS when superdoc-skill or superdoc-cli conditions don't use SuperDoc (was always passing before)
  - Add AGENTS.md + prompt hint for superdoc-cli condition telling agents the CLI exists on PATH with common commands
  - Split MCP and CLI AGENTS.md templates in both providers
  - Verified: all 3 Codex conditions use correct path (baseline=raw, superdoc-skill=MCP, superdoc-cli=CLI)

* feat(evals): add _summary field for readable Promptfoo cell previews

  Add a _summary line at the top of provider JSON output showing path | steps | latency | tokens at a glance. Promptfoo renders the start of the output in each table cell, so this gives immediate visibility without clicking into the detail view.

* feat(evals): add derivedMetrics and weight:0 for info-only metrics

  - Add derivedMetrics: avg_latency, avg_steps, avg_tokens, superdoc_usage_pct - computed per provider after evaluation
  - Set weight: 0 on steps/latency/tokens assertions so they report values without affecting pass/fail score
  - Only correctness, collateral, and path drive pass/fail
  - Click "Show Charts" in Promptfoo UI for visual comparison

* feat(evals): add unit labels to metric names for self-documenting UI

* revert(evals): restore original metric names

* feat(evals): add Anthropic vendor DOCX skill to benchmark matrix

  Add the Anthropic DOCX skill (from anthropics/skills repo) as the vendor condition. When vendorSkill: true, the skill is installed as AGENTS.md in the working directory, teaching agents to use unzip/XML for reading and docx-js for creation.

  This completes the benchmark matrix:
  - baseline: no skill, agent figures it out
  - vendor: Anthropic's DOCX skill (unzip + docx-js)
  - superdoc-skill: SuperDoc MCP server
  - superdoc-cli: SuperDoc CLI on PATH
  - choice: all available, agent picks

* refactor(evals): clean up benchmark config to 4 conditions × 2 agents

* fix(evals): use CLAUDE.md instead of AGENTS.md for Claude Code provider

  Claude Agent SDK reads CLAUDE.md (not AGENTS.md) for project context. Write vendor skill and CLI instructions as CLAUDE.md in the stateDir, and enable settingSources: ['project'] so the SDK loads it.

* feat(docs): document Level 3 DOCX agent benchmark in CLAUDE.md

* docs(evals): add guide for reading Level 3 benchmark results

* docs(evals): add PRD for benchmark v2 document fidelity scoring

* Revert "docs(evals): add PRD for benchmark v2 document fidelity scoring"

  This reverts commit 85108ac.

* feat(evals): add DOCX fidelity checker utility

* feat(evals): add v2 fixture documents with rich formatting

  Creates 4 DOCX fixtures designed to be fragile under raw XML edits:
  - consulting-agreement.docx: bold defined terms, italic refs, 6 heading sections, $250k indemnification cap, net 45 payment terms
  - pricing-proposal.docx: 4-row pricing table with shaded header, right-aligned prices, US Letter page size
  - contract-redlines.docx: 3 tracked insertions + 2 deletions by Jane Editor, 2 reviewer comments by Bob Reviewer
  - policy-manual.docx: 3-level nested numbered list (1./1.1/a)), header/footer with page numbers, page breaks between sections

  Adds create-v2-fixtures.mjs generator script and docx@9.6.1 dev dependency.

* feat(evals): add benchmarkFidelity and benchmarkDiff assertions

* feat(evals): add 6 fidelity-sensitive v2 benchmark tasks

* feat(evals): add benchmark v2 with document fidelity scoring

  New capabilities:
  - docx-fidelity.mjs: OOXML structural checker (formatting, styles, numbering, tracked changes, comments, tables, XML diff)
  - benchmarkFidelity assertion: runs fidelity checks on output DOCX
  - benchmarkDiff assertion: measures XML change ratio (surgical vs rewrite)

  New fixtures (all synthetic names):
  - consulting-agreement.docx: bold terms, italic refs, numbered sections
  - pricing-proposal.docx: table with alignment and styled header
  - contract-redlines.docx: existing tracked changes and comments
  - policy-manual.docx: 3-level nested numbered lists

  6 new fidelity tasks (CEO examples):
  - Mixed formatting replace (bold preservation)
  - Table cell edit (structure preservation)
  - Tracked changes edit (annotation survival)
  - Nested list insert (numbering continuation)
  - Multi-step workflow (heading style check)
  - Edit with existing annotations (comment survival)

  92 tests total: 69 checks.cjs + 23 docx-fidelity

* fix(evals): fix 3 fidelity assertion bugs found in first v2 run

  1. outputFile pointed to unedited fixture copy instead of localDocPath (the file the agent actually edits in stateDir)
  2. Comment IDs in fidelity checks used "0","1" but fixture has "1","2"
  3. Table cell text used exact match instead of includes
  4. Remove overly strict paragraphStyle check on multi-step task

* feat(evals): redesign v2 tasks around proven SuperDoc advantages

  Category A — Structural creation (SuperDoc proven):
  - Create heading with Heading1 style
  - Create table with borders and data rows

  Category B — Formatting (SuperDoc proven):
  - Make specific text bold
  - Replace text preserving formatting

  Category C — Complex edits (track improvement):
  - Tracked change replacement
  - Add comment to clause

* fix(evals): stop loading user MCP servers, reduce token cost 30%

  Remove settingSources which loaded ALL user MCP servers (43 Linear, 5 Excalidraw, Gmail, etc.) adding ~4000 tokens per turn. Pass CLAUDE.md content as systemPrompt instead.

  Result: 30% cost reduction ($0.97 -> $0.68 for NDA creation).

* docs(evals): add benchmark findings and next steps document

* fix(evals): set settingSources: [] for SDK isolation mode

* docs(evals): add MCP efficiency analysis with prioritized fixes

* refactor(evals): update provider labels in benchmark configuration for clarity

  Changed labels for several providers in the promptfooconfig.benchmark.yaml file to better reflect their functionality, including renaming 'CC-vendor' to 'CC-with-docx-skill', 'CC-superdoc-skill' to 'CC-superdoc-mcp', and others for consistency and improved understanding.

* feat(evals): update agent conditions and documentation for SuperDoc MCP usage

* feat(sdk): optimize tool definitions and prompts for efficient MCP workflows (#2722)

* feat(sdk): update tool definitions for efficient multi-block workflows

  - superdoc_edit: emphasize markdown insert for multi-section creation
  - superdoc_create: direct to markdown/mutations for multiple items
  - superdoc_mutations: document create steps and batch format pattern
  - superdoc_format: direct to mutations for multi-item formatting
  - superdoc_search: clarify ref lifecycle within vs across batches
  - system-prompt: add efficient document creation workflow

* feat(evals,sdk): add efficient workflow patterns to all agent touchpoints

  - Update provider SUPERDOC_SYSTEM_PROMPT with markdown insert and mutations batch examples (what CC actually reads as system prompt)
  - Update Codex AGENTS.md with same efficient patterns
  - Update MCP header prompt with "when to use which tool" guide
  - Increase CC maxTurns from 20 to 35 (both CC failures were at 21)
  - Regenerate SDK artifacts and rebuild MCP server

* feat(evals): enable tool search to reduce token overhead

* docs(ai): add markdown insert pattern and formatting guidance

* docs(ai): add efficient patterns to MCP how-to-use guide

* fix(evals): remove debug console.log that dumped every SDK message

* feat(document-api): add alignment field to StyleApplyStep and StyleApplyInput types

* fix(document-api): keep inline required on StyleApplyInput, guard optional inline in step executors

* feat(document-api): add alignment to format.apply step JSON schema

* feat(super-editor): support alignment in format.apply mutation step

* docs(sdk): update tool descriptions to show alignment inside format.apply step

* feat(document-api): add scope: block to format.apply for full-paragraph formatting

* feat(document-api): allow placement and BlockNodeAddress target for markdown inserts

* chore: regenerate SDK artifacts and docs from updated contract

* feat(evals): add new NDA documents and implement interactive DOCX output reviewer

* fix: address PR review — minProperties, RichContentInsertInput type, deduplicate alignment constant

* Revert "fix: address PR review — minProperties, RichContentInsertInput type, deduplicate alignment constant"

  This reverts commit 4c04ebd.

* fix(document-api): add minProperties, type export, shared alignment constant

* docs(sdk): require fontSize on headings after markdown insert

* docs(sdk): context-driven formatting guidance for markdown inserts

* docs(sdk): only set properties explicitly present in document blocks

* feat(super-editor): resolve default fontSize in get_content blocks response

* fix(super-editor): fallback to 10pt default when styles omit fontSize

* fix(super-editor): resolve fontSize per-block via style chain in get_content

* test(super-editor): add fontSize style chain resolution tests for blocks.list

* docs(sdk): guide agents to match uppercase title conventions

* feat(document-api): update JSON schema and documentation for mutations and system prompts

* refactor: enhance evaluation suite with new configurations and documents

  - Updated .gitignore to include new artifacts and temporary files.
  - Refactored package.json scripts for improved evaluation commands and added a clean script.
  - Introduced new configuration files for benchmark and execution tests, enhancing the evaluation framework.
  - Added detailed documentation on efficiency analysis and findings from the Level 3 benchmark.

* refactor(evals): update entity names in documentation and tasks

* feat(docs): add AI documentation and enhance getting started guide

* fix(evals): update execution promptfoo configuration and remove obsolete documents

  - Added a blank line in the execution promptfoo configuration for clarity.
  - Deleted outdated efficiency analysis, findings, and how-to-read-results documents to streamline the documentation.

* chore(evals): update README and remove obsolete DOCX files

* feat: improve agent redline targeting and validation (#2764)

* fix: refresh lockfile for evals deps

* fix: update documentation and address comments

  - Added new routing in docs.json for the getting started AI overview.
  - Updated links in best-practices.mdx, debugging.mdx, and integrations.mdx to reflect new paths.
  - Adjusted eval-results.mdx to correct the number of models tested and updated references to LLM tools.
  - Removed outdated getting-started/ai.mdx and system-prompt-mcp.md files.
  - Enhanced error handling in mcp-stdio-wrapper.mjs and updated paths in various scripts and configurations.
  - Refactored benchmark scripts and configurations to improve clarity and functionality.

* refactor: consolidate shared logic for benchmark providers

  - Introduced a new `agent-harness.mjs` file to centralize common functionality for Claude Code and Codex benchmark providers.
  - Refactored existing code in `claude-code-agent.mjs` and `codex-agent.mjs` to utilize shared methods for setup, preflight checks, and skill/CLI installation.
  - Updated paths and removed redundant code to enhance clarity and maintainability.
  - Adjusted test fixtures path in `docx-fidelity.test.mjs` for consistency.

* docs: update LLM tools documentation with action details

* feat(session-manager): add telemetry metadata for document editing source

* docs: add new AI getting started page with redirect to overview

* chore: update pnpm-lock.yaml with new dependencies and version updates

  - Updated `@inquirer/checkbox` and `@inquirer/confirm` dependencies to use the latest types.
  - Cleaned up optional dependencies and ensured compatibility with existing packages.

* docs(sd-2451): align AI doc voice with brand guidelines (#2802)

  Voice pass against brand.md for the new AI/MCP docs. No content changes — just phrasing that matched brand rules more directly.

  - skills.mdx: drop "COMING SOON" tag and rephrase "coming soon" to "we haven't shipped skills yet" per brand voice rule preferring "we haven't built that yet" over roadmap language
  - llm-tools.mdx Note: rewrite "more tools are being added" to lead with what works today and point to custom tools
  - llm-tools.mdx: simplify "enforces mutual exclusivity constraints" to "checks that arguments are compatible"
  - eval-results.mdx: simplify "any scenario where latency is not the primary constraint" to "any case where speed doesn't matter"
  - best-practices.mdx: split semicolon heading to use a dash

---------

Co-authored-by: Tadeu Tupinambá <tadeu.tupiz@gmail.com>
Co-authored-by: Caio Pizzol <97641911+caio-pizzol@users.noreply.github.com>
1 parent 47777f5 commit 4ba8992

110 files changed

Lines changed: 100005 additions & 1578 deletions

AGENTS.md

Lines changed: 51 additions & 3 deletions
@@ -118,22 +118,70 @@ Many packages use `.js` files with JSDoc `@typedef` for type definitions (e.g.,

 ## AI Eval Suite

-The `evals/` directory contains a Promptfoo-based evaluation suite for validating AI tool call quality.
+The `evals/` directory contains a Promptfoo-based evaluation suite with three levels of evaluation.
+
+### Level 1: Deterministic Evals (tool selection + argument accuracy)

 | Command | What it does | Cost |
 |---------|-------------|------|
 | `pnpm --filter @superdoc-testing/evals run eval` | Run deterministic evals (reading + argument tests) | ~$0.30 |
 | `pnpm --filter @superdoc-testing/evals run eval:reading` | Run reading tool tests only | ~$0.15 |
-| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark (Model+SuperDoc vs Model-Only) | ~$1-2 |
 | `pnpm --filter @superdoc-testing/evals run eval:view` | Open Promptfoo web UI with results | Free |
 | `pnpm --filter @superdoc-testing/evals run baseline:save <label>` | Save versioned results snapshot | Free |

 Tool definitions are extracted from `packages/sdk/tools/` via `evals/tools/extract.mjs`. Run `pnpm run generate:all` first if SDK artifacts are missing.

-Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (Level 1: tool selection + argument accuracy, not execution).
+Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (tool selection + argument accuracy, not execution).

 The system prompt at `evals/prompts/agent.txt` is a copy of the proven prompt from `examples/eval-demo/lib/agent.ts`. Update both when changing the prompt.

+### Level 2: GDPval Benchmark (Model+SuperDoc vs Model-Only)
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark | ~$1-2 |
+
+### Level 3: DOCX Agent Benchmark (real agents, real documents)
+
+Runs actual Claude Code and Codex CLIs against DOCX tasks, comparing their performance with and without SuperDoc tools. 4 conditions x 2 agents x N tasks.
+
+**Conditions:**
+
+| Condition | What the agent gets |
+|-----------|-------------------|
+| baseline | No skill, agent figures out DOCX on its own |
+| baseline-with-docx-skill | Anthropic's DOCX skill (unzip + XML editing) |
+| superdoc-mcp | SuperDoc MCP server (`superdoc_open`, `superdoc_get_content`, etc.) |
+| superdoc-cli | SuperDoc CLI on PATH |
+
+**Tasks:** 3 reading (extract headings, entity names, financial figures) + 3 editing (replace entity name, insert section, fill placeholders).
+
+**Metrics per task:** correctness (pass/fail), collateral (no unintended changes), steps (agent turn count), latency (seconds), tokens (input + output), path (which DOCX approach was used).
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark` | Run full benchmark | ~15 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:codex` | Run Codex conditions only | ~8 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:claude` | Run Claude Code conditions only | ~8 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:report` | Generate comparison report (Markdown + CSV) | Free |
+
+**Prerequisites:**
+- `OPENAI_API_KEY` in `evals/.env` (for Codex; use `codex login --with-api-key` for API key auth)
+- Claude Code installed locally (uses local auth, no API key needed in `.env`)
+- MCP server built: `cd apps/mcp && pnpm run build`
+- CLI built: check `apps/cli/dist/index.js` exists
+
+**Key files:**
+
+| File | Purpose |
+|------|---------|
+| `evals/config/benchmark.promptfoo.yaml` | Level 3 Promptfoo config (8 providers) |
+| `evals/suites/benchmark/tests/agent-benchmark-v2.yaml` | Benchmark tasks with assertions |
+| `evals/providers/claude-code-agent.mjs` | Claude Agent SDK provider |
+| `evals/providers/codex-agent.mjs` | Codex SDK provider |
+| `evals/suites/benchmark/reports/benchmark-report.mjs` | Markdown + CSV report generator |
+| `evals/fixtures/vendor/vendor-docx-skill.md` | Anthropic's DOCX skill for baseline-with-docx-skill condition |

 ## Generated Artifacts

 These directories are produced by `pnpm run generate:all`:
apps/cli/src/__tests__/lib/validate-type-spec.test.ts

Lines changed: 36 additions & 0 deletions
@@ -72,6 +72,42 @@ describe('validateValueAgainstTypeSpec – oneOf with mixed schemas', () => {
   });
 });

+describe('validateValueAgainstTypeSpec – repeated actionable oneOf errors', () => {
+  const repeatedUnknownKeySchema: CliTypeSpec = {
+    oneOf: [
+      {
+        type: 'object',
+        properties: {
+          id: { type: 'string' },
+          op: { const: 'text.rewrite' },
+        },
+        required: ['id', 'op'],
+      },
+      {
+        type: 'object',
+        properties: {
+          id: { type: 'string' },
+          op: { const: 'text.insert' },
+        },
+        required: ['id', 'op'],
+      },
+    ],
+  };
+
+  test('surfaces the shared nested schema error instead of the generic oneOf message', () => {
+    try {
+      validateValueAgainstTypeSpec({ id: 'r1', op: 'text.rewrite', '},{': ':' }, repeatedUnknownKeySchema, 'steps[0]');
+      throw new Error('Expected CliError to be thrown');
+    } catch (error) {
+      const cliError = error as CliError;
+      expect(cliError.message).toBe('steps[0].},{ is not allowed by schema.');
+      expect((cliError.details as { selectedError?: string }).selectedError).toBe(
+        'steps[0].},{ is not allowed by schema.',
+      );
+    }
+  });
+});
+
 describe('validateValueAgainstTypeSpec – enum branch', () => {
   const enumSchema: CliTypeSpec = {
     type: 'string',
apps/cli/src/lib/operation-args.ts

Lines changed: 39 additions & 3 deletions
@@ -115,6 +115,37 @@ function extractConstValues(variants: CliTypeSpec[]): string[] {
   return values;
 }

+function isNestedValidationMessage(path: string, message: string): boolean {
+  return message.startsWith(`${path}.`) || message.startsWith(`${path}[`);
+}
+
+function selectRepeatedActionableOneOfError(path: string, errors: string[]): string | null {
+  const counts = new Map<string, number>();
+  for (const error of errors) {
+    counts.set(error, (counts.get(error) ?? 0) + 1);
+  }
+
+  let bestMessage: string | null = null;
+  let bestScore = 0;
+
+  for (const [message, count] of counts.entries()) {
+    if (count < 2) continue;
+
+    const nested = isNestedValidationMessage(path, message);
+    const isShapeError = message.includes(' is not allowed by schema.') || message.includes(' is required.');
+
+    if (!nested && !isShapeError) continue;
+
+    const score = count * 10 + (nested ? 2 : 0) + (isShapeError ? 1 : 0);
+    if (score > bestScore) {
+      bestScore = score;
+      bestMessage = message;
+    }
+  }
+
+  return bestMessage;
+}
+
 export function validateValueAgainstTypeSpec(value: unknown, schema: CliTypeSpec, path: string): void {
   if ('const' in schema) {
     if (value !== schema.const) {
@@ -136,11 +167,12 @@ export function validateValueAgainstTypeSpec(value: unknown, schema: CliTypeSpec
   }

   const allowedValues = extractConstValues(variants);
+  const selectedError = selectRepeatedActionableOneOfError(path, errors);
   const message =
     allowedValues.length > 0
       ? `${path} must be one of: ${allowedValues.join(', ')}.`
-      : `${path} must match one of the allowed schema variants.`;
-  throw new CliError('VALIDATION_ERROR', message, { errors });
+      : (selectedError ?? `${path} must match one of the allowed schema variants.`);
+  throw new CliError('VALIDATION_ERROR', message, { errors, selectedError });
 }

   if (schema.type === 'json') return;
@@ -236,7 +268,11 @@ function validateResponseValueAgainstTypeSpec(value: unknown, schema: CliTypeSpec
       errors.push(error instanceof Error ? error.message : String(error));
     }
   }
-  throw new CliError('VALIDATION_ERROR', `${path} must match one of the allowed schema variants.`, { errors });
+  const selectedError = selectRepeatedActionableOneOfError(path, errors);
+  throw new CliError('VALIDATION_ERROR', selectedError ?? `${path} must match one of the allowed schema variants.`, {
+    errors,
+    selectedError,
+  });
 }

   if (schema.type === 'json') return;
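To make the selection heuristic above concrete, a hedged sketch with hypothetical inputs:

```typescript
// Hypothetical inputs: both oneOf branches reject the same unknown key,
// so the repeated nested shape error outscores the generic variant message.
const errors = [
  "steps[0].},{ is not allowed by schema.", // from the text.rewrite branch
  "steps[0].},{ is not allowed by schema.", // from the text.insert branch
];
const picked = selectRepeatedActionableOneOfError('steps[0]', errors);
// picked === "steps[0].},{ is not allowed by schema."
// (count 2 → score 20, +2 nested, +1 shape error = 23; a message seen only once is never picked)
```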
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
---
title: Best practices
sidebarTitle: Best practices
description: Get better results from LLM document editing — prompting, tool call patterns, and workflow tips
keywords: "llm best practices, ai document editing, prompt engineering, superdoc tools, tool calling, document automation"
---

These patterns help your LLM agent produce reliable, efficient document edits.

## Use the bundled system prompt

`getSystemPrompt()` returns a tested prompt that teaches the model how to use SuperDoc tools — targeting, workflow order, and multi-action tools. Load it once and pass it as the system message.

```typescript
import { getSystemPrompt } from '@superdoc-dev/sdk';

const systemPrompt = await getSystemPrompt();
// Pass as the system message in your LLM call
```

You can extend it with task-specific instructions. Append your own rules after the bundled prompt:

```typescript
const systemPrompt = await getSystemPrompt();
const fullPrompt = `${systemPrompt}\n\n## Additional rules\n- Use tracked changes for all edits.\n- Always search before editing.`;
```

Or start from scratch with something like this:

````markdown
You edit `.docx` files using SuperDoc intent tools. Be efficient and minimize tool calls.

## Workflow

1. **Read** — Use `superdoc_get_content` to understand the document.
2. **Search** — Use `superdoc_search` to find stable handles or block addresses.
3. **Edit** — Use the focused tool that matches the job:
   - `superdoc_edit` for insert, replace, delete, undo, redo
   - `superdoc_format` for inline or paragraph formatting
   - `superdoc_create` for paragraphs and headings
   - `superdoc_comment` for comment threads
   - `superdoc_track_changes` for review decisions
4. **Batch only when useful** — Use `superdoc_mutations` for preview/apply or atomic multi-step edits.

## Rules

- Search before mutating so targets come from fresh results.
- Use focused intent tools for normal edits.
- Use `superdoc_mutations` when you need an atomic batch or preview/apply flow.
- Set `changeMode: "tracked"` when edits need human review.
- Feed tool errors back so you can recover.
````

## Read first, search, then edit

A typical edit takes 3-5 tool calls:

1. `superdoc_get_content` — understand what's in the document
2. `superdoc_search` — find the exact location (returns stable handles/addresses)
3. Edit tool (`superdoc_edit`, `superdoc_format`, etc.) — apply the change using targets from search

This matters because handles from search results point to the exact right location. If the model guesses a block address instead of searching for it, edits land in the wrong place.
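A minimal sketch of this three-call flow, assuming the `dispatchSuperDocTool` dispatcher shown under "Feed errors back" below (the tool names are real; the argument and result field names here are illustrative, not the exact schema):

```typescript
// 1. Read: get an overview so we know what to search for
const overview = await dispatchSuperDocTool(doc, 'superdoc_get_content', {});

// 2. Search: get a stable handle for the target text
const hits = await dispatchSuperDocTool(doc, 'superdoc_search', {
  query: 'termination clause', // query field name is an assumption
});

// 3. Edit: use the fresh handle from search, never a guessed block address
await dispatchSuperDocTool(doc, 'superdoc_edit', {
  action: 'replace',
  target: hits.results[0].ref, // result shape is an assumption
  value: 'Either party may terminate this Agreement upon 30 days written notice.',
});
```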
## Minimize tool calls

Instruct the LLM to plan all edits before calling tools. A well-structured prompt like "Find the termination clause and rewrite it to allow 30-day notice" should take 3-5 calls, not 15.

Batch multiple changes only when atomic execution is genuinely helpful — use `superdoc_mutations` for that.
## Prefer markdown insert for multi-block creation

When you need to create multiple headings and paragraphs in one operation, use `superdoc_edit` with `type: "markdown"` instead of calling `superdoc_create` once per block. A single markdown insert produces the entire structure in one call.

```json
{
  "action": "insert",
  "type": "markdown",
  "value": "## Executive Summary\n\nThis agreement governs the terms of service.\n\n## Key Provisions\n\nThe following provisions apply to all parties."
}
```

After inserting, apply formatting in a single `superdoc_mutations` batch using `format.apply` steps — one step per block or range. This reduces a workflow that might otherwise take 40+ calls down to 4: read, search, insert, format.
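A hedged sketch of that follow-up batch. The `id`/`op` pair and the `format.apply` step name come from the mutations schema; the target placeholders and exact `inline` field names are assumptions:

```typescript
// Refs would come from a superdoc_search call after the markdown insert.
const [summaryRef, provisionsRef] = ['<ref:executive-summary>', '<ref:key-provisions>']; // placeholders

await dispatchSuperDocTool(doc, 'superdoc_mutations', {
  steps: [
    // One format.apply step per inserted heading, applied atomically as a batch
    { id: 's1', op: 'format.apply', target: summaryRef, inline: { bold: true, fontSize: 14 } },
    { id: 's2', op: 'format.apply', target: provisionsRef, inline: { bold: true, fontSize: 14 } },
  ],
});
```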
## Use focused tools — `superdoc_mutations` is an escape hatch

For straightforward edits, use the focused intent tools (`superdoc_edit`, `superdoc_format`, `superdoc_create`, `superdoc_list`, `superdoc_comment`). They validate arguments, give clear errors, and are easier for models to call correctly.

Reach for `superdoc_mutations` only when you need:

- Preview/apply semantics (show what will change before committing)
- Atomic multi-step edits (all-or-nothing batch)
- A workflow that would otherwise require refreshing targets between steps
## Feed errors back

`dispatchSuperDocTool` returns structured errors. Pass them back as tool results — most models self-correct on the next turn.

```typescript
try {
  const result = await dispatchSuperDocTool(doc, toolCall.function.name, JSON.parse(toolCall.function.arguments));
  messages.push({ role: 'tool', tool_call_id: toolCall.id, content: JSON.stringify(result) });
} catch (err: any) {
  // Return the error as a tool result — the model will see it and adjust
  messages.push({ role: 'tool', tool_call_id: toolCall.id, content: JSON.stringify({ error: err.message }) });
}
```
## Choose formatting values from the document

Don't hardcode formatting values. Read them from the document's existing content and match what's already there.

**Body text:** Read `fontFamily`, `fontSize`, and `color` from non-empty paragraphs with `alignment: "justify"` or `alignment: "left"`. Set `bold: false` for body paragraphs.

Many DOCX documents report `underline: true` on all blocks due to style inheritance. This is a DOCX artifact — not intentional formatting. Do not carry it forward when inserting new paragraphs.

**Headings:** Read from existing heading blocks in the document. Scale `fontSize` up relative to body text. Headings are typically bold and sometimes centered — confirm against what's already in the document rather than assuming.

```typescript
// Get content first, find a representative body paragraph
const content = await superdoc.getContent();
const bodyParagraph = content.blocks.find(
  (b) => b.type === 'paragraph' && (b.text?.trim().length ?? 0) > 0,
);
const { fontFamily, fontSize, color } = bodyParagraph?.formatting ?? {};

// Use those values when formatting inserted content
```
## Add examples for repeatable workflows

If the same kind of edit runs across many documents (e.g., always rewriting a specific clause, always adding a comment to a section), include a concrete tool call example in your system prompt, as in the sketch below. Models that see a working example of the exact tool invocation produce correct calls more reliably than models that only see the schema.
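A hypothetical few-shot appendix for a recurring clause-rewrite workflow; the embedded tool call is illustrative, not an exact schema:

```typescript
// Hypothetical few-shot block appended to the bundled prompt.
const fewShotExample = [
  '## Example',
  'Task: "Rewrite the termination clause to require 30-day notice."',
  'First call superdoc_search with {"query": "termination"}, then superdoc_edit with:',
  '{"action": "replace", "target": "<ref from search>", "value": "<new clause>", "changeMode": "tracked"}',
].join('\n');

const systemPrompt = `${await getSystemPrompt()}\n\n${fewShotExample}`;
```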
## Use tracked changes for review workflows

Add `changeMode: "tracked"` to edit tool calls, or instruct the model via the system prompt:

```
Use tracked changes for all edits so a human can review them.
```

This way every AI edit appears as a tracked change that users can accept or reject in SuperDoc or Microsoft Word.
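For example, a minimal tracked replace. `changeMode` is the documented flag; the other argument names are illustrative, and the target would come from a fresh `superdoc_search`:

```typescript
await dispatchSuperDocTool(doc, 'superdoc_edit', {
  action: 'replace',
  target: '<ref from superdoc_search>', // placeholder
  value: 'Either party may terminate this Agreement upon 30 days written notice.',
  changeMode: 'tracked', // the edit lands as a tracked change a human can accept or reject
});
```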
## Pin your model version

Use a specific model ID (e.g., `gpt-4.1` or `claude-sonnet-4-6`) rather than an alias like `gpt-4o`. Aliases can change behavior between releases and break working tool call patterns.
## Cache tools and prompts

Tools and the system prompt don't change between requests. Load them once at startup and reuse across all conversations.

```typescript
let cachedTools: any[] | null = null;
let cachedSystemPrompt: string | null = null;

async function ensureToolsLoaded() {
  if (!cachedTools) {
    const result = await chooseTools({ provider: 'openai' });
    cachedTools = result.tools;
  }
  if (!cachedSystemPrompt) {
    cachedSystemPrompt = await getSystemPrompt();
  }
  return { tools: cachedTools, systemPrompt: cachedSystemPrompt };
}
```
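Putting the last two tips together, a sketch assuming the OpenAI Node SDK and the `ensureToolsLoaded()` helper above:

```typescript
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const { tools, systemPrompt } = await ensureToolsLoaded();
const response = await client.chat.completions.create({
  model: 'gpt-4.1', // a pinned, specific model ID rather than a floating alias
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: 'Make the document title bold.' },
  ],
  tools,
});
```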
## Prompt examples

These prompts have been tested against the SuperDoc tool set. Use them as inspiration for your own workflows, or include them as few-shot examples in your system prompt.

### Document review

- "Find the termination clause and rewrite it to require 30-day written notice. Use tracked changes."
- "Apply yellow highlight to every sentence that contains an indemnification obligation."
- "Replace all references to 'Contractor' with 'Service Provider' and make each replacement italic with tracked changes enabled."
- "Underline every sentence that references payment terms or late fees."
- "Insert CONFIDENTIAL — DO NOT DISTRIBUTE at the very top of the document and make it bold, red, 14pt."
- "Scan the document for inconsistent capitalization of defined terms and fix them with tracked changes enabled."

### Formatting and structure

- "Format the entire document in Times New Roman, 12-point."
- "Make all Heading 2 paragraphs bold and set them to 14-point font."
- "Keep each section heading with the paragraph that follows it so they don't split across pages."
- "Remove all extra blank paragraphs and convert all double spaces after periods to single spaces."
- "Right-align all section headings."

### Content generation and editing

- "Add a new heading 'Learning Objectives' at the top, followed by a bullet list with 3 key takeaways from the document content."
- "Read the document and add a heading 'Executive Summary' at the end, followed by a one-paragraph summary and a bullet list of the 5 key provisions."
- "Find the governing law section and insert a new paragraph after it: 'Any disputes arising under this Agreement shall be resolved through binding arbitration.'"
- "Find all paragraphs that mention 'personally identifiable information' and add a comment: 'Verify PII handling complies with current data retention policy.'"
- "Convert the list of references at the end into a numbered list and restart numbering at 1."

### Search and replace

- "Rewrite all dates in this document in the format January 1, 2026."
- "Replace every occurrence of 'FY2024' with 'FY2025' throughout the document."
- "Add the § symbol before every section number reference."

## Related

- [LLM tools](/ai/agents/llm-tools) — tool catalog and SDK functions
- [How to use](/ai/agents/integrations) — step-by-step integration guide
- [Debugging](/ai/agents/debugging) — troubleshoot tool call failures
- [Document API](/document-api/overview) — the operation set behind the tools