Make PR-AF reviews merge-gate aware by santoshkumarradha · Pull Request #44 · Agent-Field/pr-af

santoshkumarradha · 2026-06-09T18:41:53Z

Summary

Reviews now decide REQUEST_CHANGES from an explicit merge-gate verdict instead of severity labels alone, so advisory findings can still be posted without blocking a merge.
Inline comments and the review summary now show must-fix vs advisory status before GitHub posting, including safe-to-merge messaging when findings are non-blocking.
Comment polish runs through the existing .ai() configuration before posting, and stale benchmark output artifacts are removed from the repository.

Notes

Keeps the existing HITL flow intact; the merge gate runs after synthesis and before HITL/output.
Avoids adding public model/provider guidance to docs so deployment choices can evolve outside the README.

Validation

uv run pytest -q
uv run ruff check .

Phase A - Quick Wins: - Raise confidence thresholds (critical/important: 0.5, suggestion: 0.7, nitpick: 1.0) - Fix challenged-findings leak: return confirmed_findings not all_findings - Refocus systemic lens via adaptive meta-prompt (functional risks only) - Disable coverage loop (max_coverage_iterations: 0) - Drop nitpicks (min_severity: suggestion), cap comments at 15 Phase B - Core Precision: - Add FindingRelevanceGate (.ai() classifier) after evidence verification - Expand evidence verification to ALL findings (not just critical/important) - Remove adversary [:20] cap, dynamic batching covers all findings - Adversary batch_size: 10, max_batches: 0 (dynamic) Phase C - Dedup & Calibration: - Add batch_semantic_dedup (.harness()) for >8 findings - Add OutputCalibrationGate (.ai()) final precision filter Infra: - Consolidate model config to HARNESS_MODEL env var as single source - Add entrypoint.sh for runtime opencode config generation - Default webhook dry_run to true for safety - Make AGENTFIELD_SERVER and AGENT_CALLBACK_URL overridable via .env

Streaming Meta→Review Pipeline (Change 1): - Add _run_meta_selectors_streaming: emits dimensions to queue as each lens completes instead of blocking until all 3 finish - Add _run_streaming_review: consumes dims from queue and launches reviewers immediately as they arrive - Incremental cross-meta dedup with lock protection - Phases 3+4 now overlap — reviews start when first lens finishes - ~30% speedup on large repos (12 min → ~8 min) Archei-Compliant Data Flows (Change 3): - Add _format_findings_for_llm(): formats findings as natural language with [F1], [F2] reference keys instead of JSON dumps - Add _build_reference_key_map(): maps reference keys back to titles - Evidence verifier now receives natural language (archei: context for LLM → string) instead of model_dump() JSON - Adversary phase uses narrative format with reference keys - Compound finder uses narrative format - Add reference_key field to _VerifiedFinding and AdversaryResult schemas - Orchestrator handles both reference_key and title matching (backward compat) - OutputCalibrationGate: .ai() for ≤5 findings, .harness() fallback for >5 DAG Visibility: - Add @router.reasoner() to finding_relevance_gate, output_calibration_gate, batch_semantic_dedup so they appear in AgentField UI graph - Each reasoner wraps one logical intelligence unit (.ai() or .harness())

… "pr-af" Allows running multiple PR-AF nodes (e.g. pr-af, pr-af-v2) on the same control plane. The webhook handler now routes to the correct node based on the PR_AF environment variable.

Replace single serial evidence_verifier .harness() call with parallel per-finding verify_single_finding reasoners. Each finding gets its own .harness() that browses the repo independently to verify claims. - Add verify_single_finding @router.reasoner(): verifies one finding against actual source code, returns flat _VerifiedFinding schema - Orchestrator launches asyncio.gather(*[verify_one(f) for f in findings]) - Each verification shows as a child node in AgentField UI DAG - ~3x speedup on verification phase (10 min serial → ~3 min parallel) - No quality change: same verification logic, same schema, same result

Prompts: - Shorten review_dimension prompt from 90 to ~35 lines of adaptive guidelines - Shorten meta_semantic/mechanical prompts from checklists to mission + examples - Shorten adversary prompt — keep ideas as principles not rigid steps - All original ideas preserved, reframed as "adapt to this PR" not "check all of these" Cleanup: - Remove unused plan param from _run_review_layer - Cache _build_file_patches() (called 6x per run, never changes) - Consolidate _extract_findings + _extract_compound_findings into one method - Rename _run_parallel_review → _run_gap_review (only used by coverage loop) - Remove unused evidence_verifier import (replaced by verify_single_finding)

Add provider and harness_model parameters to the review() reasoner. Allows switching between opencode/claude-code and different models per API call without restarting the container. Defaults to env vars.

- Fix .ai() model mapping: claude-code short names (sonnet, haiku, opus) now correctly map to anthropic/claude-sonnet-4-6 for OpenRouter or claude-sonnet-4-6 for direct Anthropic API - Add ANTHROPIC_API_KEY support in docker-compose - Rename env var PR_AF → PR_AF_NODE_ID (backward compat kept) - Make HARNESS_PROVIDER overridable via .env - Update README: add Model & Provider Selection section with examples, add Environment Variables reference table - Update ARCHITECTURE.md: reflect streaming meta→review pipeline, parallel evidence verification, runtime config section

…mapping Remove custom model ID mapping. AgentField handles provider routing. Users pass full model IDs (e.g., openrouter/moonshotai/kimi-k2.5, claude-sonnet-4-6). README updated with correct examples.

Replace global app.harness_config mutation with Python contextvars. Each concurrent review gets its own context — no shared state. - Add runtime_config.py: set_runtime_overrides(), get_harness_kwargs(), get_ai_kwargs() using contextvars (standard Python async pattern) - review() sets overrides via set_runtime_overrides() - All 14 .harness() calls spread **get_harness_kwargs() - All 4 .ai() calls spread **get_ai_kwargs() - When no override is set, kwargs are empty → SDK uses defaults - Multiple concurrent reviews with different models just work

harness_model is for the harness provider SDK (e.g., claude-sonnet-4-6). ai_model is for .ai() calls via litellm (e.g., anthropic/claude-sonnet-4-6). When ai_model is not set, .ai() uses the container's default from env vars.

- Install claude-agent-sdk for claude-code provider support - Document that ai_model uses litellm format (openrouter/provider/model) while harness_model uses the provider's native format - Add ai_model to API parameter table in README

The v3 precision improvements (Phase A/B/C) dropped recall from 67% to 22% on the benchmark. The precision gates were too aggressive, filtering real bugs along with noise. Reverted: - Confidence thresholds back to original (0.2/0.3/0.4/0.4) - min_severity back to "nitpick", max_comments back to 25 - max_coverage_iterations back to 2 - Removed relevance gate (was dropping real findings) - Removed semantic dedup and output calibration gate from _synthesize Kept (architectural improvements that don't affect recall): - Streaming meta→review pipeline - Parallel per-finding evidence verification - Archei-compliant data flows (narrative + reference keys) - Challenged-findings leak fix (real bug fix) - Runtime provider/model via contextvars - Adaptive prompts - All code cleanup (consolidated extractors, cached patches, etc.)

The shortened "adaptive" prompts removed useful signal that weaker models need. Verbose prompts don't hurt stronger models (they ignore irrelevant hints) but they prevent false negatives for weaker models. Restored verbose prompts for: meta_semantic, meta_mechanical, review_dimension, evidence_verifier, adversary_phase. Kept adaptive: meta_systemic (intentionally different from main). Kept all architecture: get_harness_kwargs(), archei narrative flows, parallel verification, streaming pipeline, runtime config.

…vider

…c/ prefix

…refix

…s default

…ng, early exits Three-tier model routing infrastructure: - PhaseProviderConfig + ModelTierConfig in config.py for per-phase provider+model - get_harness_kwargs_for(phase) in runtime_config.py with priority chain: user global override > explicit phase_config > tier-based routing > env defaults - resolve_all_phase_routing() maps complexity to budget/mid/premium tiers - Contextual upgrades: AI-generated PRs and security signals → premium tier - API accepts phase_config and tier_config for full runtime control Dynamic .ai() skip gates: - AnatomySkipGate: skip semantic analysis for trivial PRs (~25% skip rate) - LensSkipGate: skip irrelevant meta-selector lenses (~33% skip rate) Programmatic early exits (no LLM cost): - Zero dimensions → APPROVE immediately - Zero findings → skip verification/adversary/compound - Low-severity high-confidence findings → skip verification - All low-severity → skip adversary entirely - Already-falsified findings → exclude from adversary batches - Compound threshold raised to < 3 findings - Coverage loop pre-check: all clusters covered → skip LLM gate - 80% budget/timeout soft cap → skip optional phases Activate dead-code gates: - finding_relevance_gate as pre-filter after each reviewer - batch_semantic_dedup in synthesis after exact dedup - output_calibration_gate in synthesis after scoring Default model config: Kimi k2.5 (premium), Gemini Flash (gates/scouts)

…gap finder Four improvements to the review pipeline targeting recall and precision: 1. Research Brief (Phase 2.5): .harness() deep-reads entire diff before scout fragmentation, identifies danger zones and investigation directives that flow to all downstream phases. 2. Cross-Cluster Code Patches: Scouts receive actual diff hunks from files in OTHER clusters that have dependency relationships, not just narrative edge descriptions. Computed programmatically from dependency graph. 3. Adversarial Gap Finder: .harness() runs parallel with adversary, receives research brief danger zones + existing findings, hunts for missed bugs. 4. Research Brief as Coverage Requirement: Danger zones flow to strategist (ensures dimensions cover them) and coverage gate (flags unaddressed zones). Also fixes: - Evidence verification semaphore: caps concurrent harness calls at max_concurrent_reviewers to prevent container OOM with claude-code provider. Benchmark results (TrueNAS PR #18291, 9-bug ground truth): - Sonnet v3r: 98% precision, 44% recall, F1 61% (up from F1 23% old arch) - Research brief produced 10 danger zones + 10 directives with Sonnet - Gap finder contributed 4 novel findings - minimax: research brief returned empty (model limitation), no recall change

For each inline comment about to be posted to GitHub, fire a parallel .ai() call that rewrites the body to be concise and developer-focused — imperative opening, concrete failure mode, code refs preserved. - new src/pr_af/polish.py (~50 lines): polish_comments(app, comments) runs asyncio.gather over per-comment .ai() rewrites. On any per-call failure the original body is kept. - CommentConfig.polish_enabled = True toggle. - One insert in orchestrator._generate_output right before review build.

…mary Production noise problem: every critical-severity finding triggered REQUEST_CHANGES, even when the finding was pedantic (wrong test mock signature, defensive-hardening suggestion, edge case unreachable from prod). Authors couldn't tell what actually blocked their PR vs. what was advisory commentary. Architecture - New orthogonal `blocking: bool` axis on ScoredFinding. Severity is the reviewer's "how bad" label; blocking is the release manager's "must fix before merging" verdict. A critical-severity finding can be advisory; an important-severity finding can block. - New `src/pr_af/merge_gate.py`: parallel `.ai()` pass over scored findings. Tight release-manager prompt — blocks only on build break, reachable security vulnerability, data loss/corruption, public-contract regression, or behavior regression. Everything else stays advisory. - `response_format="json"` forces clean output from reasoning models; on parse/HTTP failure, default to advisory (safer than over-blocking). - `determine_review_event` rewired to use `blocking`, not severity. Summary card redesign (was: severity-count one-liner that conflated "important" with "must-fix before merge") - Headline leads with the merge verdict (Safe to merge / Merge blocked). - Must-fix section is prominent and includes the gate's one-line reason per finding. - Advisory section is a compact collapsed table — visual noise gone. - Single "About this review" details block holds the gate explainer and pipeline internals (replaces two separate stats cards). - Footer compressed to one `<sub>` line with severity reference + cost + duration + branding. Inline comment redesign - GFM native alerts: `> [!CAUTION]` (red bar) for must-fix, `> [!NOTE]` (blue bar) for advisory. Visual signal before any text is read. - Evidence moved into a `<details>` block to compress dominant noise. - Severity badge demoted to a small subtitle line — the alert already signaled "act on this" vs "informational". Polish prompt updated to preserve callouts, details blocks, and the subtitle line verbatim. Validation - Unit tests (13/13 pass): _parse_verdict tolerance, determine_review_event mapping, classify_findings integration, safe-default-on-failure behavior. - Replay against benchmark/agentfield-254 findings (deepseek/deepseek-v4-pro via OpenRouter): 2 blocking (auth holes), 18 advisory including the 3 pedantic criticals correctly demoted (test mocks, attack chain depending on Finding 1). - Replay against benchmark/truenas-middleware-18291 findings: 3 blocking (data-loss regressions, behavior regression with no migration path), 7 advisory. - Gate cost ~$0.05/PR. Negligible. Iteration scaffolding (won't ship to CI but useful for prompt tuning) - scripts/replay_merge_gate.py: replay gate against any benchmark JSON using deepseek/deepseek-v4-pro directly via OpenRouter. - scripts/render_summary_preview.py: render the new summary card from any benchmark JSON with a synthetic gate verdict. - scripts/render_inline_preview.py: render a sample inline comment body for visual verification of the GFM alert rendering.

Two production-run regressions discovered during live PR-AF runs against the agentfield repo on docker-compose: 1. Reviewer LLMs sometimes emit severity in uppercase ("IMPORTANT") or as legacy aliases ("high"/"medium"/"low"). The downstream emoji lookup, by_severity counting, and severity_rank gate all use canonical lowercase keys, so unnormalized values silently disappeared from the summary breakdown ("0 imp" while the finding was clearly important). Add a normalization step in score_findings that maps both casing variants and common aliases to the canonical critical/important/ suggestion/nitpick rubric. 2. A previous failed run can leave a non-empty target_dir without `.git`. The clone path then crashes with "destination already exists and is not an empty directory" instead of retrying. Clear the stale directory when present and re-clone.

santoshkumarradha added 26 commits March 17, 2026 09:47

fix: use dynamic NODE_ID in webhook _fire_review instead of hardcoded…

813b1c2

… "pr-af" Allows running multiple PR-AF nodes (e.g. pr-af, pr-af-v2) on the same control plane. The webhook handler now routes to the correct node based on the PR_AF environment variable.

feat: runtime provider/model override via review API

bdb97b6

Add provider and harness_model parameters to the review() reasoner. Allows switching between opencode/claude-code and different models per API call without restarting the container. Defaults to env vars.

fix: simplify model override — pass through to AgentField, no custom …

dcfbbf2

…mapping Remove custom model ID mapping. AgentField handles provider routing. Users pass full model IDs (e.g., openrouter/moonshotai/kimi-k2.5, claude-sonnet-4-6). README updated with correct examples.

feat: pass CLAUDE_CODE_OAUTH_TOKEN to harness env for claude-code pro…

a37c43a

…vider

fix: inject ANTHROPIC_API_KEY for litellm when ai_model uses anthropi…

f5e10b1

…c/ prefix

fix: remove api_base override — litellm auto-detects for anthropic/ p…

4baee68

…refix

feat: install Claude Code CLI for claude-code harness provider, set a…

7a13a4f

…s default

chore: compress README images for faster loading

e0cac3a

chore: clean merge gate PR artifacts

955722a

Merge origin/main into severity gating branch

a88ddce

chore: trim stale gate artifacts

af1b7f2

AbirAbbas merged commit 791674d into main Jun 10, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make PR-AF reviews merge-gate aware#44

Make PR-AF reviews merge-gate aware#44
AbirAbbas merged 26 commits into
mainfrom
feat/severity-gating-blocking

santoshkumarradha commented Jun 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

santoshkumarradha commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

santoshkumarradha commented Jun 9, 2026 •

edited

Loading