Make PR-AF reviews merge-gate aware#44
Merged
Merged
Conversation
Phase A - Quick Wins: - Raise confidence thresholds (critical/important: 0.5, suggestion: 0.7, nitpick: 1.0) - Fix challenged-findings leak: return confirmed_findings not all_findings - Refocus systemic lens via adaptive meta-prompt (functional risks only) - Disable coverage loop (max_coverage_iterations: 0) - Drop nitpicks (min_severity: suggestion), cap comments at 15 Phase B - Core Precision: - Add FindingRelevanceGate (.ai() classifier) after evidence verification - Expand evidence verification to ALL findings (not just critical/important) - Remove adversary [:20] cap, dynamic batching covers all findings - Adversary batch_size: 10, max_batches: 0 (dynamic) Phase C - Dedup & Calibration: - Add batch_semantic_dedup (.harness()) for >8 findings - Add OutputCalibrationGate (.ai()) final precision filter Infra: - Consolidate model config to HARNESS_MODEL env var as single source - Add entrypoint.sh for runtime opencode config generation - Default webhook dry_run to true for safety - Make AGENTFIELD_SERVER and AGENT_CALLBACK_URL overridable via .env
Streaming Meta→Review Pipeline (Change 1): - Add _run_meta_selectors_streaming: emits dimensions to queue as each lens completes instead of blocking until all 3 finish - Add _run_streaming_review: consumes dims from queue and launches reviewers immediately as they arrive - Incremental cross-meta dedup with lock protection - Phases 3+4 now overlap — reviews start when first lens finishes - ~30% speedup on large repos (12 min → ~8 min) Archei-Compliant Data Flows (Change 3): - Add _format_findings_for_llm(): formats findings as natural language with [F1], [F2] reference keys instead of JSON dumps - Add _build_reference_key_map(): maps reference keys back to titles - Evidence verifier now receives natural language (archei: context for LLM → string) instead of model_dump() JSON - Adversary phase uses narrative format with reference keys - Compound finder uses narrative format - Add reference_key field to _VerifiedFinding and AdversaryResult schemas - Orchestrator handles both reference_key and title matching (backward compat) - OutputCalibrationGate: .ai() for ≤5 findings, .harness() fallback for >5 DAG Visibility: - Add @router.reasoner() to finding_relevance_gate, output_calibration_gate, batch_semantic_dedup so they appear in AgentField UI graph - Each reasoner wraps one logical intelligence unit (.ai() or .harness())
… "pr-af" Allows running multiple PR-AF nodes (e.g. pr-af, pr-af-v2) on the same control plane. The webhook handler now routes to the correct node based on the PR_AF environment variable.
Replace single serial evidence_verifier .harness() call with parallel per-finding verify_single_finding reasoners. Each finding gets its own .harness() that browses the repo independently to verify claims. - Add verify_single_finding @router.reasoner(): verifies one finding against actual source code, returns flat _VerifiedFinding schema - Orchestrator launches asyncio.gather(*[verify_one(f) for f in findings]) - Each verification shows as a child node in AgentField UI DAG - ~3x speedup on verification phase (10 min serial → ~3 min parallel) - No quality change: same verification logic, same schema, same result
Prompts: - Shorten review_dimension prompt from 90 to ~35 lines of adaptive guidelines - Shorten meta_semantic/mechanical prompts from checklists to mission + examples - Shorten adversary prompt — keep ideas as principles not rigid steps - All original ideas preserved, reframed as "adapt to this PR" not "check all of these" Cleanup: - Remove unused plan param from _run_review_layer - Cache _build_file_patches() (called 6x per run, never changes) - Consolidate _extract_findings + _extract_compound_findings into one method - Rename _run_parallel_review → _run_gap_review (only used by coverage loop) - Remove unused evidence_verifier import (replaced by verify_single_finding)
Add provider and harness_model parameters to the review() reasoner. Allows switching between opencode/claude-code and different models per API call without restarting the container. Defaults to env vars.
- Fix .ai() model mapping: claude-code short names (sonnet, haiku, opus) now correctly map to anthropic/claude-sonnet-4-6 for OpenRouter or claude-sonnet-4-6 for direct Anthropic API - Add ANTHROPIC_API_KEY support in docker-compose - Rename env var PR_AF → PR_AF_NODE_ID (backward compat kept) - Make HARNESS_PROVIDER overridable via .env - Update README: add Model & Provider Selection section with examples, add Environment Variables reference table - Update ARCHITECTURE.md: reflect streaming meta→review pipeline, parallel evidence verification, runtime config section
…mapping Remove custom model ID mapping. AgentField handles provider routing. Users pass full model IDs (e.g., openrouter/moonshotai/kimi-k2.5, claude-sonnet-4-6). README updated with correct examples.
Replace global app.harness_config mutation with Python contextvars. Each concurrent review gets its own context — no shared state. - Add runtime_config.py: set_runtime_overrides(), get_harness_kwargs(), get_ai_kwargs() using contextvars (standard Python async pattern) - review() sets overrides via set_runtime_overrides() - All 14 .harness() calls spread **get_harness_kwargs() - All 4 .ai() calls spread **get_ai_kwargs() - When no override is set, kwargs are empty → SDK uses defaults - Multiple concurrent reviews with different models just work
harness_model is for the harness provider SDK (e.g., claude-sonnet-4-6). ai_model is for .ai() calls via litellm (e.g., anthropic/claude-sonnet-4-6). When ai_model is not set, .ai() uses the container's default from env vars.
- Install claude-agent-sdk for claude-code provider support - Document that ai_model uses litellm format (openrouter/provider/model) while harness_model uses the provider's native format - Add ai_model to API parameter table in README
The v3 precision improvements (Phase A/B/C) dropped recall from 67% to 22% on the benchmark. The precision gates were too aggressive, filtering real bugs along with noise. Reverted: - Confidence thresholds back to original (0.2/0.3/0.4/0.4) - min_severity back to "nitpick", max_comments back to 25 - max_coverage_iterations back to 2 - Removed relevance gate (was dropping real findings) - Removed semantic dedup and output calibration gate from _synthesize Kept (architectural improvements that don't affect recall): - Streaming meta→review pipeline - Parallel per-finding evidence verification - Archei-compliant data flows (narrative + reference keys) - Challenged-findings leak fix (real bug fix) - Runtime provider/model via contextvars - Adaptive prompts - All code cleanup (consolidated extractors, cached patches, etc.)
The shortened "adaptive" prompts removed useful signal that weaker models need. Verbose prompts don't hurt stronger models (they ignore irrelevant hints) but they prevent false negatives for weaker models. Restored verbose prompts for: meta_semantic, meta_mechanical, review_dimension, evidence_verifier, adversary_phase. Kept adaptive: meta_systemic (intentionally different from main). Kept all architecture: get_harness_kwargs(), archei narrative flows, parallel verification, streaming pipeline, runtime config.
…ng, early exits Three-tier model routing infrastructure: - PhaseProviderConfig + ModelTierConfig in config.py for per-phase provider+model - get_harness_kwargs_for(phase) in runtime_config.py with priority chain: user global override > explicit phase_config > tier-based routing > env defaults - resolve_all_phase_routing() maps complexity to budget/mid/premium tiers - Contextual upgrades: AI-generated PRs and security signals → premium tier - API accepts phase_config and tier_config for full runtime control Dynamic .ai() skip gates: - AnatomySkipGate: skip semantic analysis for trivial PRs (~25% skip rate) - LensSkipGate: skip irrelevant meta-selector lenses (~33% skip rate) Programmatic early exits (no LLM cost): - Zero dimensions → APPROVE immediately - Zero findings → skip verification/adversary/compound - Low-severity high-confidence findings → skip verification - All low-severity → skip adversary entirely - Already-falsified findings → exclude from adversary batches - Compound threshold raised to < 3 findings - Coverage loop pre-check: all clusters covered → skip LLM gate - 80% budget/timeout soft cap → skip optional phases Activate dead-code gates: - finding_relevance_gate as pre-filter after each reviewer - batch_semantic_dedup in synthesis after exact dedup - output_calibration_gate in synthesis after scoring Default model config: Kimi k2.5 (premium), Gemini Flash (gates/scouts)
…gap finder Four improvements to the review pipeline targeting recall and precision: 1. Research Brief (Phase 2.5): .harness() deep-reads entire diff before scout fragmentation, identifies danger zones and investigation directives that flow to all downstream phases. 2. Cross-Cluster Code Patches: Scouts receive actual diff hunks from files in OTHER clusters that have dependency relationships, not just narrative edge descriptions. Computed programmatically from dependency graph. 3. Adversarial Gap Finder: .harness() runs parallel with adversary, receives research brief danger zones + existing findings, hunts for missed bugs. 4. Research Brief as Coverage Requirement: Danger zones flow to strategist (ensures dimensions cover them) and coverage gate (flags unaddressed zones). Also fixes: - Evidence verification semaphore: caps concurrent harness calls at max_concurrent_reviewers to prevent container OOM with claude-code provider. Benchmark results (TrueNAS PR #18291, 9-bug ground truth): - Sonnet v3r: 98% precision, 44% recall, F1 61% (up from F1 23% old arch) - Research brief produced 10 danger zones + 10 directives with Sonnet - Gap finder contributed 4 novel findings - minimax: research brief returned empty (model limitation), no recall change
For each inline comment about to be posted to GitHub, fire a parallel .ai() call that rewrites the body to be concise and developer-focused — imperative opening, concrete failure mode, code refs preserved. - new src/pr_af/polish.py (~50 lines): polish_comments(app, comments) runs asyncio.gather over per-comment .ai() rewrites. On any per-call failure the original body is kept. - CommentConfig.polish_enabled = True toggle. - One insert in orchestrator._generate_output right before review build.
…mary Production noise problem: every critical-severity finding triggered REQUEST_CHANGES, even when the finding was pedantic (wrong test mock signature, defensive-hardening suggestion, edge case unreachable from prod). Authors couldn't tell what actually blocked their PR vs. what was advisory commentary. Architecture - New orthogonal `blocking: bool` axis on ScoredFinding. Severity is the reviewer's "how bad" label; blocking is the release manager's "must fix before merging" verdict. A critical-severity finding can be advisory; an important-severity finding can block. - New `src/pr_af/merge_gate.py`: parallel `.ai()` pass over scored findings. Tight release-manager prompt — blocks only on build break, reachable security vulnerability, data loss/corruption, public-contract regression, or behavior regression. Everything else stays advisory. - `response_format="json"` forces clean output from reasoning models; on parse/HTTP failure, default to advisory (safer than over-blocking). - `determine_review_event` rewired to use `blocking`, not severity. Summary card redesign (was: severity-count one-liner that conflated "important" with "must-fix before merge") - Headline leads with the merge verdict (Safe to merge / Merge blocked). - Must-fix section is prominent and includes the gate's one-line reason per finding. - Advisory section is a compact collapsed table — visual noise gone. - Single "About this review" details block holds the gate explainer and pipeline internals (replaces two separate stats cards). - Footer compressed to one `<sub>` line with severity reference + cost + duration + branding. Inline comment redesign - GFM native alerts: `> [!CAUTION]` (red bar) for must-fix, `> [!NOTE]` (blue bar) for advisory. Visual signal before any text is read. - Evidence moved into a `<details>` block to compress dominant noise. - Severity badge demoted to a small subtitle line — the alert already signaled "act on this" vs "informational". Polish prompt updated to preserve callouts, details blocks, and the subtitle line verbatim. Validation - Unit tests (13/13 pass): _parse_verdict tolerance, determine_review_event mapping, classify_findings integration, safe-default-on-failure behavior. - Replay against benchmark/agentfield-254 findings (deepseek/deepseek-v4-pro via OpenRouter): 2 blocking (auth holes), 18 advisory including the 3 pedantic criticals correctly demoted (test mocks, attack chain depending on Finding 1). - Replay against benchmark/truenas-middleware-18291 findings: 3 blocking (data-loss regressions, behavior regression with no migration path), 7 advisory. - Gate cost ~$0.05/PR. Negligible. Iteration scaffolding (won't ship to CI but useful for prompt tuning) - scripts/replay_merge_gate.py: replay gate against any benchmark JSON using deepseek/deepseek-v4-pro directly via OpenRouter. - scripts/render_summary_preview.py: render the new summary card from any benchmark JSON with a synthetic gate verdict. - scripts/render_inline_preview.py: render a sample inline comment body for visual verification of the GFM alert rendering.
Two production-run regressions discovered during live PR-AF runs against
the agentfield repo on docker-compose:
1. Reviewer LLMs sometimes emit severity in uppercase ("IMPORTANT") or
as legacy aliases ("high"/"medium"/"low"). The downstream emoji
lookup, by_severity counting, and severity_rank gate all use canonical
lowercase keys, so unnormalized values silently disappeared from the
summary breakdown ("0 imp" while the finding was clearly important).
Add a normalization step in score_findings that maps both casing
variants and common aliases to the canonical critical/important/
suggestion/nitpick rubric.
2. A previous failed run can leave a non-empty target_dir without `.git`.
The clone path then crashes with "destination already exists and is
not an empty directory" instead of retrying. Clear the stale directory
when present and re-clone.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
REQUEST_CHANGESfrom an explicit merge-gate verdict instead of severity labels alone, so advisory findings can still be posted without blocking a merge..ai()configuration before posting, and stale benchmark output artifacts are removed from the repository.Notes
Validation
uv run pytest -quv run ruff check .