Skip to content

Make PR-AF reviews merge-gate aware#44

Merged
AbirAbbas merged 26 commits into
mainfrom
feat/severity-gating-blocking
Jun 10, 2026
Merged

Make PR-AF reviews merge-gate aware#44
AbirAbbas merged 26 commits into
mainfrom
feat/severity-gating-blocking

Conversation

@santoshkumarradha

@santoshkumarradha santoshkumarradha commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

  • Reviews now decide REQUEST_CHANGES from an explicit merge-gate verdict instead of severity labels alone, so advisory findings can still be posted without blocking a merge.
  • Inline comments and the review summary now show must-fix vs advisory status before GitHub posting, including safe-to-merge messaging when findings are non-blocking.
  • Comment polish runs through the existing .ai() configuration before posting, and stale benchmark output artifacts are removed from the repository.

Notes

  • Keeps the existing HITL flow intact; the merge gate runs after synthesis and before HITL/output.
  • Avoids adding public model/provider guidance to docs so deployment choices can evolve outside the README.

Validation

  • uv run pytest -q
  • uv run ruff check .

Phase A - Quick Wins:
- Raise confidence thresholds (critical/important: 0.5, suggestion: 0.7, nitpick: 1.0)
- Fix challenged-findings leak: return confirmed_findings not all_findings
- Refocus systemic lens via adaptive meta-prompt (functional risks only)
- Disable coverage loop (max_coverage_iterations: 0)
- Drop nitpicks (min_severity: suggestion), cap comments at 15

Phase B - Core Precision:
- Add FindingRelevanceGate (.ai() classifier) after evidence verification
- Expand evidence verification to ALL findings (not just critical/important)
- Remove adversary [:20] cap, dynamic batching covers all findings
- Adversary batch_size: 10, max_batches: 0 (dynamic)

Phase C - Dedup & Calibration:
- Add batch_semantic_dedup (.harness()) for >8 findings
- Add OutputCalibrationGate (.ai()) final precision filter

Infra:
- Consolidate model config to HARNESS_MODEL env var as single source
- Add entrypoint.sh for runtime opencode config generation
- Default webhook dry_run to true for safety
- Make AGENTFIELD_SERVER and AGENT_CALLBACK_URL overridable via .env
Streaming Meta→Review Pipeline (Change 1):
- Add _run_meta_selectors_streaming: emits dimensions to queue as each
  lens completes instead of blocking until all 3 finish
- Add _run_streaming_review: consumes dims from queue and launches
  reviewers immediately as they arrive
- Incremental cross-meta dedup with lock protection
- Phases 3+4 now overlap — reviews start when first lens finishes
- ~30% speedup on large repos (12 min → ~8 min)

Archei-Compliant Data Flows (Change 3):
- Add _format_findings_for_llm(): formats findings as natural language
  with [F1], [F2] reference keys instead of JSON dumps
- Add _build_reference_key_map(): maps reference keys back to titles
- Evidence verifier now receives natural language (archei: context for
  LLM → string) instead of model_dump() JSON
- Adversary phase uses narrative format with reference keys
- Compound finder uses narrative format
- Add reference_key field to _VerifiedFinding and AdversaryResult schemas
- Orchestrator handles both reference_key and title matching (backward compat)
- OutputCalibrationGate: .ai() for ≤5 findings, .harness() fallback for >5

DAG Visibility:
- Add @router.reasoner() to finding_relevance_gate, output_calibration_gate,
  batch_semantic_dedup so they appear in AgentField UI graph
- Each reasoner wraps one logical intelligence unit (.ai() or .harness())
… "pr-af"

Allows running multiple PR-AF nodes (e.g. pr-af, pr-af-v2) on the same
control plane. The webhook handler now routes to the correct node based
on the PR_AF environment variable.
Replace single serial evidence_verifier .harness() call with parallel
per-finding verify_single_finding reasoners. Each finding gets its own
.harness() that browses the repo independently to verify claims.

- Add verify_single_finding @router.reasoner(): verifies one finding
  against actual source code, returns flat _VerifiedFinding schema
- Orchestrator launches asyncio.gather(*[verify_one(f) for f in findings])
- Each verification shows as a child node in AgentField UI DAG
- ~3x speedup on verification phase (10 min serial → ~3 min parallel)
- No quality change: same verification logic, same schema, same result
Prompts:
- Shorten review_dimension prompt from 90 to ~35 lines of adaptive guidelines
- Shorten meta_semantic/mechanical prompts from checklists to mission + examples
- Shorten adversary prompt — keep ideas as principles not rigid steps
- All original ideas preserved, reframed as "adapt to this PR" not "check all of these"

Cleanup:
- Remove unused plan param from _run_review_layer
- Cache _build_file_patches() (called 6x per run, never changes)
- Consolidate _extract_findings + _extract_compound_findings into one method
- Rename _run_parallel_review → _run_gap_review (only used by coverage loop)
- Remove unused evidence_verifier import (replaced by verify_single_finding)
Add provider and harness_model parameters to the review() reasoner.
Allows switching between opencode/claude-code and different models
per API call without restarting the container. Defaults to env vars.
- Fix .ai() model mapping: claude-code short names (sonnet, haiku, opus)
  now correctly map to anthropic/claude-sonnet-4-6 for OpenRouter or
  claude-sonnet-4-6 for direct Anthropic API
- Add ANTHROPIC_API_KEY support in docker-compose
- Rename env var PR_AF → PR_AF_NODE_ID (backward compat kept)
- Make HARNESS_PROVIDER overridable via .env
- Update README: add Model & Provider Selection section with examples,
  add Environment Variables reference table
- Update ARCHITECTURE.md: reflect streaming meta→review pipeline,
  parallel evidence verification, runtime config section
…mapping

Remove custom model ID mapping. AgentField handles provider routing.
Users pass full model IDs (e.g., openrouter/moonshotai/kimi-k2.5,
claude-sonnet-4-6). README updated with correct examples.
Replace global app.harness_config mutation with Python contextvars.
Each concurrent review gets its own context — no shared state.

- Add runtime_config.py: set_runtime_overrides(), get_harness_kwargs(),
  get_ai_kwargs() using contextvars (standard Python async pattern)
- review() sets overrides via set_runtime_overrides()
- All 14 .harness() calls spread **get_harness_kwargs()
- All 4 .ai() calls spread **get_ai_kwargs()
- When no override is set, kwargs are empty → SDK uses defaults
- Multiple concurrent reviews with different models just work
harness_model is for the harness provider SDK (e.g., claude-sonnet-4-6).
ai_model is for .ai() calls via litellm (e.g., anthropic/claude-sonnet-4-6).
When ai_model is not set, .ai() uses the container's default from env vars.
- Install claude-agent-sdk for claude-code provider support
- Document that ai_model uses litellm format (openrouter/provider/model)
  while harness_model uses the provider's native format
- Add ai_model to API parameter table in README
The v3 precision improvements (Phase A/B/C) dropped recall from 67% to 22%
on the benchmark. The precision gates were too aggressive, filtering real
bugs along with noise.

Reverted:
- Confidence thresholds back to original (0.2/0.3/0.4/0.4)
- min_severity back to "nitpick", max_comments back to 25
- max_coverage_iterations back to 2
- Removed relevance gate (was dropping real findings)
- Removed semantic dedup and output calibration gate from _synthesize

Kept (architectural improvements that don't affect recall):
- Streaming meta→review pipeline
- Parallel per-finding evidence verification
- Archei-compliant data flows (narrative + reference keys)
- Challenged-findings leak fix (real bug fix)
- Runtime provider/model via contextvars
- Adaptive prompts
- All code cleanup (consolidated extractors, cached patches, etc.)
The shortened "adaptive" prompts removed useful signal that weaker models
need. Verbose prompts don't hurt stronger models (they ignore irrelevant
hints) but they prevent false negatives for weaker models.

Restored verbose prompts for: meta_semantic, meta_mechanical,
review_dimension, evidence_verifier, adversary_phase.

Kept adaptive: meta_systemic (intentionally different from main).
Kept all architecture: get_harness_kwargs(), archei narrative flows,
parallel verification, streaming pipeline, runtime config.
…ng, early exits

Three-tier model routing infrastructure:
- PhaseProviderConfig + ModelTierConfig in config.py for per-phase provider+model
- get_harness_kwargs_for(phase) in runtime_config.py with priority chain:
  user global override > explicit phase_config > tier-based routing > env defaults
- resolve_all_phase_routing() maps complexity to budget/mid/premium tiers
- Contextual upgrades: AI-generated PRs and security signals → premium tier
- API accepts phase_config and tier_config for full runtime control

Dynamic .ai() skip gates:
- AnatomySkipGate: skip semantic analysis for trivial PRs (~25% skip rate)
- LensSkipGate: skip irrelevant meta-selector lenses (~33% skip rate)

Programmatic early exits (no LLM cost):
- Zero dimensions → APPROVE immediately
- Zero findings → skip verification/adversary/compound
- Low-severity high-confidence findings → skip verification
- All low-severity → skip adversary entirely
- Already-falsified findings → exclude from adversary batches
- Compound threshold raised to < 3 findings
- Coverage loop pre-check: all clusters covered → skip LLM gate
- 80% budget/timeout soft cap → skip optional phases

Activate dead-code gates:
- finding_relevance_gate as pre-filter after each reviewer
- batch_semantic_dedup in synthesis after exact dedup
- output_calibration_gate in synthesis after scoring

Default model config: Kimi k2.5 (premium), Gemini Flash (gates/scouts)
…gap finder

Four improvements to the review pipeline targeting recall and precision:

1. Research Brief (Phase 2.5): .harness() deep-reads entire diff before
   scout fragmentation, identifies danger zones and investigation directives
   that flow to all downstream phases.

2. Cross-Cluster Code Patches: Scouts receive actual diff hunks from files
   in OTHER clusters that have dependency relationships, not just narrative
   edge descriptions. Computed programmatically from dependency graph.

3. Adversarial Gap Finder: .harness() runs parallel with adversary, receives
   research brief danger zones + existing findings, hunts for missed bugs.

4. Research Brief as Coverage Requirement: Danger zones flow to strategist
   (ensures dimensions cover them) and coverage gate (flags unaddressed zones).

Also fixes:
- Evidence verification semaphore: caps concurrent harness calls at
  max_concurrent_reviewers to prevent container OOM with claude-code provider.

Benchmark results (TrueNAS PR #18291, 9-bug ground truth):
- Sonnet v3r: 98% precision, 44% recall, F1 61% (up from F1 23% old arch)
- Research brief produced 10 danger zones + 10 directives with Sonnet
- Gap finder contributed 4 novel findings
- minimax: research brief returned empty (model limitation), no recall change
For each inline comment about to be posted to GitHub, fire a parallel
.ai() call that rewrites the body to be concise and developer-focused —
imperative opening, concrete failure mode, code refs preserved.

- new src/pr_af/polish.py (~50 lines): polish_comments(app, comments)
  runs asyncio.gather over per-comment .ai() rewrites. On any per-call
  failure the original body is kept.
- CommentConfig.polish_enabled = True toggle.
- One insert in orchestrator._generate_output right before review build.
…mary

Production noise problem: every critical-severity finding triggered
REQUEST_CHANGES, even when the finding was pedantic (wrong test mock
signature, defensive-hardening suggestion, edge case unreachable from
prod). Authors couldn't tell what actually blocked their PR vs. what
was advisory commentary.

Architecture
- New orthogonal `blocking: bool` axis on ScoredFinding. Severity is
  the reviewer's "how bad" label; blocking is the release manager's
  "must fix before merging" verdict. A critical-severity finding can
  be advisory; an important-severity finding can block.
- New `src/pr_af/merge_gate.py`: parallel `.ai()` pass over scored
  findings. Tight release-manager prompt — blocks only on build
  break, reachable security vulnerability, data loss/corruption,
  public-contract regression, or behavior regression. Everything
  else stays advisory.
- `response_format="json"` forces clean output from reasoning models;
  on parse/HTTP failure, default to advisory (safer than over-blocking).
- `determine_review_event` rewired to use `blocking`, not severity.

Summary card redesign (was: severity-count one-liner that conflated
"important" with "must-fix before merge")
- Headline leads with the merge verdict (Safe to merge / Merge blocked).
- Must-fix section is prominent and includes the gate's one-line
  reason per finding.
- Advisory section is a compact collapsed table — visual noise gone.
- Single "About this review" details block holds the gate explainer
  and pipeline internals (replaces two separate stats cards).
- Footer compressed to one `<sub>` line with severity reference +
  cost + duration + branding.

Inline comment redesign
- GFM native alerts: `> [!CAUTION]` (red bar) for must-fix,
  `> [!NOTE]` (blue bar) for advisory. Visual signal before any
  text is read.
- Evidence moved into a `<details>` block to compress dominant noise.
- Severity badge demoted to a small subtitle line — the alert already
  signaled "act on this" vs "informational".

Polish prompt updated to preserve callouts, details blocks, and the
subtitle line verbatim.

Validation
- Unit tests (13/13 pass): _parse_verdict tolerance,
  determine_review_event mapping, classify_findings integration,
  safe-default-on-failure behavior.
- Replay against benchmark/agentfield-254 findings (deepseek/deepseek-v4-pro
  via OpenRouter): 2 blocking (auth holes), 18 advisory including the
  3 pedantic criticals correctly demoted (test mocks, attack chain
  depending on Finding 1).
- Replay against benchmark/truenas-middleware-18291 findings: 3
  blocking (data-loss regressions, behavior regression with no
  migration path), 7 advisory.
- Gate cost ~$0.05/PR. Negligible.

Iteration scaffolding (won't ship to CI but useful for prompt tuning)
- scripts/replay_merge_gate.py: replay gate against any benchmark JSON
  using deepseek/deepseek-v4-pro directly via OpenRouter.
- scripts/render_summary_preview.py: render the new summary card
  from any benchmark JSON with a synthetic gate verdict.
- scripts/render_inline_preview.py: render a sample inline comment
  body for visual verification of the GFM alert rendering.
Two production-run regressions discovered during live PR-AF runs against
the agentfield repo on docker-compose:

1. Reviewer LLMs sometimes emit severity in uppercase ("IMPORTANT") or
   as legacy aliases ("high"/"medium"/"low"). The downstream emoji
   lookup, by_severity counting, and severity_rank gate all use canonical
   lowercase keys, so unnormalized values silently disappeared from the
   summary breakdown ("0 imp" while the finding was clearly important).
   Add a normalization step in score_findings that maps both casing
   variants and common aliases to the canonical critical/important/
   suggestion/nitpick rubric.

2. A previous failed run can leave a non-empty target_dir without `.git`.
   The clone path then crashes with "destination already exists and is
   not an empty directory" instead of retrying. Clear the stale directory
   when present and re-clone.
@AbirAbbas AbirAbbas merged commit 791674d into main Jun 10, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants