feat(security-assessment): Phase 1b expansion + fp-reduction Stage 0/confidence (v2.2.0) by bdfinst · Pull Request #25 · bdfinst/agentic-dev-team

bdfinst · 2026-05-01T18:33:52Z

Summary

Expands Phase 1b of /security-assessment from 2 agents to 5, and adds Stage 0 devil's-advocate reasoning and an explicit confidence field to fp-reduction. Closes detection gaps surfaced by (a) competitive analysis against Anthropic's Claude Code Security and (b) a NextGen portfolio rerun that found 75 real findings the original SAST-only Phase 1 missed.

Changes

`fp-reduction` enhancements (commit 1)

Stage 0 devil's advocate — every disposition entry carries da_rationale (≥20 chars) + da_strong (bool) before Stages 1–5 run. Forces the agent to argue against the finding being real (framework protection, trusted caller, non-prod, rule pattern noise) and sharpens Stage 1 reachability into a hypothesis test.
Confidence field — every entry carries confidence: high | medium | low | null derived from verdict × exploitability score. Surfaced in exec-report Section 1 dashboard column and Section 2 detail blocks.
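The Stage 0 contract above (a `da_rationale` of at least 20 characters plus a boolean `da_strong` on every disposition entry) could be checked with something like the following. This is a minimal sketch, not the PR's implementation: the function name `stage0_errors`, the dict-shaped entry, and the choice to strip whitespace before counting characters are all assumptions made for illustration.

```python
from typing import Any

# The ">=20 chars" floor comes from the PR text; the constant name is hypothetical.
MIN_DA_RATIONALE_CHARS = 20


def stage0_errors(entry: dict[str, Any]) -> list[str]:
    """Return Stage 0 problems found in one disposition entry (empty list means OK)."""
    errors = []

    rationale = entry.get("da_rationale")
    # Assumption: surrounding whitespace does not count toward the length floor.
    if not isinstance(rationale, str) or len(rationale.strip()) < MIN_DA_RATIONALE_CHARS:
        errors.append("da_rationale must be a string of at least 20 characters")

    # da_strong must be a real boolean, not a truthy string such as "yes".
    if not isinstance(entry.get("da_strong"), bool):
        errors.append("da_strong must be a boolean")

    return errors
```

An entry that fails this check would be rejected before Stages 1–5 run, which matches the ordering described above.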

Three new Phase 1b judgment agents (commit 2)

| Agent | Approach | Catches |
| --- | --- | --- |
| `deep-code-reasoning` | Bottom-up at RECON entry points | IDOR, confused deputy, TOCTOU across services, indirect privilege escalation, workflow bypass |
| `authorization-logic-review` | Top-down: maps intended auth model, checks layered enforcement | Front-door-only auth, multi-tenancy filter inconsistency, role escalation via mutable fields |
| `recon-driven-scan` | RECON narrative → targeted grep | Inverted-boolean TLS defaults, RCE shapes via expression libs, header-driven SQL, body-trusted IDOR, masker exception PII fallback |

All three emit unified-finding-v1 directly; no adapter required. recon-driven-scan legitimately emits [] for repos with empty/generic RECON.
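The "RECON narrative → targeted grep" approach can be pictured roughly as below. This is an illustrative sketch only: the two patterns, their `rule_id` values, the keyword-matching rule, and the simplified finding dict are hypothetical and are not the agent's actual pattern library or the real unified-finding-v1 schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimPattern:
    claim_keywords: tuple[str, ...]  # all must appear in the RECON claim text
    code_regex: str                  # what to grep for in source lines
    rule_id: str                     # hypothetical rule namespace
    cwe: str


# Hypothetical two-entry library, for illustration only.
PATTERNS = (
    ClaimPattern(("sql", "like"), r"""LIKE\s+'%["']?\s*\+""", "recon.sql.like-concat", "CWE-89"),
    ClaimPattern(("tls",), r"ServerCertificateValidationCallback", "recon.tls.validation-override", "CWE-295"),
)


def scan(recon_claims: list[str], sources: dict[str, str]) -> list[dict]:
    """Map each RECON claim to matching file:line evidence in `sources` (path -> text)."""
    findings = []
    for claim in recon_claims:
        lowered = claim.lower()
        for pattern in PATTERNS:
            if not all(keyword in lowered for keyword in pattern.claim_keywords):
                continue  # this claim does not describe this kind of risk
            regex = re.compile(pattern.code_regex)
            for path, text in sources.items():
                for line_no, line in enumerate(text.splitlines(), start=1):
                    if regex.search(line):
                        findings.append({
                            "rule_id": pattern.rule_id,
                            "cwe": pattern.cwe,
                            "file": path,
                            "line": line_no,
                            "claim": claim,
                        })
    return findings
```

With no claims, or with claims too generic to match any pattern, `scan` returns an empty list, which mirrors the `[]` behaviour described above.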

Validation history

recon-driven-scan was validated against the 2026-05-01 NextGen portfolio rerun (12 repos previously scored zero-findings by Phase 1 SAST):

75 new findings (8 CRITICAL, 17 HIGH) with zero false alarms
Notable: 2 production SQL injections in `search-service` (LIKE concat); RCE shape via Flee + Dynamic LINQ in `profile-custompipes` running in-process inside `profile-service`; inverted-boolean TLS bypass library-amplified across all consumer Lambdas in `notificationinfrastructure`; `Jupiter2020$` cross-repo credential reuse expanded to 12+ repos
All 12 repos promoted out of `00-no-findings.md`

Wiring

`skills/security-assessment-pipeline/SKILL.md` — Phase 1b artifact list + agent count (2 → 5)
`commands/security-assessment.md` — 5-agent parallel dispatch + jq append step + parallelization rule
`agents/exec-report-generator.md` — agent→phase mapping + Confidence column/line
`CLAUDE.md` — agent registry (11 → 12)
`.claude-plugin/plugin.json` — version 2.1.0 → 2.2.0

Test plan

`/agent-audit` passes (structural compliance for the three new agents)
`/agent-eval` against existing security-review fixtures still passes
Manual: invoke `/security-assessment ` on a small target and verify `memory/{deep-reasoning,authz-review,recon-driven}-.json` are produced and append cleanly to findings JSONL
Verify exec-report-generator renders Confidence column in Section 1 and Confidence line in Section 2 detail blocks

🤖 Generated with Claude Code

…field to fp-reduction

Two enhancements to the FP-reduction rubric, surfaced from a competitive analysis against Anthropic's Claude Code Security:

- Stage 0 devil's advocate: every disposition entry now carries da_rationale (≥20 chars) and da_strong (bool) before Stages 1-5 run. The pre-pass forces the agent to argue the strongest case AGAINST the finding being real (framework protection, trusted caller, non-prod context, rule pattern noise). A strong DA argument sharpens Stage 1 reachability into a hypothesis test rather than open-ended search, and a true_positive that explicitly refutes a counter-argument is more trustworthy than one that never examined the counter-case.
- Confidence field: every disposition entry now carries a confidence band (high | medium | low | null) derived from the verdict × exploitability score table:

    true_positive + score 7-10 → high
    true_positive + score 0-6 → medium
    likely_true_positive → medium
    uncertain → low
    likely_false_positive / false_positive → null

  Confidence is consumed by exec-report-generator for the Section 1 dashboard column and Section 2 detail blocks. severity-floors.json documents the bands as informational metadata.
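The verdict × exploitability score mapping in this commit message translates directly into a small lookup. The sketch below assumes integer scores in the 0-10 range and a function name (`derive_confidence`) chosen for illustration. The fp-reduction agent itself is a prompt-driven rubric, so this shows the rule, not the project's code.

```python
from typing import Optional


def derive_confidence(verdict: str, score: Optional[int] = None) -> Optional[str]:
    """Return the confidence band for a disposition entry, or None for likely/false positives."""
    if verdict == "true_positive":
        # Only this verdict depends on the exploitability score.
        if score is None or not 0 <= score <= 10:
            raise ValueError("true_positive needs an exploitability score in 0-10")
        return "high" if score >= 7 else "medium"
    if verdict == "likely_true_positive":
        return "medium"
    if verdict == "uncertain":
        return "low"
    if verdict in ("likely_false_positive", "false_positive"):
        return None  # rendered as null in the disposition entry
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Raising on an unknown verdict, rather than defaulting to `None`, is a design choice of this sketch: a silent `null` would be indistinguishable from a genuine false-positive disposition.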

Phase 1b now dispatches 5 agents in parallel (was 2). Three new opus agents address gaps surfaced by competitive analysis and a NextGen portfolio rerun:

- deep-code-reasoning: RECON surface-scoped freeform vulnerability reasoning - bottom-up analysis at entry points / auth paths / data-flow boundaries for novel context-dependent issues that static rules cannot express (IDOR, confused deputy, TOCTOU across services, indirect privilege escalation, workflow bypass). Minimum evidence bar: ≥2 file:line citations per finding; only emits high/medium confidence (no low-confidence noise).
- authorization-logic-review: top-down authorization architecture review. Maps the intended access control model from route decorators / permission constants / middleware, then verifies consistent enforcement at controller / service / repository layers. Catches design-intent vs. implementation gaps (auth at front door but not at data-access layer, multi-tenancy filter inconsistency, role escalation via mutable fields).
- recon-driven-scan: bridges Phase 0 RECON narrative claims to concrete file:line evidence. Reads RECON's human-language risk descriptions ("inverted-boolean TLS bypass", "RCE shape via Flee", "header-driven SQL connection-string interpolation") and validates each described risk has matching code via targeted grep. Includes a 28-pattern claim→search library with rule_id namespaces and CWE assignments. Validated against the NextGen 2026-05-01 rerun: 12 repos previously scored zero-findings by SAST were re-scanned, producing 75 confirmed findings (8 CRITICAL, 17 HIGH) with zero false alarms - including 2 production SQL injections, an in-process RCE shape via Flee+Dynamic LINQ, and an inverted-boolean TLS bypass library-amplified across all consumer Lambdas.

All three new agents emit unified-finding-v1 directly (no adapter) and append to memory/findings-<slug>.jsonl via jq after Phase 1b completes. recon-driven-scan legitimately emits [] for repos with empty/generic RECON narratives - this is not a failure.

Wiring updates: pipeline skill (artifacts table + Phase 1b agent list), command (5-agent dispatch + parallelization rule), exec-report-generator (agent → phase mapping), CLAUDE.md (agent registry 11 → 12).
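The append step described here is done with jq in the actual command. As a rough Python equivalent (a sketch, not the project's code), the function below reads one agent's JSON array and appends each finding as a line of JSONL. The function name and the file paths passed to it are hypothetical. An empty array is treated as a valid result, in line with the note that `[]` is not a failure.

```python
import json
from pathlib import Path


def append_findings(agent_output: Path, findings_jsonl: Path) -> int:
    """Append every finding in a JSON-array file to a JSONL file; return how many were added."""
    findings = json.loads(agent_output.read_text(encoding="utf-8"))
    if not isinstance(findings, list):
        raise ValueError(f"{agent_output} must contain a JSON array of findings")

    # Mode "a" preserves findings already written by earlier phases.
    with findings_jsonl.open("a", encoding="utf-8") as out:
        for finding in findings:
            # One compact JSON object per line is what makes the file JSONL.
            out.write(json.dumps(finding, separators=(",", ":")) + "\n")
    return len(findings)
```

Calling it once per agent output after Phase 1b completes keeps the parallel agents from writing to the shared file concurrently, which is presumably why the append is a separate step.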

Manual changelog entry for the Phase 1b expansion. release-please will generate the canonical 2.2.0 entry from conventional commits when this lands on main; this entry serves as a working preview.

bdfinst added 3 commits May 1, 2026 13:31

chore(security-assessment): release 2.2.0

6296245

bdfinst mentioned this pull request May 1, 2026

feat(security-assessment): recalibrate CRITICAL threshold against opus_repo_scan_test reference (v2.3.0) #26

Merged

4 tasks

bdfinst merged commit 4e77037 into main May 1, 2026
1 check passed

bdfinst deleted the feat/phase-1b-expansion branch May 1, 2026 21:38
