feat(metrics): add change-history (VCS) metrics with a git backend: churn, authors, ownership, bug-fix / security-fix counts, composite risk score (#328)

## Summary

Add a new family of code metrics derived from **version control history**, as a peer to the existing AST-derived metrics. The goal is to surface files that are most likely to contain **vulnerabilities OR bugs**, using the signals the empirical literature most consistently backs.

**Naming principle.** `vcs` is a generic abstraction; `git` is the **v1 backend**. Future backends (Mercurial, Jujutsu, Pijul) are plausible and must not require renaming. Generic types live under the `vcs` namespace; git-specific code lives under `vcs::git`. Cargo features mirror this split (umbrella `vcs = ["vcs-git"]`, exactly like the existing `all-languages` umbrella).

Motivation and evidence come from the synthesis in `vulnerability-correlation.md` (in the repo root) and from the broader defect-prediction literature. Published effect sizes include: Firefox NumChanges with PD (probability of detection) 86 / PF (probability of false alarm) 23; RHEL4 files with ≥9 developers reported as ~16× more likely to harbor a vulnerability; Windows Vista edit frequency at ρ ≈ 0.29; and Hassan's change entropy reaching Pearson 0.54 with file-level defects on Apache projects. Nagappan & Ball's relative-churn measures also underpin Tornhill's well-known complexity × churn "hotspot" model.

A working shell-script prototype lives at `git-history-risk-rank.sh`; this issue replaces it with a first-class, tested, cross-platform Rust implementation and broadens the signal set.

This is the **first metric family in the project that is language-agnostic and not AST-derived**, so it also establishes the architectural pattern for any future non-AST signals.

## Scope (v1)

v1 ships **one backend** (`vcs-git`), behind the umbrella `vcs` Cargo feature. Adding a second backend later is a separate issue and does not require renaming or moving v1 code.

A single history walk produces these per-file signals over two configurable time windows (defaults: 12 months "long", 90 days "recent"):

| Field | Type | Description | Primary literature support |
|---|---|---|---|
| `commits_long` | u32 | Distinct commits touching the file in the long window | Firefox NumChanges (PD 86); Vista edit-freq ρ ≈ 0.29 |
| `commits_recent` | u32 | Same, recent window | JIT defect prediction; Firefox/RHEL4 |
| `churn_long` | u64 | Σ(added + deleted lines) in long window | Nagappan & Ball relative-churn; Firefox LinesChanged (PD 85) |
| `churn_recent` | u64 | Same, recent window | JIT defect prediction |
| `authors_long` | u32 | Distinct canonical author identities in long window | RHEL4 (≥9 → 16×); Vista NumEngineers ρ ≈ 0.26 |
| `authors_recent` | u32 | Same, recent window | Same lineage |
| `ownership_top_share` | f64 ∈ [0,1] | Share of edits in long window attributable to the top author; lower = more diluted | Avelino DoA / truck-factor heuristic |
| `burst` | f64 | `commits_recent / commits_long`, clamped to `[0, 1]` | Vista "repeat frequency" ρ ≈ 0.27 |
| `bug_fix_commits` | u32 | Long-window commits whose message matches a bug-fix keyword regex | Pascarella/Bavota commit-message classification |
| `security_fix_commits` | u32 | Long-window commits matching security keywords (CVE-####, security, vuln, exploit, sanitize, etc.) | Sentence-Level VFC studies; PySecDB |
| `revert_commits` | u32 | Long-window commits whose subject matches `^Revert ` / `rollback` | Stability proxy |
| `age_days` | u32 | Days since the file's first commit (capped at window) | Chromium "new features" risk |
| `last_modified_days` | u32 | Days since the file's most recent commit | Operational filter / staleness |
| `risk_score` | f64 | Composite, formula-versioned (see below) | Literature-derived; non-cardinal |
| `risk_score_version` | u32 | Increments any time the formula changes | Forward-compatibility |
| `hotspot_score` | Option\<f64\> | `complexity_index × churn_recent`, present only when AST metrics are also computed | Nagappan & Ball; Tornhill |
| `vcs_schema_version` | u32 | Output shape version | Forward-compatibility |
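
The three derived signals in the table (`burst`, `ownership_top_share`, `hotspot_score`) are simple functions of the raw counts. A minimal sketch follows; the function signatures are illustrative, not the final API, and the zero-denominator behavior (returning `0.0`) is an assumption the table does not specify.

```rust
/// `burst`: share of long-window commits that also fall in the recent window.
/// Assumption: a file with no long-window commits yields 0.0 instead of NaN.
pub fn burst(commits_recent: u32, commits_long: u32) -> f64 {
    if commits_long == 0 {
        return 0.0;
    }
    (f64::from(commits_recent) / f64::from(commits_long)).clamp(0.0, 1.0)
}

/// `ownership_top_share`: fraction of edits made by the most active author.
/// `edits_per_author` holds one count per canonical author identity.
/// Assumption: an empty or all-zero slice yields 0.0.
pub fn ownership_top_share(edits_per_author: &[u32]) -> f64 {
    let total: u64 = edits_per_author.iter().map(|&n| u64::from(n)).sum();
    let top = edits_per_author.iter().copied().max().unwrap_or(0);
    if total == 0 {
        0.0
    } else {
        f64::from(top) / total as f64
    }
}

/// `hotspot_score`: present only when an AST-derived complexity index exists.
pub fn hotspot_score(complexity_index: Option<f64>, churn_recent: u64) -> Option<f64> {
    complexity_index.map(|c| c * churn_recent as f64)
}
```

Returning `Option<f64>` from `hotspot_score` keeps "AST metrics were not computed" distinct from a genuine score of zero.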

The composite score log-scales every count and weights recent churn and recent commits highest. The author term is scaled by `(1 + dilution)`, where `dilution = 1 - ownership_top_share`, so diluted ownership raises the score and a sole owner leaves the author term unchanged. File size enters only as a small tie-breaker. Two categorical bonuses are added to a single final multiplier: the RHEL4 developer-count thresholds (+0.15 at ≥ 6 authors, +0.35 at ≥ 9) and a new-file bonus (+0.15 when `age_days < recent_window_days`), so a new file with nine or more authors is scaled by 1.50×. Bug-fix and security-fix commit counts feed in via a log-scaled additive term with double weight on security fixes. The exact formula is below and is also documented in `src/vcs/score.rs` with citations; `risk_score_version` lets it evolve without breaking downstream consumers.

An alternative percentile-based score is available via `--risk-formula percentile`: each signal is re-ranked to its percentile within the analyzed set, then averaged. The literature explicitly recommends relative/percentile triggers over hard thresholds for cross-project robustness.
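
One possible reading of the percentile formula is sketched below. The rank definition (fraction of values less than or equal to the current one, so ties share a rank) and the function names are assumptions; the issue only states that each signal is re-ranked to its percentile and then averaged. Inputs are assumed finite, and all signal vectors are assumed to have one entry per file.

```rust
/// Percentile rank of every value within `values`, in (0, 1].
/// Assumption: rank = (count of values <= v) / n, so ties share a rank.
pub fn percentile_ranks(values: &[f64]) -> Vec<f64> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len() as f64;
    values
        .iter()
        // partition_point returns how many sorted entries are <= v.
        .map(|v| sorted.partition_point(|w| w <= v) as f64 / n)
        .collect()
}

/// Average the per-signal percentile ranks for each file.
/// `signals[s][f]` is the value of signal `s` for file `f`.
pub fn percentile_scores(signals: &[Vec<f64>]) -> Vec<f64> {
    let files = signals.first().map_or(0, Vec::len);
    let mut scores = vec![0.0; files];
    for signal in signals {
        for (score, rank) in scores.iter_mut().zip(percentile_ranks(signal)) {
            *score += rank / signals.len() as f64;
        }
    }
    scores
}
```

Because every rank is relative to the analyzed set, the resulting score is only comparable between files of the same run, which matches the ordinal interpretation of the weighted score.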

## Explicitly out of scope (filed as follow-ups)

- Per-function granularity via `git blame` + AST line spans
- Change entropy and co-change graph entropy (Hassan 2009; arXiv 2504.18511, 2025)
- Just-in-time (commit-level) risk scoring (Kamei et al.)
- Directory- and repo-level bus factor (Avelino DoA)
- Full SZZ bug-inducing commit detection (developer-validated SZZ recall remains ≈0.55 even with LLM augmentation; out of scope for a metrics library)
- Historical metric trend (time series over N historical points)
- Persistent VCS history cache keyed by HEAD SHA
- CVE / advisory linkage
- Dependency graph integration
- Submodule recursion
- VCS backends other than git (Mercurial, Jujutsu, Pijul, …)

## Architecture

- New module tree under `src/vcs/`:
  - **Generic** (always compiled when any backend is enabled): `error`, `options`, `stats`, `identity`, `classify`, `score`, `hotspot`, and a `build_history_index(root, options)` entry point.
  - **Backend-specific**: `src/vcs/git/` (`repo`, `history`, `identity`) — gated by `vcs-git`.
- v1 does **not** introduce a `Backend` trait (premature abstraction with one backend). The top-level entry point delegates to the single available backend; the trait is extracted when a second backend lands.
- Hierarchical Cargo features (mirrors the existing `all-languages` umbrella):

```toml
[features]
vcs     = ["vcs-git"]                                # umbrella; future: ["vcs-git", "vcs-hg", "vcs-jj"]
vcs-git = ["dep:gix", "dep:bstr", "dep:regex"]       # leaf
```

CLI/web/py crates list `"vcs"` in their default features so end-user binaries pick up every backend that ships.
- `gix` feature set: `["max-performance-safe", "blob-diff", "mailmap", "revision", "index"]`.
- `build_history_index` runs ONCE per invocation (before the AST walk) and produces `HashMap<repo-relative-path, FileStats>`. Walking history per file would be catastrophic on large repos.
- `CodeMetrics` (in `src/spaces.rs`) gains `pub vcs: Option<vcs::Stats>` and a `Vcs` variant in `Metric` (mark `#[non_exhaustive]` if not already).
- New `bca vcs` subcommand mirrors the prototype's ranked-list output. Integration into `bca metrics`/`check`/`report` via `--metrics vcs` is also wired up. The subcommand is backend-agnostic; it probes the working tree to decide which backend to use.
- New `POST /vcs` endpoint on `bca-web`; new `vcs_metrics(...)` on Python bindings; opt-in `vcs=True` parameter on the existing Python `analyze()`.
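
The "walk history once, then look up per file" design can be sketched without any git dependency. Everything named here (`CommitRecord`, `FileCounts`, `build_index`) is hypothetical: the real backend would produce the commit stream from a `gix` history walk, and paths would be `bstr::BString` instead of `String`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, already-decoded commit as the backend might yield it.
pub struct CommitRecord {
    pub author: String,
    /// Whole days between the commit and the reference time.
    pub age_days: u32,
    /// (repo-relative path, added + deleted lines) for each touched file.
    pub files: Vec<(String, u64)>,
}

/// Hypothetical per-file accumulator for the raw counts.
#[derive(Debug, Default)]
pub struct FileCounts {
    pub commits_long: u32,
    pub commits_recent: u32,
    pub churn_long: u64,
    pub churn_recent: u64,
    pub authors_long: HashSet<String>,
}

/// Single pass over the commit stream; every file is updated in place.
pub fn build_index(
    commits: &[CommitRecord],
    long_days: u32,
    recent_days: u32,
) -> HashMap<String, FileCounts> {
    let mut index: HashMap<String, FileCounts> = HashMap::new();
    // Window boundary is inclusive, matching the edge-case list.
    for commit in commits.iter().filter(|c| c.age_days <= long_days) {
        let recent = commit.age_days <= recent_days;
        for (path, lines) in &commit.files {
            let entry = index.entry(path.clone()).or_default();
            entry.commits_long += 1;
            entry.churn_long += *lines;
            entry.authors_long.insert(commit.author.clone());
            if recent {
                entry.commits_recent += 1;
                entry.churn_recent += *lines;
            }
        }
    }
    index
}
```

The cost is proportional to the number of (commit, file) pairs in the long window, independent of how many files the AST walk later looks up.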

## CLI surface

`bca vcs` flags:

- `--long-window <DURATION>` (default `12mo`)
- `--recent-window <DURATION>` (default `90d`)
- `--top <N>` (default `50`; `0` = all)
- `--ref <REF>` (default `HEAD`)
- `--full-history` (default: first-parent only)
- `--include-merges` (default: skip merges)
- `--no-follow-renames` (default: follow)
- `--include-deleted` (default: skip deleted files)
- `--no-exclude-bots`, `--bot-pattern <REGEX>` (default exclude: `dependabot[bot]`, `renovate[bot]`, `github-actions[bot]`, `pre-commit-ci[bot]`, `mergify[bot]`, `pyup-bot`)
- `--as-of <RFC3339>` (default: wall clock) — for reproducible snapshots
- `--risk-formula {weighted|percentile}` (default: `weighted`)
- `--emit-author-details` (default: off; opts into SHA-256-hashed canonical author IDs)
- Reuses global `--paths` / `--include` / `--exclude` / `--exclude-tests` / `--no-ignore`
- Reuses global output-format flags (JSON / YAML / TOML / CSV)

When the input path is not under a working tree of a supported VCS, `bca vcs` errors clearly; `bca metrics --metrics vcs` succeeds with a one-shot warning and omits the `vcs` field per file.

## Edge cases the implementation must handle

- `.mailmap` respected; multiple author emails canonicalized to one identity
- `Co-authored-by:` trailers parsed and counted
- Bot identities filtered by default; configurable regex
- First-parent history by default; `--full-history` opts into full DAG
- Merge commits skipped by default; `--include-merges` to include
- File rename detection on by default
- Shallow clones detected; output flag `truncated_shallow_clone: true` and a warning
- Bare repos and worktrees both supported via `gix::open`
- Submodules NOT recursed into; documented as out-of-scope
- Binary files skipped (numstat reports `-`)
- Symlinks skipped
- Deleted files skipped by default; `--include-deleted` opt-in
- Untracked / gitignored files: `vcs` field is `None`, distinct from a tracked file with zero counts in window
- Window units: `12mo`, `90d`, `2y`, `8w`, or ISO 8601 `P12M`
- Window inclusive boundary at `now - window`
- Future-dated commits (clock skew) clamped to `now()`
- All time math in UTC; `--as-of <RFC3339>` for reproducible runs
- Author emails never emitted by default; `--emit-author-details` opts into SHA-256 hashed canonical IDs
- All path handling via `bstr::BString`; UTF-8 conversion only at output boundary with explicit error handling (per AGENTS.md path rules)
- No `unsafe`; no `unwrap`/`expect`/`panic!` in non-test code; all `gix` errors mapped to typed `vcs::Error`
- `Metric` enum marked `#[non_exhaustive]` so future variants don't break consumers
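
Two of the time-related edge cases above (window units, and the inclusive boundary with future-date clamping) are sketched here. The function names are hypothetical, ISO 8601 input such as `P12M` is deliberately not handled, and the month length is an assumption: a month is treated as 365/12 days, rounded, so that `12mo` maps to 365 days.

```rust
/// Parse `12mo`, `90d`, `2y`, or `8w` into whole days.
/// Assumptions: 1y = 365 days, 1mo = 365/12 days (rounded to nearest).
pub fn parse_window_days(input: &str) -> Option<u32> {
    // Split at the first non-digit; a missing unit is rejected.
    let split = input.find(|c: char| !c.is_ascii_digit())?;
    let (digits, unit) = input.split_at(split);
    let n: u64 = digits.parse().ok()?;
    let days = match unit {
        "d" => n,
        "w" => n.checked_mul(7)?,
        "mo" => n.checked_mul(365)?.checked_add(6)? / 12,
        "y" => n.checked_mul(365)?,
        _ => return None,
    };
    if days <= u64::from(u32::MAX) {
        Some(days as u32)
    } else {
        None
    }
}

/// True when a commit falls inside the window that ends at `now`.
/// Timestamps are UNIX seconds in UTC. Future-dated commits are clamped
/// to `now`, and the lower boundary `now - window` is inclusive.
pub fn in_window(commit_ts: i64, now: i64, window_days: u32) -> bool {
    let clamped = commit_ts.min(now);
    let start = now.saturating_sub(i64::from(window_days) * 86_400);
    clamped >= start
}
```

Passing `now` explicitly, instead of reading the wall clock inside the function, is what makes `--as-of` snapshots and the exact-boundary unit tests reproducible.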

## Composite risk-score formula (v1)

Log-scaled weighted sum, plus categorical multiplicative bumps:

```
recency_churn  = ln(1 + churn_recent)
long_churn     = ln(1 + churn_long)
recency_count  = ln(1 + commits_recent)
long_count     = ln(1 + commits_long)
author_factor  = ln(1 + authors_long)
dilution       = (1 - ownership_top_share).clamp(0.0, 1.0)
fix_factor     = ln(1 + bug_fix_commits + 2 * security_fix_commits)
size_factor    = ln(1 + sloc).powi(2) / 100.0    // tiny tie-breaker
new_file_bonus = if age_days < recent_window_days { 0.15 } else { 0.0 }
dev_bonus      = if authors_long >= 9 { 0.35 }
                 else if authors_long >= 6 { 0.15 }
                 else { 0.0 }

base = 0.30 * recency_churn
     + 0.25 * recency_count
     + 0.15 * long_count
     + 0.15 * author_factor * (1.0 + dilution)
     + 0.10 * fix_factor
     + 0.05 * long_churn
     + size_factor

risk_score = base * (1.0 + dev_bonus + new_file_bonus)
risk_score_version = 1
```

Documented in `src/vcs/score.rs` with full citations. Score is **ordinal, not cardinal**: only relative ranks have meaning.
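
A direct Rust transcription of the v1 formula might look like the sketch below. The `ScoreInputs` struct and the function name are hypothetical; the weights and thresholds are taken from the formula above, and `sloc` is assumed to come from the existing AST metrics.

```rust
/// Hypothetical input bundle; field names follow the formula above.
#[derive(Debug, Clone)]
pub struct ScoreInputs {
    pub churn_recent: u64,
    pub churn_long: u64,
    pub commits_recent: u32,
    pub commits_long: u32,
    pub authors_long: u32,
    pub ownership_top_share: f64,
    pub bug_fix_commits: u32,
    pub security_fix_commits: u32,
    pub sloc: u64,
    pub age_days: u32,
    pub recent_window_days: u32,
}

/// ln(1 + n) for a count.
fn ln1p(n: u64) -> f64 {
    (n as f64).ln_1p()
}

pub fn risk_score_v1(i: &ScoreInputs) -> f64 {
    let recency_churn = ln1p(i.churn_recent);
    let long_churn = ln1p(i.churn_long);
    let recency_count = ln1p(u64::from(i.commits_recent));
    let long_count = ln1p(u64::from(i.commits_long));
    let author_factor = ln1p(u64::from(i.authors_long));
    let dilution = (1.0 - i.ownership_top_share).clamp(0.0, 1.0);
    // Security fixes count double.
    let fix_factor =
        ln1p(u64::from(i.bug_fix_commits) + 2 * u64::from(i.security_fix_commits));
    let size_factor = ln1p(i.sloc).powi(2) / 100.0;
    let new_file_bonus = if i.age_days < i.recent_window_days { 0.15 } else { 0.0 };
    let dev_bonus = if i.authors_long >= 9 {
        0.35
    } else if i.authors_long >= 6 {
        0.15
    } else {
        0.0
    };

    let base = 0.30 * recency_churn
        + 0.25 * recency_count
        + 0.15 * long_count
        + 0.15 * author_factor * (1.0 + dilution)
        + 0.10 * fix_factor
        + 0.05 * long_churn
        + size_factor;

    // Bonuses add into one multiplier; they do not compound.
    base * (1.0 + dev_bonus + new_file_bonus)
}
```

Since the score is ordinal, tests against a function like this are best written as comparisons between two inputs (for example, more recent churn must not lower the score) and not as exact float equality.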

## Output shape (additive to `CodeMetrics`)

```json
{
  "name": "src/foo.rs",
  "metrics": {
    "loc": { "...": "..." },
    "cyclomatic": { "...": "..." },
    "vcs": {
      "vcs_schema_version": 1,
      "risk_score_version": 1,
      "long_window_days": 365,
      "recent_window_days": 90,
      "commits_long": 42, "commits_recent": 11,
      "churn_long": 2150, "churn_recent": 480,
      "authors_long": 7, "authors_recent": 3,
      "ownership_top_share": 0.41,
      "burst": 0.26,
      "bug_fix_commits": 9,
      "security_fix_commits": 2,
      "revert_commits": 0,
      "age_days": 540,
      "last_modified_days": 7,
      "risk_score": 187.3,
      "hotspot_score": 423.1
    }
  }
}
```

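Adding an optional `vcs` field to `CodeMetrics` is backwards-compatible for consumers that ignore unknown keys (`serde` deserialization does so by default; #253 confirms this). The `Metric` enum gains one variant, so confirm it is `#[non_exhaustive]` to keep future additions non-breaking. The numeric values in the sample above are illustrative only and are not computed from the v1 formula.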

## Test strategy

- Central helper `tests/common/vcs_fixture.rs` builds deterministic temp git repos via `gix` with fixed author identities and UNIX timestamps.
- Per-signal unit tests assert exact integer counts against known fixtures: empty repo, single commit, two-author file, bot-excluded vs included, mailmap-canonicalized author, Co-authored-by, renamed file (with and without `--no-follow-renames`), exact window boundary (inclusive at `now - window`), keyword classification (bug/security/revert positive + false-positive avoidance).
- Score-property tests use comparative assertions, not exact floats: high-churn beats low-churn, diluted-ownership beats concentrated, high author-count beats low, new-and-busy beats old-and-quiet.
- Integration tests under `tests/`:
  - `bca vcs --paths <fixture-repo>` → anchored JSON snapshot (per `.snapshot-anchor-baseline.txt` rules: `assert_eq!` on integer fields above each `insta::assert_json_snapshot!`).
  - `bca metrics --metrics cyclomatic,vcs --paths <fixture-repo>` → mixed-output snapshot.
  - `bca vcs` outside a git repo → non-zero exit with clear error.
  - `bca metrics --metrics vcs` outside a git repo → succeeds with warning and omitted `vcs` field.
- Defensive-refactor verification (per `.claude/rules/testing.md`): any tightening predicate gets a `git checkout HEAD~1` revert test to prove it would fail against the pre-refactor code.
- `cargo build` / `cargo test --workspace --all-features` / `cargo clippy ... -D warnings` all pass with and without the `vcs` feature.
- `make pre-commit` (full validation gate) clean before submission.
- Mutation testing of `src/vcs/` added to the quarterly cron in a separate, follow-up PR (out-of-band, not v1).
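
The keyword-classification tests (positive matches plus false-positive avoidance) could target a classifier shaped like the sketch below. It swaps the planned regexes for plain whole-word matching so that it needs only the standard library. The bug-fix word list is a placeholder, since the issue does not enumerate it, and the security list covers only the keywords named in the signal table.

```rust
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Classification {
    pub bug_fix: bool,
    pub security_fix: bool,
    pub revert: bool,
}

// Placeholder list: the real bug-fix keywords are not specified here.
const BUG_WORDS: &[&str] = &["fix", "fixes", "fixed", "bug", "bugfix"];
const SECURITY_WORDS: &[&str] = &["security", "vuln", "exploit", "sanitize"];

/// Split lowercased text into alphanumeric words.
fn tokens(lower: &str) -> Vec<&str> {
    lower
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|w| !w.is_empty())
        .collect()
}

fn has_any(words: &[&str], list: &[&str]) -> bool {
    words.iter().any(|w| list.contains(w))
}

pub fn classify(message: &str) -> Classification {
    let subject = message.lines().next().unwrap_or("");
    let lower = message.to_ascii_lowercase();
    let words = tokens(&lower);
    let lower_subject = subject.to_ascii_lowercase();
    let subject_words = tokens(&lower_subject);
    // "CVE-2024-1234" tokenizes to ["cve", "2024", "1234"]:
    // require "cve" followed by a four-digit group.
    let has_cve = words.windows(2).any(|p| {
        p[0] == "cve" && p[1].len() == 4 && p[1].bytes().all(|b| b.is_ascii_digit())
    });
    Classification {
        bug_fix: has_any(&words, BUG_WORDS),
        security_fix: has_any(&words, SECURITY_WORDS) || has_cve,
        revert: subject.starts_with("Revert ") || subject_words.contains(&"rollback"),
    }
}
```

Whole-word matching is what keeps a subject such as "Add prefix option" from counting as a bug fix; a regex-based version would need word-boundary anchors to get the same effect.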

## Acceptance checklist

- [ ] `gix` integrated behind the **`vcs-git`** Cargo feature with the explicit feature list above; umbrella `vcs = ["vcs-git"]` registered at the workspace root
- [ ] All v1 signals implemented and unit-tested with deterministic synthetic git repos (fixed authors + UNIX timestamps)
- [ ] `bca vcs` subcommand reproduces the prototype's ranked-list output, plus configurable windows, history mode, merge handling, bot filtering, identity emission, and `--as-of`
- [ ] VCS fields available via `bca metrics`/`check`/`report` when `--metrics vcs` is selected
- [ ] `POST /vcs` endpoint, plus optional `repo_path` on `POST /metrics` for AST + VCS in one call
- [ ] `vcs_metrics()` Python function, plus opt-in `vcs=True` on `analyze()`
- [ ] `metrics/vcs.md` mdBook chapter and updates to `recipes/rest-api.md`
- [ ] Composite score formula documented in code AND in the book with explicit citations back to `vulnerability-correlation.md`
- [ ] `Metric` enum is `#[non_exhaustive]`
- [ ] All edge cases above handled with at least one test each
- [ ] Defensive-refactor verification (per `.claude/rules/testing.md`) for any tightening predicates
- [ ] Anchored snapshots per `.snapshot-anchor-baseline.txt` rules
- [ ] `make pre-commit` clean (full validation gate)
- [ ] Follow-up issues filed for per-function granularity, change entropy, JIT, bus factor, history trend, persistent cache, and additional VCS backends
- [ ] `git-history-risk-rank.sh` either deleted or reduced to a documented historical reference

## References

- `vulnerability-correlation.md` (this repo)
- Nagappan & Ball, "Use of Relative Code Churn Measures to Predict System Defect Density" (2005)
- Hassan, "Predicting Faults Using the Complexity of Code Changes" (2009, change entropy)
- Tornhill, "Your Code as a Crime Scene" (hotspots = complexity × churn)
- Kamei et al., systematic survey of just-in-time defect prediction (ACM Computing Surveys, 2022) — <https://damevski.github.io/files/report_CSUR_2022.pdf>
- Avelino et al., DoA-based truck-factor / bus-factor algorithms — <https://www.sciencedirect.com/science/article/pii/S0020025526002847>
- Co-Change Graph Entropy (ICEAS 2025) — <https://arxiv.org/abs/2504.18511>
- Comprehensive evaluation of SZZ variants (Rosa et al. 2023) — <https://www.inf.usi.ch/faculty/lanza/Downloads/Journals/Rosa2023a.pdf>
- OpenSSF Scorecard checks — <https://github.com/ossf/scorecard/blob/main/docs/checks.md> (adjacent repository-level signal set; not reimplemented)
- gitoxide / gix project — <https://github.com/GitoxideLabs/gitoxide>
- Gitmailmap docs — <https://git-scm.com/docs/gitmailmap>
- PySecDB — Exploring Security Commits in Python (ICSME 2023) — <https://shuwang.phd/papers/icsme23_PySecDB.pdf>

