feat(metrics): add change-history (VCS) metrics with a git backend — churn, authors, ownership, bug-fix / security-fix counts, composite risk score #328

@dekobon

Summary

Add a new family of code metrics derived from version control history, as a peer to the existing AST-derived metrics. The goal is to surface files that are most likely to contain vulnerabilities OR bugs, using the signals the empirical literature most consistently backs.

Naming principle. vcs is a generic abstraction; git is the v1 backend. Future backends (Mercurial, Jujutsu, Pijul) are plausible and must not require renaming. Generic types live under the vcs namespace; git-specific code lives under vcs::git. Cargo features mirror this split (umbrella vcs = ["vcs-git"], exactly like the existing all-languages umbrella).

Motivation and evidence come from the synthesis in vulnerability-correlation.md (in the repo root) plus the broader defect-prediction literature. Published effect sizes include Firefox NumChanges at PD 86 / PF 23 (probability of detection and probability of false alarm, in percent), RHEL4 files with ≥9 developers reported as ~16× more likely to harbor a vulnerability, Windows Vista edit-frequency ρ ≈ 0.29, and Hassan's change entropy reaching Pearson 0.54 with file-level defects on Apache projects. Nagappan & Ball's relative-churn measures also form the basis of Tornhill's well-known complexity × churn "hotspot" model.

A working shell-script prototype lives at git-history-risk-rank.sh; this issue replaces it with a first-class, tested, cross-platform Rust implementation and broadens the signal set.

This is the first metric family in the project that is language-agnostic and not AST-derived, so it also establishes the architectural pattern for any future non-AST signals.

Scope (v1)

v1 ships one backend (vcs-git), behind the umbrella vcs Cargo feature. Adding a second backend later is a separate issue and does not require renaming or moving v1 code.

A single history walk produces these per-file signals over two configurable time windows (defaults: 12 months "long", 90 days "recent"):

| Field | Type | Description | Primary literature support |
| --- | --- | --- | --- |
| commits_long | u32 | Distinct commits touching the file in the long window | Firefox NumChanges (PD 86); Vista edit-freq ρ ≈ 0.29 |
| commits_recent | u32 | Same, recent window | JIT defect prediction; Firefox/RHEL4 |
| churn_long | u64 | Σ(added + deleted lines) in long window | Nagappan & Ball relative-churn; Firefox LinesChanged (PD 85) |
| churn_recent | u64 | Same, recent window | JIT defect prediction |
| authors_long | u32 | Distinct canonical author identities in long window | RHEL4 (≥9 → 16×); Vista NumEngineers ρ ≈ 0.26 |
| authors_recent | u32 | Same, recent window | Same lineage |
| ownership_top_share | f64 ∈ [0,1] | Share of edits in long window attributable to the top author; lower = more diluted | Avelino DoA / truck-factor heuristic |
| burst | f64 | commits_recent / commits_long, clamped to [0, 1] | Vista "repeat frequency" ρ ≈ 0.27 |
| bug_fix_commits | u32 | Long-window commits whose message matches a bug-fix keyword regex | Pascarella/Bavota commit-message classification |
| security_fix_commits | u32 | Long-window commits matching security keywords (CVE-####, security, vuln, exploit, sanitize, etc.) | Sentence-Level VFC studies; PySecDB |
| revert_commits | u32 | Long-window commits whose subject matches ^Revert / rollback | Stability proxy |
| age_days | u32 | Days since the file's first commit (capped at window) | Chromium "new features" risk |
| last_modified_days | u32 | Days since the file's most recent commit | Operational filter / staleness |
| risk_score | f64 | Composite, formula-versioned (see below) | Literature-derived; non-cardinal |
| risk_score_version | u32 | Increments any time the formula changes | Forward-compatibility |
| hotspot_score | Option<f64> | complexity_index × churn_recent, present only when AST metrics are also computed | Nagappan & Ball; Tornhill |
| vcs_schema_version | u32 | Output shape version | Forward-compatibility |
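
The three derived fields (ownership_top_share, burst, hotspot_score) are simple ratios and products of the raw counts. The sketch below shows one way they could be computed; the function signatures, the per-author input slice, and the zero-activity fallbacks are assumptions for illustration, not part of the proposal.

```rust
/// Share of long-window edits made by the most active author, in [0, 1].
/// `edits_per_author` holds one edit count per canonical author
/// (hypothetical input shape).
fn ownership_top_share(edits_per_author: &[u32]) -> f64 {
    let total: u64 = edits_per_author.iter().map(|&n| u64::from(n)).sum();
    if total == 0 {
        // Assumption: with no edits in the window there is no owner.
        return 0.0;
    }
    let top = edits_per_author.iter().copied().max().unwrap_or(0);
    f64::from(top) / total as f64
}

/// commits_recent / commits_long, clamped to [0, 1].
/// Assumption: returns 0 when there is no long-window activity.
fn burst(commits_recent: u32, commits_long: u32) -> f64 {
    if commits_long == 0 {
        return 0.0;
    }
    (f64::from(commits_recent) / f64::from(commits_long)).clamp(0.0, 1.0)
}

/// complexity_index × churn_recent, present only when AST metrics exist.
fn hotspot_score(complexity_index: Option<f64>, churn_recent: u64) -> Option<f64> {
    complexity_index.map(|c| c * churn_recent as f64)
}
```

Guarding the zero denominators explicitly keeps NaN out of the serialized output, which matters because the fields are later ranked and snapshotted.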

The composite score log-scales every count, weights recent churn and recent commits highest, scales the author factor by (1 + dilution), where dilution = 1 - ownership_top_share, and treats file size as a tiny tie-breaker. It then applies categorical multiplicative bumps: +0.15 at the RHEL4 6-developer threshold or +0.35 at the 9-developer threshold, plus +0.15 for new files (age_days < recent_window_days). The bumps are summed before multiplying, so a new file with nine or more authors is scaled by 1.50×. Bug-fix and security-fix commit counts feed in through a single log-scaled additive term with double weight on security fixes. The exact formula is below and is also documented in src/vcs/score.rs with citations; risk_score_version lets it evolve without breaking downstream consumers.

An alternative percentile-based score is available via --risk-formula percentile: each signal is re-ranked to its percentile within the analyzed set, then averaged. The literature explicitly recommends relative/percentile triggers over hard thresholds for cross-project robustness.
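
A minimal sketch of the percentile formula, assuming each signal arrives as one column of values (one entry per analyzed file) and that a file's percentile is the fraction of files with a value less than or equal to its own. The function names and the tie handling are illustrative assumptions; the issue does not specify them.

```rust
/// Percentile rank of each value within `values`: the fraction of
/// entries that are <= it. Quadratic in the number of files, which is
/// fine for a sketch; sorting once would be the obvious optimization.
fn percentile_ranks(values: &[f64]) -> Vec<f64> {
    let n = values.len() as f64;
    values
        .iter()
        .map(|v| values.iter().filter(|other| **other <= *v).count() as f64 / n)
        .collect()
}

/// `signals[s][f]` is signal `s` for file `f`.
/// Returns one score per file: the mean of its per-signal percentiles.
fn percentile_scores(signals: &[Vec<f64>]) -> Vec<f64> {
    let files = signals.first().map_or(0, Vec::len);
    let mut scores = vec![0.0; files];
    for signal in signals {
        for (score, rank) in scores.iter_mut().zip(percentile_ranks(signal)) {
            *score += rank;
        }
    }
    let k = signals.len().max(1) as f64;
    scores.iter().map(|s| s / k).collect()
}
```

Because every signal is mapped onto the same [0, 1] scale before averaging, no single high-magnitude signal such as churn can dominate the result.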

Explicitly out of scope (filed as follow-ups)

  • Per-function granularity via git blame + AST line spans
  • Change entropy and co-change graph entropy (Hassan 2009; arXiv 2504.18511, 2025)
  • Just-in-time (commit-level) risk scoring (Kamei et al.)
  • Directory- and repo-level bus factor (Avelino DoA)
  • Full SZZ bug-inducing commit detection (developer-validated SZZ recall remains ≈0.55 even with LLM augmentation; out of scope for a metrics library)
  • Historical metric trend (time series over N historical points)
  • Persistent VCS history cache keyed by HEAD SHA
  • CVE / advisory linkage
  • Dependency graph integration
  • Submodule recursion
  • VCS backends other than git (Mercurial, Jujutsu, Pijul, …)

Architecture

  • New module tree under src/vcs/:
    • Generic (always compiled when any backend is enabled): error, options, stats, identity, classify, score, hotspot, and a build_history_index(root, options) entry point.
    • Backend-specific: src/vcs/git/ (repo, history, identity) — gated by vcs-git.
  • v1 does not introduce a Backend trait (premature abstraction with one backend). The top-level entry point delegates to the single available backend; the trait is extracted when a second backend lands.
  • Hierarchical Cargo features (mirrors the existing all-languages umbrella):
[features]
vcs     = ["vcs-git"]                                # umbrella; future: ["vcs-git", "vcs-hg", "vcs-jj"]
vcs-git = ["dep:gix", "dep:bstr", "dep:regex"]       # leaf

CLI/web/py crates list "vcs" in their default features so end-user binaries pick up every backend that ships.

  • gix feature set: ["max-performance-safe", "blob-diff", "mailmap", "revision", "index"].
  • build_history_index runs ONCE per invocation (before the AST walk) and produces HashMap<repo-relative-path, FileStats>. Walking history per file would be catastrophic on large repos.
  • CodeMetrics (in src/spaces.rs) gains pub vcs: Option<vcs::Stats> and a Vcs variant in Metric (mark #[non_exhaustive] if not already).
  • New bca vcs subcommand mirrors the prototype's ranked-list output. Integration into bca metrics/check/report via --metrics vcs is also wired up. The subcommand is backend-agnostic; it probes the working tree to decide which backend to use.
  • New POST /vcs endpoint on bca-web; new vcs_metrics(...) on Python bindings; opt-in vcs=True parameter on the existing Python analyze().
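
The single-pass requirement on build_history_index can be illustrated with a small aggregation sketch. The Commit and FileChange shapes, the build_index name, and the use of String paths are simplifying assumptions; the real walk would come from the git backend and would carry paths as bstr::BString.

```rust
use std::collections::{HashMap, HashSet};

/// One file touched by one commit (hypothetical shape).
struct FileChange {
    path: String,
    added: u64,
    deleted: u64,
}

/// A commit reduced to what the aggregation needs (hypothetical shape).
struct Commit {
    author: String,
    changes: Vec<FileChange>,
}

/// Per-file accumulator; a subset of the v1 signals.
#[derive(Default)]
struct FileStats {
    commits_long: u32,
    churn_long: u64,
    authors: HashSet<String>,
}

/// Fold the whole history into per-file stats in one pass.
/// Assumes each path appears at most once per commit.
fn build_index(history: &[Commit]) -> HashMap<String, FileStats> {
    let mut index: HashMap<String, FileStats> = HashMap::new();
    for commit in history {
        for change in &commit.changes {
            let stats = index.entry(change.path.clone()).or_default();
            stats.commits_long += 1;
            stats.churn_long += change.added + change.deleted;
            stats.authors.insert(commit.author.clone());
        }
    }
    index
}
```

The cost is proportional to the number of (commit, file) pairs in the window, independent of how many files the AST walk later looks up.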

CLI surface

bca vcs flags:

  • --long-window <DURATION> (default 12mo)
  • --recent-window <DURATION> (default 90d)
  • --top <N> (default 50; 0 = all)
  • --ref <REF> (default HEAD)
  • --full-history (default: first-parent only)
  • --include-merges (default: skip merges)
  • --no-follow-renames (default: follow)
  • --include-deleted (default: skip deleted files)
  • --no-exclude-bots, --bot-pattern <REGEX> (default exclude: dependabot[bot], renovate[bot], github-actions[bot], pre-commit-ci[bot], mergify[bot], pyup-bot)
  • --as-of <RFC3339> (default: wall clock) — for reproducible snapshots
  • --risk-formula {weighted|percentile} (default: weighted)
  • --emit-author-details (default: off; opts into SHA-256-hashed canonical author IDs)
  • Reuses global --paths / --include / --exclude / --exclude-tests / --no-ignore
  • Reuses global output-format flags (JSON / YAML / TOML / CSV)
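
The short window forms accepted by --long-window and --recent-window could be parsed roughly as follows. This is a sketch under stated assumptions: parse_window_days is a hypothetical name, a month is taken as 1/12 of a 365-day year (so that 12mo yields the 365 days shown in the sample output), and ISO 8601 forms are left out.

```rust
/// Parse window strings such as "12mo", "90d", "2y", "8w" into whole days.
/// Returns None for anything else, including ISO 8601 forms like "P12M",
/// which this sketch does not handle.
fn parse_window_days(input: &str) -> Option<u32> {
    let s = input.trim();
    // Index of the first non-digit character, i.e. where the unit starts.
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (digits, unit) = s.split_at(split);
    let n: u32 = digits.parse().ok()?;
    match unit {
        "d" => Some(n),
        "w" => n.checked_mul(7),
        // Assumption: a month is 1/12 of a 365-day year, rounded down.
        "mo" => n.checked_mul(365).map(|d| d / 12),
        "y" => n.checked_mul(365),
        _ => None,
    }
}
```

Returning Option rather than panicking matches the rule that non-test code avoids unwrap and expect; a real implementation would map None to a typed vcs::Error.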

When the input path is not under a working tree of a supported VCS, bca vcs errors clearly; bca metrics --metrics vcs succeeds with a one-shot warning and omits the vcs field per file.

Edge cases the implementation must handle

  • .mailmap respected; multiple author emails canonicalized to one identity
  • Co-authored-by: trailers parsed and counted
  • Bot identities filtered by default; configurable regex
  • First-parent history by default; --full-history opts into full DAG
  • Merge commits skipped by default; --include-merges to include
  • File rename detection on by default
  • Shallow clones detected; output flag truncated_shallow_clone: true and a warning
  • Bare repos and worktrees both supported via gix::open
  • Submodules NOT recursed into; documented as out-of-scope
  • Binary files skipped (numstat reports -)
  • Symlinks skipped
  • Deleted files skipped by default; --include-deleted opt-in
  • Untracked / gitignored files: vcs field is None, distinct from a tracked file with zero counts in window
  • Window units: 12mo, 90d, 2y, 8w, or ISO 8601 P12M
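  • Window boundary is inclusive: a commit timestamped exactly at now - window counts as inside the window
  • Future-dated commits (clock skew) clamped to the reference time (wall clock, or --as-of when given)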
  • All time math in UTC; --as-of <RFC3339> for reproducible runs
  • Author emails never emitted by default; --emit-author-details opts into SHA-256 hashed canonical IDs
  • All path handling via bstr::BString; UTF-8 conversion only at output boundary with explicit error handling (per AGENTS.md path rules)
  • No unsafe; no unwrap/expect/panic! in non-test code; all gix errors mapped to typed vcs::Error
  • Metric enum marked #[non_exhaustive] so future variants don't break consumers
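
Two of the time-related edge cases above, the inclusive window boundary and the clamping of future-dated commits, can be sketched with plain UTC Unix timestamps. The Membership struct and classify_commit function are hypothetical names used only for illustration.

```rust
/// Which windows a commit falls into.
#[derive(Debug, PartialEq)]
struct Membership {
    in_long: bool,
    in_recent: bool,
}

const SECS_PER_DAY: i64 = 86_400;

/// `commit_ts` and `now` are Unix timestamps in UTC seconds; `now` is the
/// reference time (wall clock, or the --as-of instant).
fn classify_commit(commit_ts: i64, now: i64, long_days: u32, recent_days: u32) -> Membership {
    // Future-dated commits (clock skew) are treated as if made at `now`.
    let ts = commit_ts.min(now);
    let age_secs = now - ts;
    Membership {
        // `<=` makes the boundary inclusive: a commit exactly `window`
        // old still counts as inside the window.
        in_long: age_secs <= i64::from(long_days) * SECS_PER_DAY,
        in_recent: age_secs <= i64::from(recent_days) * SECS_PER_DAY,
    }
}
```

Passing the reference time in as a parameter, rather than reading the clock inside the function, is what makes the exact-boundary test case deterministic.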

Composite risk-score formula (v1)

Log-scaled weighted sum, plus categorical multiplicative bumps:

recency_churn  = ln(1 + churn_recent)
long_churn     = ln(1 + churn_long)
recency_count  = ln(1 + commits_recent)
long_count     = ln(1 + commits_long)
author_factor  = ln(1 + authors_long)
dilution       = (1.0 - ownership_top_share).clamp(0.0, 1.0)
fix_factor     = ln(1 + bug_fix_commits + 2 * security_fix_commits)
size_factor    = ln(1 + sloc).powi(2) / 100.0    // tiny tie-breaker
new_file_bonus = if age_days < recent_window_days { 0.15 } else { 0.0 }
dev_bonus      = if authors_long >= 9 { 0.35 }
                 else if authors_long >= 6 { 0.15 }
                 else { 0.0 }

base = 0.30 * recency_churn
     + 0.25 * recency_count
     + 0.15 * long_count
     + 0.15 * author_factor * (1.0 + dilution)
     + 0.10 * fix_factor
     + 0.05 * long_churn
     + size_factor

risk_score = base * (1.0 + dev_bonus + new_file_bonus)
risk_score_version = 1

Documented in src/vcs/score.rs with full citations. Score is ordinal, not cardinal: only relative ranks have meaning.
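
The formula above translates almost line for line into Rust. The following is a sketch rather than the contents of src/vcs/score.rs: the Signals struct, the ln1p helper, and the risk_score_v1 name are assumptions, while the weights and thresholds are copied from the formula.

```rust
/// Inputs to the v1 composite score (field names follow the signal table;
/// the struct itself is a hypothetical grouping).
#[derive(Clone)]
struct Signals {
    churn_recent: u64,
    churn_long: u64,
    commits_recent: u32,
    commits_long: u32,
    authors_long: u32,
    ownership_top_share: f64,
    bug_fix_commits: u32,
    security_fix_commits: u32,
    sloc: u64,
    age_days: u32,
}

/// ln(1 + n) for a non-negative count.
fn ln1p(n: u64) -> f64 {
    (n as f64).ln_1p()
}

fn risk_score_v1(s: &Signals, recent_window_days: u32) -> f64 {
    let recency_churn = ln1p(s.churn_recent);
    let long_churn = ln1p(s.churn_long);
    let recency_count = ln1p(u64::from(s.commits_recent));
    let long_count = ln1p(u64::from(s.commits_long));
    let author_factor = ln1p(u64::from(s.authors_long));
    let dilution = (1.0 - s.ownership_top_share).clamp(0.0, 1.0);
    // Security fixes carry double weight inside the log.
    let fix_factor =
        ln1p(u64::from(s.bug_fix_commits) + 2 * u64::from(s.security_fix_commits));
    let size_factor = ln1p(s.sloc).powi(2) / 100.0; // tiny tie-breaker
    let new_file_bonus = if s.age_days < recent_window_days { 0.15 } else { 0.0 };
    let dev_bonus = if s.authors_long >= 9 {
        0.35
    } else if s.authors_long >= 6 {
        0.15
    } else {
        0.0
    };

    let base = 0.30 * recency_churn
        + 0.25 * recency_count
        + 0.15 * long_count
        + 0.15 * author_factor * (1.0 + dilution)
        + 0.10 * fix_factor
        + 0.05 * long_churn
        + size_factor;

    base * (1.0 + dev_bonus + new_file_bonus)
}
```

Since the score is ordinal, tests against a function like this are best written as comparisons between two inputs (for example, the same file with and without recent churn) rather than as exact float expectations.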

Output shape (additive to CodeMetrics)

{
  "name": "src/foo.rs",
  "metrics": {
    "loc": { "...": "..." },
    "cyclomatic": { "...": "..." },
    "vcs": {
      "vcs_schema_version": 1,
      "risk_score_version": 1,
      "long_window_days": 365,
      "recent_window_days": 90,
      "commits_long": 42, "commits_recent": 11,
      "churn_long": 2150, "churn_recent": 480,
      "authors_long": 7, "authors_recent": 3,
      "ownership_top_share": 0.41,
      "burst": 0.26,
      "bug_fix_commits": 9,
      "security_fix_commits": 2,
      "revert_commits": 0,
      "age_days": 540,
      "last_modified_days": 7,
      "risk_score": 187.3,
      "hotspot_score": 423.1
    }
  }
}

Adding fields to CodeMetrics is backwards-compatible (serde makes additive changes safe; #253 confirms this). The Metric enum gains one variant; confirm it is marked #[non_exhaustive] so future additions are non-breaking. The numeric values in the example are illustrative and are not output of the v1 formula.

Test strategy

  • Central helper tests/common/vcs_fixture.rs builds deterministic temp git repos via gix with fixed author identities and UNIX timestamps.
  • Per-signal unit tests assert exact integer counts against known fixtures: empty repo, single commit, two-author file, bot-excluded vs included, mailmap-canonicalized author, Co-authored-by, renamed file (with and without --no-follow-renames), exact window boundary (inclusive at now - window), keyword classification (bug/security/revert positive + false-positive avoidance).
  • Score-property tests use comparative assertions, not exact floats: high-churn beats low-churn, diluted-ownership beats concentrated, high author-count beats low, new-and-busy beats old-and-quiet.
  • Integration tests under tests/:
    • bca vcs --paths <fixture-repo> → anchored JSON snapshot (per .snapshot-anchor-baseline.txt rules: assert_eq! on integer fields above each insta::assert_json_snapshot!).
    • bca metrics --metrics cyclomatic,vcs --paths <fixture-repo> → mixed-output snapshot.
    • bca vcs outside a git repo → non-zero exit with clear error.
    • bca metrics --metrics vcs outside a git repo → succeeds with warning and omitted vcs field.
  • Defensive-refactor verification (per .claude/rules/testing.md): any tightening predicate gets a git checkout HEAD~1 revert test to prove it would fail against the pre-refactor code.
  • cargo build / cargo test --workspace --all-features / cargo clippy ... -D warnings all pass with and without the vcs feature.
  • make pre-commit (full validation gate) clean before submission.
  • Mutation testing of src/vcs/ added to the quarterly cron in a separate, follow-up PR (out-of-band, not v1).

Acceptance checklist

  • gix integrated behind the vcs-git Cargo feature with the explicit feature list above; umbrella vcs = ["vcs-git"] registered at the workspace root
  • All v1 signals implemented and unit-tested with deterministic synthetic git repos (fixed authors + UNIX timestamps)
  • bca vcs subcommand reproduces the prototype's ranked-list output, plus configurable windows, history mode, merge handling, bot filtering, identity emission, and --as-of
  • VCS fields available via bca metrics/check/report when --metrics vcs is selected
  • POST /vcs endpoint, plus optional repo_path on POST /metrics for AST + VCS in one call
  • vcs_metrics() Python function, plus opt-in vcs=True on analyze()
  • metrics/vcs.md mdBook chapter and updates to recipes/rest-api.md
  • Composite score formula documented in code AND in the book with explicit citations back to vulnerability-correlation.md
  • Metric enum is #[non_exhaustive]
  • All edge cases above handled with at least one test each
  • Defensive-refactor verification (per .claude/rules/testing.md) for any tightening predicates
  • Anchored snapshots per .snapshot-anchor-baseline.txt rules
  • make pre-commit clean (full validation gate)
  • Follow-up issues filed for per-function granularity, change entropy, JIT, bus factor, history trend, persistent cache, and additional VCS backends
  • git-history-risk-rank.sh either deleted or reduced to a documented historical reference
