feat(metrics): change entropy and co-change graph entropy #330

@dekobon

Description

Follow-up to #328.

Add change entropy (Hassan, 2009) and co-change graph entropy (ICEAS 2025) as additional vcs::Stats fields.

Why

Both are well-validated process metrics with strong empirical evidence for defect prediction:

  • Change entropy measures how widely changes are scattered across files within a time period. File-level Pearson correlation with defects reaches 0.54 on Apache projects.
  • Co-change graph entropy (arXiv 2504.18511, 2025) models co-changes as a graph and quantifies co-change scattering. Combined with change entropy it improves AUROC in 82.5% of cases and MCC in 65%, with statistically significant gains on eight Apache projects.
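
For reference, change entropy in Hassan's formulation is the Shannon entropy H = -Σ p_k · log2(p_k), where p_k is the share of a period's changes that touch file k. A minimal sketch of that calculation follows; the function name `change_entropy` and the `counts` slice are illustrative assumptions, not the existing vcs::Stats API.

```rust
/// Shannon entropy (in bits) of a change distribution.
///
/// `counts[k]` is the number of changes to file k inside one window.
/// Hypothetical helper for illustration; not part of the existing crate.
pub fn change_entropy(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        // No changes in the window: treat the distribution as having no entropy.
        return 0.0;
    }
    let total = total as f64;
    counts
        .iter()
        .filter(|&&c| c > 0) // skip untouched files; 0 * log2(0) is taken as 0
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```

Four files changed equally often yield 2 bits (the maximum for four files), while a window in which only one file changes yields 0.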

These signals complement the v1 signal set (commits/churn/authors/ownership), which the two entropy metrics were shown to outperform when combined.

Scope

  • Compute change entropy per file over the long and recent windows.
  • Build a co-change graph during the single history walk (cheap: record file pairs that appear in the same commit).
  • Compute co-change graph entropy per file from that graph.
  • Surface as change_entropy_long, change_entropy_recent, cochange_entropy_long, cochange_entropy_recent.
  • Feed into the composite risk_score as a versioned formula bump (risk_score_version = 2).
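
The graph-building step above could be sketched roughly as follows. This is an illustration under stated assumptions: `CoChangeGraph`, `record_commit`, and `file_entropy` are hypothetical names, and the per-file entropy shown here (Shannon entropy of a file's co-change weights across its neighbours) is one plausible reading, not necessarily the exact definition used in arXiv 2504.18511.

```rust
use std::collections::HashMap;

/// Sparse, undirected co-change graph:
/// file -> (neighbour -> number of commits that touched both).
/// Hypothetical type for illustration only.
#[derive(Default)]
pub struct CoChangeGraph {
    adj: HashMap<String, HashMap<String, u32>>,
}

impl CoChangeGraph {
    /// Record one commit during the history walk: every unordered pair
    /// of distinct files in the commit gains one co-change.
    pub fn record_commit(&mut self, files: &[&str]) {
        for (i, &a) in files.iter().enumerate() {
            for &b in &files[i + 1..] {
                if a == b {
                    continue; // ignore duplicate paths within a commit
                }
                *self.adj.entry(a.to_string()).or_default()
                    .entry(b.to_string()).or_insert(0) += 1;
                *self.adj.entry(b.to_string()).or_default()
                    .entry(a.to_string()).or_insert(0) += 1;
            }
        }
    }

    /// Assumed per-file metric: entropy (bits) of the file's co-change
    /// weights over its neighbours. Files without neighbours score 0.
    pub fn file_entropy(&self, file: &str) -> f64 {
        let neigh = match self.adj.get(file) {
            Some(n) => n,
            None => return 0.0,
        };
        let total: f64 = neigh.values().map(|&w| w as f64).sum();
        neigh
            .values()
            .map(|&w| {
                let p = w as f64 / total;
                -p * p.log2()
            })
            .sum()
    }
}
```

With commits {a, b}, {a, c}, {a, b}, file a co-changes twice with b and once with c, giving roughly 0.918 bits; file b only ever co-changes with a and scores 0.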

Edge cases

  • Files that have changed only in single-file commits have zero co-change entropy by definition; the output must distinguish this from "not computed."
  • A commit touching n files adds O(n²) co-change pairs, so wide commits dominate graph size; cap large initial-import commits (e.g., exclude commits touching >1000 files from the graph).
  • Memory: a sparse adjacency representation is mandatory on large repos.
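
One way to handle the first two edge cases is sketched below. It assumes that "not computed" means the file had no commits in the window, and it takes the 1000-file cap from the example above; the names `MAX_COCHANGE_COMMIT_FILES`, `pair_count`, `contributes_to_graph`, and `cochange_entropy` are hypothetical.

```rust
/// Hypothetical cap, taken from the ">1000 files" example.
pub const MAX_COCHANGE_COMMIT_FILES: usize = 1000;

/// Number of file pairs a commit touching `n` files would add: n(n-1)/2.
pub fn pair_count(n: usize) -> usize {
    n * n.saturating_sub(1) / 2
}

/// Whether a commit should contribute edges to the co-change graph.
/// Single-file commits have no pairs; oversized commits are skipped.
pub fn contributes_to_graph(n_files: usize) -> bool {
    n_files >= 2 && n_files <= MAX_COCHANGE_COMMIT_FILES
}

/// `None` = not computed (assumed here: file unseen in the window).
/// `Some(0.0)` = computed, but the file never co-changed with another file.
pub fn cochange_entropy(seen_in_window: bool, neighbour_weights: &[u32]) -> Option<f64> {
    if !seen_in_window {
        return None;
    }
    let total: f64 = neighbour_weights.iter().map(|&w| w as f64).sum();
    if total == 0.0 {
        return Some(0.0);
    }
    let h = neighbour_weights
        .iter()
        .filter(|&&w| w > 0)
        .map(|&w| {
            let p = w as f64 / total;
            -p * p.log2()
        })
        .sum();
    Some(h)
}
```

A commit at the cap would add 499,500 pairs, which illustrates why both the cap and a sparse representation matter. Serializing `None` as an absent or null field, rather than 0, keeps the two cases apart for downstream consumers.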

Acceptance criteria

  • Both entropy values emitted under vcs.change_entropy_* and vcs.cochange_entropy_*.
  • risk_score_version bumped to 2 with the new formula documented in src/vcs/score.rs and the mdBook chapter.
  • Unit tests on synthetic fixtures with known entropy values.
  • Performance regression test: entropy computation adds <20% to total VCS walk time on a 1k-commit fixture.
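
The performance criterion could be checked with a small timing harness along these lines. This is a sketch only: `best_of` and `relative_overhead` are hypothetical helpers, and taking the minimum over several runs is an assumption made here to dampen scheduler noise, not a requirement stated in this issue.

```rust
use std::time::{Duration, Instant};

/// Best-of-`runs` wall-clock time of `f`; returns zero if `runs` is 0.
pub fn best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .unwrap_or(Duration::ZERO)
}

/// Relative overhead of `with_entropy` over `baseline`, e.g. 0.15 for +15%.
/// Returns infinity when the baseline is zero and cannot be compared against.
pub fn relative_overhead(baseline: Duration, with_entropy: Duration) -> f64 {
    let base = baseline.as_secs_f64();
    if base == 0.0 {
        return f64::INFINITY;
    }
    (with_entropy.as_secs_f64() - base) / base
}
```

A regression test would time the walk over the 1k-commit fixture with and without entropy computation via `best_of`, then assert that `relative_overhead` stays below 0.20.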

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request