Follow-up to #328.
Add change entropy (Hassan, 2009) and co-change graph entropy (ICEAS 2025) as additional vcs::Stats fields.
Why
Both are well-validated process metrics with strong empirical evidence for defect prediction:
- Change entropy measures how distributed changes are across files in a commit. File-level Pearson correlation with defects reaches 0.54 on Apache projects.
- Co-change graph entropy (arXiv 2504.18511, 2025) models co-changes as a graph and quantifies co-change scattering. Combined with change entropy it improves AUROC in 82.5% of cases and MCC in 65%, with statistically significant gains on eight Apache projects.
These signals complement the v1 signal set (commits/churn/authors/ownership), which they were shown to outperform when combined.
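For reference, a minimal sketch of the window-level quantity behind change entropy: the Shannon entropy of how changes are distributed across files. The function name `change_entropy` and the input shape (one change count per file for a single window) are assumptions for illustration, not existing code. How the window value is attributed to individual files is not shown here.

```rust
/// Shannon entropy (in bits) of how changes are distributed across files.
/// `changes_per_file[k]` is the number of changes to file k within one window.
/// Hypothetical helper for illustration; not part of the existing vcs module.
pub fn change_entropy(changes_per_file: &[u64]) -> f64 {
    let total: u64 = changes_per_file.iter().sum();
    if total == 0 {
        return 0.0; // no changes in the window, so nothing is distributed
    }
    let total = total as f64;
    changes_per_file
        .iter()
        .filter(|&&c| c > 0) // a file with p = 0 contributes 0 by convention
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```

Four files changed equally often give 2 bits; a window where only one file changed gives 0.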
Scope
- Compute change entropy per file over the long and recent windows.
- Build a co-change graph during the single history walk (cheap: record file pairs that appear in the same commit).
- Compute co-change graph entropy per file from that graph.
- Surface as change_entropy_long, change_entropy_recent, cochange_entropy_long, cochange_entropy_recent.
- Feed into the composite risk_score as a versioned formula bump (risk_score_version = 2).
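A rough sketch of the graph steps above, under stated assumptions: commits arrive as lists of file paths, the graph is a sparse weighted adjacency map, and the per-file value is the entropy of a file's edge weights across its co-change partners. That per-file reading is an assumption; the paper's exact definition may differ and should be checked before implementing. The names `CoChangeGraph`, `build_cochange_graph`, and `cochange_entropy` are hypothetical.

```rust
use std::collections::HashMap;

/// Sparse weighted adjacency: file -> (partner file -> number of shared commits).
/// Hypothetical type for illustration.
pub type CoChangeGraph = HashMap<String, HashMap<String, u32>>;

/// Record every unordered file pair of each commit in one pass over history.
pub fn build_cochange_graph(commits: &[Vec<&str>]) -> CoChangeGraph {
    let mut graph = CoChangeGraph::new();
    for files in commits {
        for (i, a) in files.iter().enumerate() {
            for b in &files[i + 1..] {
                if a == b {
                    continue; // ignore a path listed twice in one commit
                }
                // Store the edge in both directions so per-file lookups are cheap.
                *graph.entry(a.to_string()).or_default().entry(b.to_string()).or_default() += 1;
                *graph.entry(b.to_string()).or_default().entry(a.to_string()).or_default() += 1;
            }
        }
    }
    graph
}

/// Assumed per-file metric: entropy (bits) of the file's edge-weight
/// distribution over its co-change partners. Files with no partners get 0.
pub fn cochange_entropy(graph: &CoChangeGraph, file: &str) -> f64 {
    let Some(partners) = graph.get(file) else {
        return 0.0;
    };
    let total: f64 = partners.values().map(|&w| f64::from(w)).sum();
    partners
        .values()
        .map(|&w| {
            let p = f64::from(w) / total;
            -p * p.log2()
        })
        .sum()
}
```

Keeping only pairs that actually co-occur means single-file commits add nothing to the map, which is what keeps the pass cheap on typical histories.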
Edge cases
- Files that have changed only in single-file commits have zero co-change entropy by definition; distinguish this from "not computed."
- A commit touching n files adds O(n²) edges to the co-change graph; cap large initial-import commits (e.g., exclude commits touching >1000 files from the graph).
- Memory: a sparse adjacency representation is mandatory on large repos.
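One way the first two edge cases could be handled, as a sketch only. `MAX_COMMIT_WIDTH` reuses the example threshold of 1000 files, and `Option<f64>` is an assumed way to separate "not computed" (`None`) from a true zero (`Some(0.0)`); all names here are hypothetical.

```rust
/// Hypothetical cap, taken from the example threshold of 1000 files.
pub const MAX_COMMIT_WIDTH: usize = 1000;

/// Whether a commit should contribute edges to the co-change graph.
/// Single-file commits have no pairs; commits wider than the cap are excluded.
pub fn contributes_edges(files_in_commit: usize) -> bool {
    (2..=MAX_COMMIT_WIDTH).contains(&files_in_commit)
}

/// Number of unordered file pairs a commit of width n would add.
pub fn pair_count(n: usize) -> usize {
    n * n.saturating_sub(1) / 2
}

/// Returns None when the metric was not computed for this file,
/// Some(0.0) when it was computed and the file has no co-change partners.
pub fn cochange_entropy_opt(analyzed: bool, partner_weights: &[u32]) -> Option<f64> {
    if !analyzed {
        return None;
    }
    let total: f64 = partner_weights.iter().map(|&w| f64::from(w)).sum();
    if total == 0.0 {
        return Some(0.0); // only ever changed alone: zero by definition
    }
    let h: f64 = partner_weights
        .iter()
        .filter(|&&w| w > 0)
        .map(|&w| {
            let p = f64::from(w) / total;
            -p * p.log2()
        })
        .sum();
    Some(h)
}
```

At the cap, a single commit still adds `pair_count(1000)` = 499,500 pairs, so the threshold may deserve tuning on real repositories.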
Acceptance criteria
- vcs.change_entropy_* and vcs.cochange_entropy_*.
- risk_score_version bumped to 2 with the new formula documented in src/vcs/score.rs and the mdBook chapter.