Commit 7e127fb

igerber and claude committed
Resolve merge conflict, match R colMeans convention in panel IPW/DR M2 gradients
Resolve TODO.md merge conflict (keep our resolved items from Phase 7).

Panel IPW/DR PS nuisance gradients (M2, M2_dr): change np.mean() over control rows to np.sum() / (n_t + n_c), matching R's colMeans() over ALL n observations (zero for treated). This makes the panel paths consistent with the RC paths' R-convention refactoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 parents cb3f815 + eacd38c commit 7e127fb
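
The normalization change described in the commit message can be illustrated with a small NumPy sketch. This is not the repository's code: the function names, the `scores_control` array, and the toy numbers are hypothetical, and the only thing taken from the commit message is the switch from `np.mean()` over control rows to `np.sum() / (n_t + n_c)`.

```python
import numpy as np


def m2_control_mean(scores_control: np.ndarray) -> np.ndarray:
    """Old convention: average per-row contributions over control rows only."""
    return np.mean(scores_control, axis=0)


def m2_colmeans_all(scores_control: np.ndarray, n_t: int) -> np.ndarray:
    """New convention: sum over control rows, divide by all n = n_t + n_c rows."""
    n_c = scores_control.shape[0]
    return np.sum(scores_control, axis=0) / (n_t + n_c)


# Hypothetical per-row gradient contributions: 3 control units, 2 covariates.
scores_control = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
n_t = 1  # hypothetical number of treated units

old = m2_control_mean(scores_control)
new = m2_colmeans_all(scores_control, n_t)

# The new convention equals a column mean over the full n-row matrix in which
# treated rows contribute zeros, which is what colMeans() over all rows gives.
full = np.vstack([np.zeros((n_t, scores_control.shape[1])), scores_control])
assert np.allclose(new, full.mean(axis=0))

# The two conventions differ by the factor n_c / (n_t + n_c).
n_c = scores_control.shape[0]
assert np.allclose(new, old * n_c / (n_t + n_c))
```

Under these assumptions the two conventions agree only when there are no treated rows, so the change rescales the gradient by the control share of the sample.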

15 files changed

Lines changed: 2133 additions & 30 deletions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -7,6 +7,34 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [2.7.6] - 2026-03-28
+
+### Added
+- **AI practitioner guardrails** based on Baker et al. (2025) "Difference-in-Differences Designs: A Practitioner's Guide"
+  - `practitioner.py` module with 8-step workflow enforcement for AI agents
+  - Estimator-specific handlers ensuring correct diagnostic ordering (pre-trends before estimation, Bacon decomposition before estimator selection)
+  - `docs/llms.txt`, `docs/llms-practitioner.txt`, `docs/llms-full.txt` for AI agent discoverability
+  - Evaluation rubric (`docs/practitioner-guide-evaluation.md`) with correctness-aware scoring
+- **Survey Phase 6: Advanced features**
+  - Survey-aware bootstrap for all 8 bootstrap-using estimators (PSU-level multiplier for CS/Imputation/TwoStage/Continuous/Efficient; Rao-Wu rescaled for SA/SyntheticDiD/TROP)
+  - Replicate weight variance estimation (BRR, Fay's BRR, JK1, JKn) for OLS-based and IF-based estimators
+  - Per-coefficient DEFF diagnostics comparing survey vs SRS variance
+  - Subpopulation analysis via `SurveyDesign.subpopulation()` preserving full design structure
+  - CS analytical expansion: strata/PSU/FPC for aggregated SEs via `compute_survey_if_variance()`
+  - TROP cross-classified pseudo-strata for survey-aware bootstrap
+
+### Changed
+- Estimator-specific guidance for parallel trends tests and placebo checks (no shared templates)
+- SDiD and TROP split into separate decision tree branches in practitioner workflow
+
+### Fixed
+- Fix replicate weight df calculation using pivoted QR rank with R-compatible tolerance
+- Fix replicate IF variance score scaling for EfficientDiD, TripleDiff, ContinuousDiD
+- Fix panel-to-unit replicate weight propagation and normalization
+- Fix CS zero-mass return type and vectorized guard for survey paths
+- Fix `solve_logit` effective-sample validation for zero-weight designs
+- Fix subpopulation mask validation and EfficientDiD bootstrap guard
+
 ## [2.7.5] - 2026-03-23
 
 ### Added
@@ -978,6 +1006,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `to_dict()` and `to_dataframe()` export methods
 - `is_significant` and `significance_stars` properties
 
+[2.7.6]: https://github.com/igerber/diff-diff/compare/v2.7.5...v2.7.6
 [2.7.5]: https://github.com/igerber/diff-diff/compare/v2.7.4...v2.7.5
 [2.7.4]: https://github.com/igerber/diff-diff/compare/v2.7.3...v2.7.4
 [2.7.3]: https://github.com/igerber/diff-diff/compare/v2.7.2...v2.7.3

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -138,6 +138,7 @@ category (`Methodology/Correctness`, `Performance`, or `Testing/Docs`):
 | `CONTRIBUTING.md` | Documentation requirements, test writing guidelines |
 | `.claude/commands/dev-checklists.md` | Checklists for params, methodology, warnings, reviews, bugs (run `/dev-checklists`) |
 | `.claude/memory.md` | Debugging patterns, tolerances, API conventions (git-tracked) |
+| `docs/llms-practitioner.txt` | Baker et al. (2025) 8-step practitioner workflow for AI agents |
 | `docs/performance-plan.md` | Performance optimization details |
 | `docs/benchmarks.rst` | Validation results vs R |

README.md

Lines changed: 8 additions & 0 deletions
@@ -67,6 +67,14 @@ Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
 ======================================================================
 ```
 
+## For AI Agents
+
+If you are an AI agent or LLM using this library, read [`docs/llms.txt`](docs/llms.txt) for a concise API reference with an 8-step practitioner workflow (based on Baker et al. 2025). The workflow ensures rigorous DiD analysis — not just calling `fit()`, but testing assumptions, running sensitivity analysis, and checking robustness.
+
+After estimation, call `practitioner_next_steps(results)` for context-aware guidance on remaining diagnostic steps.
+
+Detailed guide: [`docs/llms-practitioner.txt`](docs/llms-practitioner.txt)
+
 ## Features
 
 - **sklearn-like API**: Familiar `fit()` interface with `get_params()` and `set_params()`

TODO.md

Lines changed: 22 additions & 20 deletions
@@ -19,25 +19,28 @@ Current limitations that may affect users:
 
 ### Large Module Files
 
-Target: < 1000 lines per module for maintainability.
+Target: < 1000 lines per module for maintainability. Updated 2026-03-29.
 
 | File | Lines | Action |
 |------|-------|--------|
-| `trop.py` | 862 | Split done — trop_global.py (990), trop_local.py (1000) |
-| `utils.py` | 1838 | Monitor |
-| `staggered.py` | 1785 | Monitor |
-| `imputation.py` | 1756 | Monitor |
-| `visualization.py` | 1727 | Monitor — growing but cohesive |
-| `linalg.py` | 1727 | Monitor — unified backend, splitting would hurt cohesion |
-| `triple_diff.py` | 1581 | Monitor |
+| `power.py` | 2588 | Consider splitting (power analysis + MDE + sample size) |
+| `linalg.py` | 2289 | Monitor — unified backend, splitting would hurt cohesion |
+| `staggered.py` | 2275 | Monitor — grew with survey support |
+| `imputation.py` | 2009 | Monitor |
+| `triple_diff.py` | 1921 | Monitor |
+| `utils.py` | 1902 | Monitor |
+| `two_stage.py` | 1708 | Monitor |
+| `survey.py` | 1646 | Monitor — grew with Phase 6 features |
+| `continuous_did.py` | 1626 | Monitor |
 | `honest_did.py` | 1511 | Acceptable |
-| `two_stage.py` | 1451 | Acceptable |
-| `power.py` | 1350 | Acceptable |
-| `prep.py` | 1242 | Acceptable |
-| `sun_abraham.py` | 1162 | Acceptable |
-| `continuous_did.py` | 1155 | Acceptable |
-| `estimators.py` | 1147 | Acceptable |
-| `pretrends.py` | 1104 | Acceptable |
+| `sun_abraham.py` | 1540 | Acceptable |
+| `estimators.py` | 1357 | Acceptable |
+| `trop_local.py` | 1261 | Acceptable |
+| `trop_global.py` | 1251 | Acceptable |
+| `prep.py` | 1225 | Acceptable |
+| `pretrends.py` | 1105 | Acceptable |
+| `trop.py` | 981 | Split done — trop_global.py + trop_local.py |
+| `visualization/` | 4172 | Subpackage (split across 7 files) — OK |
 
 ---
 
@@ -49,8 +52,7 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
-| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails; fixing requires sparse least-squares alternatives) |
-| EfficientDiD: API docs / tutorial page for new public estimator | `docs/` | #192 | Medium |
+| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) |
 | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium |
 | Replicate-weight survey df — **Resolved**. `df_survey = rank(replicate_weights) - 1` matching R's `survey::degf()`. For IF paths, `n_valid - 1` when dropped replicates reduce effective count. | `survey.py` | #238 | Resolved |
 | CallawaySantAnna survey: strata/PSU/FPC — **Resolved**. Aggregated SEs (overall, event study, group) use `compute_survey_if_variance()`. Bootstrap uses PSU-level multiplier weights. | `staggered.py` | #237 | Resolved |
@@ -75,11 +77,11 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
+| Plotly renderers silently ignore styling kwargs (marker, markersize, linewidth, capsize, ci_linewidth) that the matplotlib backend honors; thread them through or reject when `backend="plotly"` | `visualization/_event_study.py`, `_diagnostic.py`, `_power.py` | #222 | Medium |
 | R comparison tests spawn separate `Rscript` per test (slow CI) | `tests/test_methodology_twfe.py:294` | #139 | Low |
 | CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
-| ~376 `duplicate object description` Sphinx warnings — caused by autodoc `:members:` on dataclass attributes within manual API pages (not from autosummary stubs); fix requires restructuring `docs/api/*.rst` pages to avoid documenting the same attribute via both `:members:` and inline `autosummary` tables | `docs/api/*.rst` || Low |
-| Plotly renderers silently ignore styling kwargs (marker, markersize, linewidth, capsize, ci_linewidth) that the matplotlib backend honors; thread them through or reject when `backend="plotly"` | `visualization/_event_study.py`, `_diagnostic.py`, `_power.py` | #222 | Medium |
-| Survey bootstrap test coverage — **Resolved**. Added FPC census zero-variance, single-PSU NaN, full-design bootstrap for CS/ContinuousDiD/EfficientDiD, and TROP Rao-Wu vs block bootstrap equivalence tests. | `tests/test_survey_phase*.py` || Resolved |
+| ~376 `duplicate object description` Sphinx warnings — restructure `docs/api/*.rst` to avoid duplicate `:members:` + `autosummary` | `docs/api/*.rst` || Low |
+| Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation | `tests/test_doc_snippets.py` | #239 | Low |
 
 ---
 
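
The resolved replicate-weight entry in the TODO.md table above states `df_survey = rank(replicate_weights) - 1`. A minimal sketch of that rule follows, under loud assumptions: `replicate_df` is a hypothetical helper, not the library's function, and it uses NumPy's SVD-based `matrix_rank` as a stand-in for the pivoted-QR rank mentioned in the changelog. The tolerance value is illustrative only and is not claimed to match R's.

```python
import numpy as np


def replicate_df(replicate_weights: np.ndarray, tol: float = 1e-8) -> int:
    """Degrees of freedom as (numerical rank of the replicate-weight matrix) - 1.

    Hypothetical helper: uses SVD-based rank, not pivoted QR, and an
    illustrative absolute tolerance on the singular values.
    """
    rank = int(np.linalg.matrix_rank(replicate_weights, tol=tol))
    return rank - 1


# 4 observations x 3 replicates; the third column is the sum of the first two,
# so the matrix has rank 2.
collinear = np.array(
    [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 2.0],
     [2.0, 0.0, 2.0]]
)

# 4 observations x 3 linearly independent replicates, so the rank is 3.
full_rank = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
)

df_collinear = replicate_df(collinear)
df_full = replicate_df(full_rank)
```

Using the rank rather than the raw number of replicate columns means linearly dependent replicates do not inflate the degrees of freedom.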

diff_diff/__init__.py

Lines changed: 11 additions & 1 deletion
@@ -3,6 +3,13 @@
 
 This library provides sklearn-like estimators for causal inference
 using the difference-in-differences methodology.
+
+For rigorous analysis, follow the 8-step practitioner workflow in
+docs/llms-practitioner.txt (based on Baker et al. 2025). After
+estimation, call ``practitioner_next_steps(results)`` for context-aware
+guidance on remaining diagnostic steps.
+
+AI agent reference: docs/llms.txt
 """
 
 # Import backend detection from dedicated module (avoids circular imports)
@@ -168,6 +175,7 @@
     plot_staircase,
     plot_synth_weights,
 )
+from diff_diff.practitioner import practitioner_next_steps
 from diff_diff.datasets import (
     clear_cache,
     list_datasets,
@@ -193,7 +201,7 @@
 Bacon = BaconDecomposition
 EDiD = EfficientDiD
 
-__version__ = "2.7.5"
+__version__ = "2.7.6"
 __all__ = [
     # Estimators
     "DifferenceInDifferences",
@@ -344,4 +352,6 @@
     "load_dataset",
     "list_datasets",
     "clear_cache",
+    # Practitioner guidance
+    "practitioner_next_steps",
 ]
