Development TODO

Internal tracking for technical debt, known limitations, and maintenance tasks.

For the public feature roadmap, see ROADMAP.md.

Known Limitations

Current limitations that may affect users:

Issue	Location	Priority	Notes
MultiPeriodDiD wild bootstrap not supported	`estimators.py:778-784`	Low	Edge case
`predict()` raises NotImplementedError	`estimators.py:567-588`	Low	Rarely needed

For survey-specific limitations (NotImplementedError paths), see the Current Limitations section of survey-roadmap.md.

Code Quality

Large Module Files

Target: < 1000 lines per module for maintainability. Updated 2026-03-29.

File	Lines	Action
`power.py`	2588	Consider splitting (power analysis + MDE + sample size)
`linalg.py`	2289	Monitor — unified backend, splitting would hurt cohesion
`staggered.py`	2275	Monitor — grew with survey support
`imputation.py`	2009	Monitor
`triple_diff.py`	1921	Monitor
`utils.py`	1902	Monitor
`two_stage.py`	1708	Monitor
`survey.py`	1646	Monitor — grew with Phase 6 features
`continuous_did.py`	1626	Monitor
`honest_did.py`	1511	Acceptable
`sun_abraham.py`	1540	Acceptable
`estimators.py`	1357	Acceptable
`trop_local.py`	1261	Acceptable
`trop_global.py`	1251	Acceptable
`prep.py`	1225	Acceptable
`pretrends.py`	1105	Acceptable
`trop.py`	981	Split done — trop_global.py + trop_local.py
`visualization/`	4172	Subpackage (split across 7 files) — OK

Tech Debt from Code Reviews

Deferred items from PR reviews that were not addressed before merge.

Methodology/Correctness

Issue	Location	PR	Priority
dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE.	`chaisemartin_dhaultfoeuille.py`	#294	Low
dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note.	`chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md`	PR 2	Medium
dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting).	`test_chaisemartin_dhaultfoeuille_parity.py`	#294	Low
CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation)	`staggered.py`	#256	Low
ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels	`imputation.py`	#141	Medium (deferred — only triggers when sparse solver fails)
Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels)	`estimators.py`	#218	Medium
EfficientDiD `control_group="last_cohort"` trims at `last_g - anticipation` but REGISTRY says `t >= last_g`. With `anticipation=0` (default) these are identical. With `anticipation>0`, code is arguably more conservative (excludes anticipation-contaminated periods). Either align REGISTRY with code or change code to `t < last_g` — needs design decision.	`efficient_did.py`	#230	Low

| TripleDifference power: generate_ddd_data is a fixed 2×2×2 cross-sectional DGP — no multi-period or unbalanced-group support. Add a generate_ddd_panel_data for panel DDD power analysis. | prep_dgp.py, power.py | #208 | Low | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | continuous_did.py, efficient_did.py, stacked_did.py | #226 | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — _silverman_bandwidth() uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | efficient_did_covariates.py | — | Low | | TROP: fit() and _fit_global() share ~150 lines of near-identical data setup (panel pivoting, absorbing-state validation, first-treatment detection, effective rank, NaN warnings). Both bootstrap methods also duplicate the stratified resampling loop. Extract shared helpers to eliminate cross-file sync risk. | trop.py, trop_global.py, trop_local.py | — | Low | | StaggeredTripleDifference R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + triplediff. Commit fixtures or generate deterministically. | tests/test_methodology_staggered_triple_diff.py | #245 | Medium | | StaggeredTripleDifference R parity: benchmark only tests no-covariate path (xformla=~1). Add covariate-adjusted scenarios and aggregation SE parity assertions. | benchmarks/R/benchmark_staggered_triplediff.R | #245 | Medium | | StaggeredTripleDifference: per-cohort group-effect SEs include WIF (conservative vs R's wif=NULL). Documented in REGISTRY. Could override mixin for exact R match. | staggered_triple_diff.py | #245 | Low | | HonestDiD Delta^RM: uses naive FLCI instead of paper's ARP conditional/hybrid confidence sets (Sections 3.2.1-3.2.2). ARP infrastructure exists but moment inequality transformation needs calibration. CIs are conservative (wider, valid coverage). | honest_did.py | #248 | Medium | | Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing test_survey_phase6.py covers true BRR at the helper level. | tests/test_replicate_weight_expansion.py | #253 | Low | | WooldridgeDiD: QMLE sandwich uses aweight cluster-robust adjustment (G/(G-1))*(n-1)/(n-k) vs Stata's G/(G-1) only. Conservative (inflates SEs). Add qmle weight type if Stata golden values confirm material difference. | wooldridge.py, linalg.py | #216 | Medium | | WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional weights="cohort_share" parameter to aggregate(). | wooldridge_results.py | #216 | Medium | | WooldridgeDiD: canonical link requirement (W2023 Prop 3.1) not enforced — no warning if user applies wrong method to outcome type. Estimator is consistent regardless, but equivalence with imputation breaks. | wooldridge.py | #216 | Low | | WooldridgeDiD: Stata jwdid golden value tests — add R/Stata reference script and TestReferenceValues class. | tests/test_wooldridge.py | #216 | Medium | | Thread vcov_type (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose cluster=: CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD. Phase 1a added vcov_type to the DifferenceInDifferences inheritance chain only. | multiple | Phase 1a | Medium | | Weighted one-way Bell-McCaffrey (vcov_type="hc2_bm" + weights, no cluster) currently raises NotImplementedError. _compute_bm_dof_from_contrasts builds its hat matrix from the unscaled design via X (X'WX)^{-1} X' W, but solve_ols solves the WLS problem by transforming to X* = sqrt(w) X, so the correct symmetric idempotent residual-maker is M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W). Rederive the Satterthwaite (tr G)^2 / tr(G^2) ratio on the transformed design and add weighted parity tests before lifting the guard. | linalg.py::_compute_bm_dof_from_contrasts, linalg.py::_validate_vcov_args | Phase 1a | Medium | | HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits currently raises NotImplementedError in three places: TwoWayFixedEffects unconditionally; DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"}); MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"}). Within-transformation preserves coefficients and residuals under FWL but not the hat matrix, so the reduced-design h_ii is not the diagonal of the full FE projection and CR2's block adjustment A_g = (I - H_gg)^{-1/2} is likewise wrong on absorbed cluster blocks. Lifting the guard needs HC2/CR2-BM computed from the full absorbed projection (unit/time FE dummies reconstructed internally, or a FE-aware hat-matrix formulation) and a parity harness against a full-dummy OLS run or R fixest/clubSandwich. HC1/CR1 are unaffected by this because they have no leverage term. | twfe.py::fit, estimators.py::DifferenceInDifferences.fit, estimators.py::MultiPeriodDiD.fit | Phase 1a | Medium | | Weighted CR2 Bell-McCaffrey cluster-robust (vcov_type="hc2_bm" + cluster_ids + weights) currently raises NotImplementedError. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | linalg.py::_compute_cr2_bm | Phase 1a | Medium | | Regenerate benchmarks/data/clubsandwich_cr2_golden.json from R (Rscript benchmarks/R/generate_clubsandwich_golden.R). Current JSON has source: python_self_reference as a stability anchor until an authoritative R run. | benchmarks/R/generate_clubsandwich_golden.R | Phase 1a | Medium | | honest_did.py:1907 np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | honest_did.py | #334 | Low | | TROP Rust vs Python bootstrap SE divergence under fixed seed: seed=42 on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (rand crate) while Python uses numpy.random.default_rng; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. @pytest.mark.xfail(strict=True) in tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | trop_global.py, rust/ | follow-up | Medium | | bias_corrected_local_linear: extend golden parity to kernel="triangular" and kernel="uniform" (currently epa-only; all three kernels share kernel_W and the lprobust math, so parity is expected but not separately asserted). | benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py | Phase 1c | Low | | bias_corrected_local_linear: expose vce in {"hc0", "hc1", "hc2", "hc3"} on the public wrapper once R parity goldens exist (currently raises NotImplementedError). The port-level lprobust and lprobust_res already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit hii for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | diff_diff/local_linear.py::bias_corrected_local_linear, benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py | Phase 1c | Medium | | bias_corrected_local_linear: support weights= once survey-design adaptation lands. nprobust's lprobust has no weight argument so there is no parity anchor; derivation needed. | diff_diff/local_linear.py, diff_diff/_nprobust_port.py::lprobust | Phase 1c | Medium | | bias_corrected_local_linear: support multi-eval grid (neval > 1) with cross-covariance (covgrid=TRUE branch of lprobust.R:253-378). Not needed for HAD but useful for multi-dose diagnostics. | diff_diff/_nprobust_port.py::lprobust | Phase 1c | Low | | Clustered-DGP parity: Phase 1c's DGP 4 uses manual h=b=0.3 to sidestep an nprobust-internal singleton-cluster bug in lpbwselect.mse.dpi's pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | benchmarks/R/generate_nprobust_lprobust_golden.R | Phase 1c | Low |

Performance

Issue	Location	PR	Priority
ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization)	`imputation.py`	#141	Low
Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost)	`rust/src/linalg.rs:67`	#115	Low
Unrelated label events (e.g., adding `bug` label) re-trigger CI workflows when `ready-for-ci` is already present; filter `labeled`/`unlabeled` events to only `ready-for-ci` transitions	`.github/workflows/rust-test.yml`, `notebooks.yml`	#269	Low
`bread_inv` as a performance kwarg on `compute_robust_vcov` to avoid re-inverting `(X'WX)` when the caller already has it. Deferred from Phase 1a for scope. HC2 and HC2+BM both need the bread inverse, so a shared hint would save one `np.linalg.solve` per sandwich.	`linalg.py::compute_robust_vcov`	Phase 1a	Low
Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable.	`rust/src/linalg.rs`	Phase 1a	Low
CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design.	`linalg.py::_compute_cr2_bm`	Phase 1a	Low

Testing/Docs

Issue	Location	PR	Priority
R comparison tests spawn separate `Rscript` per test (slow CI)	`tests/test_methodology_twfe.py:294`	#139	Low
CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path	`tests/test_methodology_callaway.py`	#202	Low
~1583 `duplicate object description` Sphinx warnings — restructure `docs/api/*.rst` to avoid duplicate `:members:` + `autosummary` (count grew from ~376 as API surface expanded)	`docs/api/*.rst`	—	Low
Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation	`tests/test_doc_snippets.py`	#239	Low
Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files)	`docs/doc-deps.yaml`	#269	Low
Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from autosummary template	`docs/api/results.rst`, `docs/api/staggered.rst`	—	Medium
`EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs	`diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst`	—	Low
Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed.	`docs/api/_autosummary/`	—	Low
HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle.	`tests/test_methodology_honest_did.py:246`	—	Low

Standard Error Consistency

Different estimators compute SEs differently. Consider unified interface.

Estimator	Default SE Type
DifferenceInDifferences	HC1 or cluster-robust
TwoWayFixedEffects	Always cluster-robust (unit level)
CallawaySantAnna	Simple difference-in-means SE
SyntheticDiD	Bootstrap or placebo-based

Action: Consider adding se_type parameter for consistency across estimators.

Type Annotations

Mypy reports 0 errors. All mixin attr-defined errors resolved via TYPE_CHECKING-guarded method stubs in bootstrap mixin classes.

Deprecated Code

Deprecated parameters still present for backward compatibility:

lambda_reg and zeta in SyntheticDiD (synthetic_did.py)
- Deprecated in favor of zeta_omega/zeta_lambda parameters
- Remove in v4.0.0 (SemVer-safe: public kwarg removal requires a major bump)

Test Coverage

Note: 21 visualization tests are skipped when matplotlib unavailable—this is expected.

Honest DiD Improvements

Enhancements for honest_did.py:

Improved C-LF implementation with direct optimization instead of grid search (current implementation uses simplified FLCI approach with estimation uncertainty adjustment; see honest_did.py:947)
Support for CallawaySantAnnaResults (implemented in honest_did.py:612-653; requires aggregate='event_study' when calling CallawaySantAnna.fit())
Event-study-specific bounds for each post-period
Hybrid inference methods
Simulation-based power analysis for honest bounds

CallawaySantAnna Bootstrap Improvements

Consider aligning p-value computation with R did package (symmetric percentile method)

RuntimeWarnings in Linear Algebra Operations

Apple Silicon M4 BLAS Bug (numpy < 2.3)

Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emitted by np.matmul/@ on Apple Silicon M4 + macOS Sequoia with numpy < 2.3. The warnings appear for matrices with ≥260 rows but do not affect result correctness — coefficients and fitted values are valid (no NaN/Inf), and the design matrices are full rank.

Root cause: Apple's BLAS SME (Scalable Matrix Extension) kernels corrupt the floating-point status register, causing spurious FPE signals. Tracked in numpy#28687 and numpy#29820. Fixed in numpy ≥ 2.3 via PR #29223.

Not reproducible on M3, Intel, or Linux.

linalg.py:162 - Warnings in fitted value computation (X @ coefficients)
- Caused by M4 BLAS bug, not extreme coefficient values
- Seen in test_prep.py during treatment effect recovery tests (n > 260)
triple_diff.py:307,323 - Warnings in propensity score computation
- Occurs in IPW and DR estimation methods with covariates
- Related to logistic regression overflow in edge cases (separate from BLAS bug)
Long-term: Revert to @ operator when numpy ≥ 2.3 becomes the minimum supported version.

Feature Gaps (from R `did` package comparison)

Features in R's did package that block porting additional tests:

Feature	R tests blocked	Priority	Status
Calendar time aggregation	1 test in test-att_gt.R	Low

Performance Optimizations

Potential future optimizations:

JIT compilation for bootstrap loops (numba)
Sparse matrix handling for large fixed effects

QR+SVD Redundancy in Rank Detection

Background: The current solve_ols() implementation performs both QR (for rank detection) and SVD (for solving) decompositions on rank-deficient matrices. This is technically redundant since SVD can determine rank directly.

Current approach (R-style, chosen for robustness):

QR with pivoting for rank detection (_detect_rank_deficiency())
scipy's lstsq with 'gelsd' driver (SVD-based) for solving

Why we use QR for rank detection:

QR with pivoting provides the canonical ordering of linearly dependent columns
R's lm() uses this approach for consistent dropped-column reporting
Ensures consistent column dropping across runs (SVD column selection can vary)

Potential optimization (future work):

Skip QR when rank_deficient_action="silent" since we don't need column names
Use SVD rank directly in the Rust backend (already implemented)
Add skip_rank_check parameter for hot paths where matrix is known to be full-rank (implemented in v2.2.0)

Priority: Low - the QR overhead is minimal compared to SVD solve, and correctness is more important than micro-optimization.

Incomplete `check_finite` Bypass

Background: The solve_ols() function accepts a check_finite=False parameter intended to skip NaN/Inf validation for performance in hot paths where data is known to be clean.

Current limitation: When check_finite=False, our explicit validation is skipped, but scipy's internal QR decomposition in _detect_rank_deficiency() still validates finite values. This means callers cannot fully bypass all finite checks.

Impact: Minimal - the scipy check is fast and only affects edge cases where users explicitly pass check_finite=False with non-finite data (which would be a bug in their code anyway).

Potential fix (future work):

Pass check_finite=False through to scipy's QR call (requires scipy >= 1.9.0)
Or skip _detect_rank_deficiency() entirely when check_finite=False and _skip_rank_check=True

Priority: Low - this is an edge case optimization that doesn't affect correctness.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Development TODO

Known Limitations

Code Quality

Large Module Files

Tech Debt from Code Reviews

Methodology/Correctness

Performance

Testing/Docs

Standard Error Consistency

Type Annotations

Deprecated Code

Test Coverage

Honest DiD Improvements

CallawaySantAnna Bootstrap Improvements

RuntimeWarnings in Linear Algebra Operations

Apple Silicon M4 BLAS Bug (numpy < 2.3)

Feature Gaps (from R `did` package comparison)

Performance Optimizations

QR+SVD Redundancy in Rank Detection

Incomplete `check_finite` Bypass

FilesExpand file tree

TODO.md

Latest commit

History

TODO.md

File metadata and controls

Development TODO

Known Limitations

Code Quality

Large Module Files

Tech Debt from Code Reviews

Methodology/Correctness

Performance

Testing/Docs

Standard Error Consistency

Type Annotations

Deprecated Code

Test Coverage

Honest DiD Improvements

CallawaySantAnna Bootstrap Improvements

RuntimeWarnings in Linear Algebra Operations

Apple Silicon M4 BLAS Bug (numpy < 2.3)

Feature Gaps (from R did package comparison)

Performance Optimizations

QR+SVD Redundancy in Rank Detection

Incomplete check_finite Bypass

Feature Gaps (from R `did` package comparison)

Incomplete `check_finite` Bypass