This document outlines the feature roadmap for diff-diff, organized as current state, queued work, candidates under consideration, and longer-term directions.
For past changes and release history, see CHANGELOG.md.
diff-diff is a production Python library for difference-in-differences causal inference with sklearn-like estimators and statsmodels-style output. It has feature parity with the standard R DiD ecosystem for core analysis, plus survey-design support that is not currently available in any other Python or R package.
- Core: Basic DiD, TWFE, MultiPeriod event study
- Heterogeneity-robust: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess Imputation (2024), Two-Stage DiD (Gardner 2022), Stacked DiD (Wing et al. 2024)
- Specialized: Synthetic DiD (Arkhangelsky et al. 2021), Triple Difference, Staggered Triple Difference (Ortiz-Villavicencio & Sant'Anna 2025), Continuous DiD (Callaway, Goodman-Bacon & Sant'Anna 2024), TROP
- Efficient: EfficientDiD (Chen, Sant'Anna & Xie 2025) - attains the semiparametric efficiency bound on the no-covariate path; offers an optional doubly-robust covariate path (sieve-based propensity ratios plus linear OLS outcome regression) that is DR-consistent but does not generically attain the bound
- Nonlinear: WooldridgeDiD / ETWFE (Wooldridge 2023, 2025) - saturated OLS, logit, Poisson QMLE with ASF-based ATT
- Reversible treatment: ChaisemartinDHaultfoeuille (de Chaisemartin & D'Haultfœuille AER 2020 + NBER WP 29873) - the only estimator in the library for non-absorbing (on/off) treatments, with full dynamic event study, covariates, group-specific trends, non-binary treatment, HonestDiD integration, and survey support
- Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap, placebo-based variance, jackknife (SyntheticDiD)
- Parallel trends tests, placebo tests, Goodman-Bacon decomposition, TWFE decomposition diagnostic
- Honest DiD sensitivity analysis (Rambachan & Roth 2023), pre-trends power analysis (Roth 2022)
- Power analysis, MDE, and sample-size tools (analytical + simulation), including survey-aware variants
- EPV diagnostics for propensity score estimation
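As a rough illustration of the analytical power and MDE tooling listed above, the sketch below computes a two-sided minimum detectable effect from a standard error and inflates it by a design effect for the survey-aware case. It is a standalone example using only the Python standard library; the function name and the `deff` parameter are assumptions made for illustration and are not diff-diff's API.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(se, alpha=0.05, power=0.80, deff=1.0):
    """Two-sided analytical MDE for an estimate with standard error `se`.

    `deff` is a design effect (hypothetical parameter for this sketch);
    the standard error is inflated by sqrt(deff).
    """
    if se <= 0 or deff <= 0:
        raise ValueError("se and deff must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * se * sqrt(deff)


mde_srs = minimum_detectable_effect(se=0.5)
mde_survey = minimum_detectable_effect(se=0.5, deff=2.0)
print(f"MDE (simple random sample): {mde_srs:.3f}")
print(f"MDE (design effect 2.0):    {mde_survey:.3f}")
```

Under these assumptions, a design effect of 2.0 raises the MDE by a factor of sqrt(2), which is why power calculations that ignore the survey design can be optimistic.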
SurveyDesign with strata, PSU, FPC, weight types (pweight/fweight/aweight), lonely PSU handling. All 16 estimators accept survey_design (15 inference-level + BaconDecomposition diagnostic):
- TSL variance (Taylor Series Linearization) with strata + PSU + FPC
- Replicate weights: BRR, Fay's BRR, JK1, JKn, SDR - 12 of 16 estimators
- Survey-aware bootstrap: PSU-level multiplier (IF-based) and Rao-Wu rescaled (resampling-based)
- Diagnostics: per-coefficient DEFF, subpopulation analysis, weight trimming, CV on estimates
- Repeated cross-sections: `CallawaySantAnna(panel=False)` for BRFSS, ACS, CPS
- Microdata-to-panel bridge: `aggregate_survey()` helper with design-based precision
- Survey-aware power analysis via `SurveyPowerConfig`
- R cross-validation: tests against R's `survey` package using NHANES, RECS, and API datasets
See Survey Design Support for the compatibility matrix and survey-roadmap.md for the technical reference.
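To make the TSL variance item above concrete, here is a minimal NumPy sketch of Taylor-series-linearized variance for a weighted mean under a stratified, clustered design with an optional finite-population correction. It illustrates the general technique only; the function name, argument layout, and the choice to raise on a lonely PSU are assumptions for this example and do not describe how diff-diff implements it.

```python
import numpy as np


def tsl_variance_of_mean(y, w, strata, psu, fpc=None):
    """Taylor-linearized variance of the weighted mean sum(w*y)/sum(w).

    y, w, strata, psu are 1-D arrays of equal length. `fpc` optionally maps
    a stratum label to its sampling fraction n_h / N_h (assumed layout).
    """
    y, w = np.asarray(y, float), np.asarray(w, float)
    strata, psu = np.asarray(strata), np.asarray(psu)
    mean = np.sum(w * y) / np.sum(w)
    z = w * (y - mean) / np.sum(w)  # linearized (influence) values
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        # Sum the linearized values within each PSU of stratum h.
        totals = np.array(
            [z[in_h & (psu == j)].sum() for j in np.unique(psu[in_h])]
        )
        n_h = totals.size
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU (lonely PSU)")
        f_h = 0.0 if fpc is None else fpc[h]
        # Between-PSU variability within the stratum, scaled by n_h/(n_h-1).
        var += (1.0 - f_h) * n_h / (n_h - 1) * np.sum((totals - totals.mean()) ** 2)
    return mean, var


y = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0]
w = [1.0, 1.0, 2.0, 2.0, 1.5, 1.5]
strata = ["a", "a", "a", "b", "b", "b"]
psu = [1, 1, 2, 3, 4, 4]
mean, var = tsl_variance_of_mean(y, w, strata, psu)
print(f"weighted mean = {mean:.3f}, TSL SE = {var ** 0.5:.3f}")
```

With a single stratum, one observation per PSU, and equal weights, this reduces to the familiar sample variance divided by n, which is a convenient sanity check.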
- Optional Rust backend for accelerated computation
- Label-gated CI (tests run only when `ready-for-ci` label is added); standalone CI Gate workflow
- Documentation dependency map (`docs/doc-deps.yaml`) with `/docs-impact` skill
- AI practitioner guardrails based on Baker et al. (2025) 8-step workflow
- Runtime-accessible LLM guides via `get_llm_guide(...)`, bundled in the wheel
- JOSS paper materials (`paper.md`, `paper.bib`)
Major landings since the prior roadmap revision. See CHANGELOG.md for the full history.
- `BusinessReport` and `DiagnosticReport` - practitioner-ready output layer. Plain-English stakeholder summaries + unified diagnostic runner with a stable AI-legible `to_dict()` schema. `BusinessReport` auto-constructs `DiagnosticReport` by default so summaries mention pre-trends, robustness, and design-effect findings in one call. Estimator-native validation surfaces are routed through: SyntheticDiD uses `pre_treatment_fit` / `in_time_placebo` / `sensitivity_to_zeta_omega`; EfficientDiD uses its native `hausman_pretest`; TROP exposes factor-model fit metrics. See `docs/methodology/REPORTING.md` for methodology deviations including no-traffic-light gates, pre-trends verdict thresholds, and power-aware phrasing.
- ChaisemartinDHaultfoeuille (dCDH) - full feature set: `DID_M` contemporaneous-switch, multi-horizon `DID_l` event study, analytical SE, multiplier bootstrap, TWFE decomposition diagnostic, dynamic placebos, normalized estimator, cost-benefit aggregate, sup-t bands, covariate adjustment (DID^X), group-specific linear trends (DID^{fd}), state-set-specific trends, heterogeneity testing, non-binary treatment, HonestDiD integration, and survey support (TSL + pweight).
- SyntheticDiD jackknife variance (`variance_method='jackknife'`) with survey-weighted jackknife.
- SyntheticDiD validation diagnostics.
- Survey support completion - all 16 estimators accept `survey_design`; `aggregate_survey()` microdata-to-panel bridge with `second_stage_weights` parameter; `conditional_pt` DGP parameter for conditional-PT scenarios.
- Survey-aware power analysis via `SurveyPowerConfig`.
- Practitioner onboarding - Brand Awareness Survey DiD tutorial, Geo-Experiment Analysis tutorial, practitioner decision tree, practitioner getting-started guide, README "For Data Scientists" section.
- Survey academic grounding - `survey-theory.md` methodology document, research-grade survey DGP, expanded R-validation across additional estimators, flat-weight-vs-design-based comparison tutorial.
- WooldridgeDiD / ETWFE estimator (Wooldridge 2023, 2025).
- Staggered Triple Difference (Ortiz-Villavicencio & Sant'Anna 2025).
- LLM guide bundling - `get_llm_guide()` exposes `llms.txt`, `llms-full.txt`, and `llms-practitioner.txt` at runtime.
- JOSS paper materials and CONTRIBUTORS.md.
- Python 3.14 support; standalone CI Gate workflow.
Queued work, ordered by expected leverage. Each item is its own PR. Ordering is priority-sequenced, not time-committed.
- Context-aware `practitioner_next_steps()`. Substitutes actual column names from fitted results instead of generic placeholders, so next-step guidance is executable rather than illustrative. (Standalone follow-up to the `BusinessReport` / `DiagnosticReport` landing above; tracked under the AI-Agent Track too.)
- dCDH comprehensive tutorial. One notebook covering reversible treatment, dynamic event study, covariates, trends, HonestDiD on placebos, and survey. Favara-Imbs (2015) banking-deregulation replication as the headline application.
- BRFSS repeated-cross-section tutorial. State-policy DiD replication using `CallawaySantAnna(panel=False)` with design-based SEs and HonestDiD sensitivity. Targets the highest-demand survey-DiD audience segment.
- Marketing Campaign Lift tutorial (CallawaySantAnna, staggered geo rollout).
- Pricing / Promotion Impact tutorial (ContinuousDiD dose-response).
- Two-phase sampling + multi-stage cluster R-validation tests. Extend existing survey cross-validation to NHANES two-phase design and MICS/DHS/NCVS multi-stage cluster. Closes a practitioner-design gap and firms up the design-based variance claim.
Research-informed candidates. Each has a rationale, a tractability note, and a commit criterion. Papers are academic references, so citation is fine.
- DiD with no untreated group (de Chaisemartin, Ciccia, D'Haultfœuille & Knau, arXiv:2405.04465, 2024, plus continuous-treatment-with-no-stayers companion, AEA P&P 2024). New estimator for designs where treatment is universal with heterogeneous dose (the inverse of the few-treated-many-donors case). Uses quasi-untreated units as controls. No existing diff-diff estimator handles this. Tractability: medium; closed-form identification. Status (2026-04-18): methodology plan approved; paper review at `docs/methodology/papers/dechaisemartin-2026-review.md`, REGISTRY stub at `docs/methodology/REGISTRY.md#heterogeneousadoptiondid`, class name `HeterogeneousAdoptionDiD`, implementation queued across 7 phased PRs. Commit when: methodology plan drafted and validated against the paper's Pierce and Schott (2016) PNTR manufacturing-employment replication (Figure 2).
- Nonparametric / flexible outcome regression for `EfficientDiD` DR covariate path (Chen, Sant'Anna & Xie, arXiv:2506.17729, 2025, Section 4). The shipped staggered `EfficientDiD` uses a linear OLS outcome regression in its doubly-robust covariate path; that preserves DR consistency but does not generically attain the semiparametric efficiency bound unless the conditional mean is linear in the covariates. Replacing the OLS outcome regression with sieve / kernel / ML nuisance estimation (as the paper's Section 4 allows) would close the efficiency gap on the covariate path. Tractability: medium; the hook points are in `diff_diff/efficient_did_covariates.py`. Commit when: a paper-review synthesis is written, with an implementation plan for the nonparametric OR that preserves the existing DR consistency guarantees and survey-weighted variance surface.
- Distributional DiD for staggered timing (Ciaccio, arXiv:2408.01208, 2024). New estimator extending Callaway-Li QTT to staggered adoption. `CallawaySantAnna` currently gives mean ATT only; this unlocks quantile effects. Tractability: medium. Commit when: a health-econ or public-health user reports need for quantile effects in a repeated-cross-section design.
- Local Projections DiD (Dube, Girardi, Jordà & Taylor, JAE 2025). New estimator with flexible impulse-response and robustness to dynamic misspecification; natural for anticipation-prone settings. Tractability: well-scoped. Commit when: a methodology review confirms the dynamic variant's variance derivation fits our SE helpers.
- Few-treated-units inference option (Alvarez, Ferman & Wüthrich, arXiv:2504.19841, 2025). `inference=` option covering t(G-1) corrections, randomization inference, and Ferman-Pinto-style permutation tests. Current SE paths assume large-G asymptotics. Tractability: medium. Commit when: a user reports sparse-treatment pain.
- Riesz-representation sensitivity (Bach et al., arXiv:2510.09064, 2025). Confounder-based sensitivity bound complementing HonestDiD's trend-based bound. Tractability: medium. Commit when: HonestDiD users ask for confounder bounds.
- Compositional-change inference (Sant'Anna & Xu, arXiv:2304.13925 v3, 2025). Corrects inference for rolling-panel repeated-cross-section designs (ACS, CPS) where sample composition changes across periods. Tractability: medium. Commit when: BRFSS tutorial or an applied user surfaces the issue.
- Triple-difference identification-with-covariates audit (Ortiz-Villavicencio & Sant'Anna, arXiv:2505.09942, 2025). The paper shows common DDD implementations are invalid under covariate-conditional identification. Audit existing `TripleDifference` / `StaggeredTripleDifference` against the paper. Tractability: small. Promote to Shipping Next if the audit finds a real issue.
Framed as what diff-diff offers, not which external tool plugs in:
- Standard post-estimation interface. Expose `.predict()` and `.vcov()` in shapes that common post-estimation slope / contrast / hypothesis-test interfaces consume. Tractability: small Protocol addition plus compatibility shim. Commit when: a concrete contract with one of the existing results objects is defined.
- Publication-table export. `result.to_table()` producing publication-quality HTML / PNG / LaTeX tables via an optional extra. Tractability: low. Commit when: `BusinessReport` ships so the formatter can piggyback on its summary pipeline.
- Survey design object interop. `SurveyDesign.from_design_object(...)` / `.to_design_object(...)` for accepting and emitting standard Python survey-design objects. Tractability: depends on upstream API stability. Commit when: a stable public design surface exists upstream.
- Pluggable regression engine for TWFE / event-study paths. Opt-in `engine=` parameter allowing alternative backends. Tractability: contained change plus coefficient-parity CI. Commit when: profiling shows material wins on real practitioner panels.
- New estimators beyond the list above without a user-driven demand signal.
- Calibration / raking / post-stratification as first-party features (remain upstream; document the handoff).
- Product Launch Regional Rollout and Loyalty Program tutorials (defer until a practitioner request).
- Methodology-vs-alternative comparison pages (replaced by BusinessReport and the tutorials that showcase diff-diff's output directly).
Long-running program, framed as "building toward" rather than with discrete ship dates.
Vision. A practitioner hands an AI agent a business scenario. The agent, with diff-diff as its toolkit, interprets the scenario, selects the correct estimator and identification strategy, executes the analysis with correct diagnostics and sensitivity, and returns a business-ready report. Practitioners never see raw coefficients unless they want to.
Building blocks already in place.
- Baker et al. (2025) 8-step workflow enforcement in `diff_diff/practitioner.py`.
- `practitioner_next_steps()` context-aware guidance.
- Runtime LLM guides via `get_llm_guide(...)` (`llms.txt`, `llms-full.txt`, `llms-practitioner.txt`), bundled in the wheel.
- Silent-operation warnings so agents and humans see the same signals at the same time.
Next blocks toward the vision.
- BusinessReport / DiagnosticReport (landed; see the major landings above) - the output form the vision assumes.
- Context-aware `practitioner_next_steps()` that substitutes actual column names - turns guidance into executable recommendations.
- AI-legible diagnostic surfaces - once BusinessReport ships, a structured JSON counterpart that agents can parse without screen-scraping human text.
- Scenario-to-estimator selection guidance - agent-facing extension of `docs/practitioner_decision_tree.rst` that returns a specific estimator choice plus rationale for a given scenario description.
- End-to-end scenario walkthrough templates - reusable orchestration recipes an agent can adapt from data ingest through business-ready output.
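One way to picture the scenario-to-estimator selection item above is a small rule table that maps scenario flags to an estimator name plus a rationale. The sketch below is purely hypothetical: the `Scenario` fields, the `suggest_estimator` function, and the rule order are assumptions for illustration, and the returned strings are labels taken from this roadmap rather than calls into diff-diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical flags an agent might extract from a scenario description."""
    treatment_reverses: bool = False      # units switch treatment on and off
    continuous_dose: bool = False         # treatment intensity varies by unit
    staggered_adoption: bool = False      # units adopt at different times
    repeated_cross_section: bool = False  # no unit-level panel is available


def suggest_estimator(s: Scenario) -> tuple[str, str]:
    """Return (estimator label, rationale). The rule order is an assumption."""
    if s.treatment_reverses:
        return ("ChaisemartinDHaultfoeuille",
                "only estimator listed for non-absorbing (on/off) treatments")
    if s.continuous_dose:
        return ("ContinuousDiD", "dose-response with a continuous treatment")
    if s.repeated_cross_section:
        return ("CallawaySantAnna(panel=False)",
                "supports repeated cross-sections such as BRFSS, ACS, CPS")
    if s.staggered_adoption:
        return ("CallawaySantAnna", "heterogeneity-robust under staggered timing")
    return ("Basic DiD", "single adoption date with a binary absorbing treatment")


name, why = suggest_estimator(Scenario(staggered_adoption=True))
print(f"{name}: {why}")
```

Returning the rationale alongside the choice keeps the recommendation auditable, which matters when an agent rather than a person makes the selection.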
Frontier methods that may graduate to Under Consideration given time and research signals.
Extends DiD to duration / survival outcomes where standard methods fail (hazard rates, time-to-event). Duration analogue of parallel trends; avoids distributional and hazard-function assumptions.
Reference: Deaner & Ku (2025), AEA Conference Paper.
Standard DiD assumes SUTVA; spatial and network spillovers violate this. Two-stage imputation approaches estimate treatment and spillover effects jointly under staggered timing.
Reference: Butts (2024), working paper.
Recover the full counterfactual distribution and quantile treatment effects (QTT), not just mean ATT. Changes-in-Changes (CiC) identification strategy.
Reference: Athey & Imbens (2006), Econometrica. (Ciaccio 2024 extension listed under Under Consideration.)
ML-powered conditional ATT, using a doubly robust meta-learner to discover which units benefit most from treatment.
Reference: Lan, Chang, Dillon & Syrgkanis (2025), working paper.
Machine-learning methods for discovering heterogeneous treatment effects in DiD settings. Recent applied-econometrics work (Gavrilova et al. 2025, Journal of Applied Econometrics) demonstrates the approach on panel data.
References: Athey & Wager (2019), Annals of Statistics; Kattenberg, Scheer & Thiel (2023), CPB Discussion Paper.
Unified framework encompassing synthetic control and regression approaches via low-rank matrix recovery.
Reference: Athey et al. (2021), Journal of the American Statistical Association.
Machine learning nuisance estimation in high-dimensional DiD settings.
Reference: Chernozhukov et al. (2018), The Econometrics Journal.
- Randomization inference: exact p-values for small samples.
- Bayesian DiD: priors on parallel-trends violations.
- Conformal inference: prediction intervals with finite-sample guarantees.
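The randomization-inference item above can be sketched for a simple two-period design with few treated units. The example enumerates every assignment with the same number of treated units and reports the share whose DiD statistic is at least as extreme as the observed one. It assumes the sharp null of no effect for any unit and that each such assignment was equally likely; the function names are hypothetical and independent of diff-diff.

```python
from itertools import combinations


def did_statistic(changes, treated):
    """Mean post-minus-pre change of treated units minus that of controls."""
    t = [changes[i] for i in treated]
    c = [d for i, d in enumerate(changes) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)


def randomization_p_value(changes, treated):
    """Exact two-sided p-value over all assignments with the same number treated."""
    treated = frozenset(treated)
    observed = abs(did_statistic(changes, treated))
    draws = [
        abs(did_statistic(changes, frozenset(combo)))
        for combo in combinations(range(len(changes)), len(treated))
    ]
    # Small tolerance guards against floating-point ties with the observed value.
    extreme = sum(d >= observed - 1e-12 for d in draws)
    return extreme / len(draws)


# Post-minus-pre outcome changes for six units; unit 0 is the only treated unit.
changes = [5.0, 0.1, -0.2, 0.3, 0.0, -0.1]
p = randomization_p_value(changes, {0})
print(f"exact p-value = {p:.3f}")
```

With one treated unit out of six, the smallest attainable p-value is 1/6, which shows why exact methods are informative about what a small sample can and cannot establish.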
Interested in contributing? Under Consideration items with clear commit criteria are good candidates. See the GitHub repository for open issues.
Key references:
- Roth et al. (2023). "What's Trending in Difference-in-Differences?" Journal of Econometrics.
- Baker et al. (2025). "Difference-in-Differences Designs: A Practitioner's Guide."
- Abadie, Angrist, Frandsen & Pischke (2025). "Harvesting Differences-in-Differences and Event-Study Evidence." NBER WP 34550.