Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance by igerber · Pull Request #240 · igerber/diff-diff

igerber · 2026-03-28T23:44:07Z

Summary

Phase 7a: Remove NotImplementedError gate for CallawaySantAnna IPW/DR + covariates + survey. Implement DRDID panel nuisance IF corrections (propensity score + outcome regression) for both survey-weighted and non-survey DR paths (Sant'Anna & Zhao 2020, Theorem 3.1). Extract _safe_inv() helper for matrix inversions.
Phase 7d: Thread survey degrees of freedom through HonestDiD for t-distribution critical values. Compute full event-study variance-covariance matrix from influence function vectors in CallawaySantAnna aggregation. Add event_study_vcov field to CallawaySantAnnaResults and survey_metadata/df_survey to HonestDiDResults.
Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc(), _compute_att_gt_rc(), and three RC estimation methods (_outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap mixins. RCS data generator via generate_staggered_data(panel=False).

Methodology references

Method name(s): Callaway-Sant'Anna (2021), Sant'Anna & Zhao (2020) DRDID panel/cross-section, Rambachan & Roth (2023) HonestDiD
Paper / source link(s):
- Sant'Anna, P.H.C. & Zhao, J. (2020). "Doubly Robust Difference-in-Differences Estimators." J. Econometrics 219(1). Theorem 3.1 (panel IF corrections), Section 4 (cross-sectional DRDID).
- Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." J. Econometrics 225(2). Section 4.1 (repeated cross-sections).
- Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Rev. Econ. Studies 90(5).
Intentional deviations: DR nuisance IF corrections use the same survey-weighted Hessian/score pattern as the existing IPW path. Non-survey DR path also receives IF corrections (was plug-in only). Per-cell SEs remain IF-based (not full TSL) — documented in REGISTRY.md. Event-study VCV under replicate weights falls back to diagonal (multivariate replicate VCV deferred).

Validation

Tests added/updated:
- tests/test_survey_phase7a.py (22 tests): smoke, scale invariance, uniform-weight equivalence, IF correction, aggregation, bootstrap, edge cases
- tests/test_staggered_rc.py (23 tests): all methods, covariates, survey, aggregation, bootstrap, control groups, base periods, data generator, edge cases
- tests/test_honest_did.py (+4 tests): survey df extraction, VCV computation, bounds widening, no-survey baseline
- tests/test_survey_phase4.py: 2 negative tests converted to positive assertions
Full test suite: 365 tests pass across all affected files (0 failures)

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…DiD survey variance Phase 7a: Remove NotImplementedError gate for IPW/DR + covariates + survey. Add DRDID panel nuisance IF corrections (PS + OR) for both survey and non-survey DR paths. Extract _safe_inv helper for matrix inversions. Phase 7d: Thread survey df through HonestDiD for t-distribution critical values. Compute full event-study VCV from influence function vectors. Add event_study_vcov to CallawaySantAnnaResults. Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc, _compute_att_gt_rc, and three RC estimation methods (reg, ipw, dr) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap. RCS data generator in generate_staggered_data(panel=False). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-28T23:52:17Z

PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1: the new repeated-cross-section CallawaySantAnna paths do not fully match the documented/source-method inference contract, and the new HonestDiD survey covariance plumbing diverges from the registry for replicate-weight designs.

Executive summary

Affected methods: Callaway-Sant'Anna (panel=False, IPW/DR, aggregation) and HonestDiD (survey-aware event-study covariance).
The Phase 7a panel DR survey changes look broadly aligned with the updated registry; I did not find a blocker in the new panel nuisance-correction code itself.
The new repeated-cross-section IPW/DR analytic SEs are still plug-in only and do not implement the cross-sectional nuisance-estimation IF corrections the registry says are supported.
For panel=False, unweighted simple/event-study aggregation uses time-specific treated-cell counts while the WIF code uses full-sample cohort shares, so the weighting contract is internally inconsistent.
The new HonestDiD covariance path for replicate-weight surveys does not implement the documented diagonal fallback; it builds a full Psi'Psi covariance and passes that to HonestDiD.
The new public panel parameter is not propagated into CallawaySantAnnaResults, and result summaries still label repeated-cross-section observation counts as “units”.

Methodology

Cross-check basis: docs/methodology/REGISTRY.md:L291-L319, docs/methodology/REGISTRY.md:L419-L424, and docs/methodology/REGISTRY.md:L1633-L1637.

Severity: P1. diff_diff/staggered.py:L2872-L2978, diff_diff/staggered.py:L2980-L3127, docs/methodology/REGISTRY.md:L423-L424.
Impact: the new repeated-cross-section IPW and DR analytic inference does not include nuisance-estimation IF corrections. _ipw_estimation_rc() computes SE from the plug-in IF only, and _doubly_robust_rc() likewise stops at the plug-in IF. That is a mismatch with the registry’s claim that panel=False uses Section 4 cross-sectional DRDID with per-observation influence functions, and it is notably weaker than the panel IPW/DR code in the same file, which now adds explicit PS/OR correction terms. The result is understated or otherwise incorrect SEs/CIs/p-values for covariate-adjusted RCS IPW/DR.
Concrete fix: implement the Section 4 cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR, or explicitly document the deviation in REGISTRY.md and disable analytic inference for those branches until the correct IF is in place.
Severity: P1. diff_diff/staggered_aggregation.py:L37-L152, diff_diff/staggered_aggregation.py:L289-L314, diff_diff/staggered_aggregation.py:L574-L645, diff_diff/staggered_bootstrap.py:L223-L267, diff_diff/staggered_bootstrap.py:L560-L657.
Impact: the new unweighted panel=False aggregation uses data["n_treated"] from each (g,t) cell as the aggregation weight, but the WIF path for the same estimator computes pg from full-sample cohort counts. In panel data those coincide because cohort size is constant across t; in repeated cross-sections they generally do not. That means the point estimate, WIF correction, and bootstrap aggregation are no longer using the same weight definition. This changes the estimand/finite-sample weighting and makes the SE formula internally inconsistent with the aggregated estimator.
Concrete fix: precompute fixed cohort masses for panel=False once from the full repeated-cross-section sample, then use those same cohort masses everywhere simple/event-study/bootstrap weights are formed.
Severity: P1. diff_diff/staggered_aggregation.py:L710-L739, diff_diff/honest_did.py:L664-L669, docs/methodology/REGISTRY.md:L1637-L1637.
Impact: the registry explicitly says replicate-weight event-study covariance should fall back to a diagonal matrix until multivariate replicate VCV is implemented, but _aggregate_event_study() currently builds a full Psi.T @ Psi matrix for all non-TSL cases, which includes replicate-weight designs. HonestDiD then consumes that full matrix whenever event_study_vcov is present. That is an undocumented methodology mismatch and can change HonestDiD bounds under replicate designs without warning.
Concrete fix: when uses_replicate_variance is true, do not populate a full off-diagonal event_study_vcov; set it to None or an explicit diagonal-from-SEs fallback until a proper multivariate replicate covariance estimator is implemented and validated.
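The fallback described in the fix above can be sketched as follows. This is a minimal illustration, not the diff_diff implementation: the function name `event_study_vcov_sketch`, its arguments, and the assumption that the influence-function matrix is already scaled so that Psi'Psi is the covariance estimate are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def event_study_vcov_sketch(
    psi: Matrix,
    ses: List[float],
    uses_replicate_variance: bool,
) -> Matrix:
    """Hypothetical helper: covariance matrix across event times.

    psi: n_obs x n_event_times influence-function values (assumed pre-scaled).
    ses: per-event-time standard errors already reported by the estimator.
    """
    k = len(ses)
    if uses_replicate_variance:
        # Diagonal fallback: keep reported variances, set covariances to zero.
        return [[ses[i] ** 2 if i == j else 0.0 for j in range(k)] for i in range(k)]
    # Full covariance: entry (i, j) is the sum over observations of psi_i * psi_j.
    return [
        [sum(row[i] * row[j] for row in psi) for j in range(k)]
        for i in range(k)
    ]


psi_demo = [[1.0, 0.0], [1.0, 2.0]]
full = event_study_vcov_sketch(psi_demo, [1.0, 2.0], uses_replicate_variance=False)
diag = event_study_vcov_sketch(psi_demo, [1.0, 2.0], uses_replicate_variance=True)
```

In the toy input the full matrix carries a nonzero off-diagonal term, while the replicate-weight branch keeps only the squared SEs. That off-diagonal difference is what can move HonestDiD bounds.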

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings.

Maintainability

Severity: P1. diff_diff/staggered.py:L258-L336, diff_diff/staggered.py:L1374-L1386, diff_diff/staggered.py:L1774-L1785, diff_diff/staggered_results.py:L63-L123, diff_diff/staggered_results.py:L157-L163.
Impact: the new public panel parameter is not stored on CallawaySantAnnaResults, even though it changes how counts and aggregation should be interpreted. For panel=False, the fit path stores observation counts in n_treated_units / n_control_units, but the results API and summary still present them as “units”. That makes downstream use ambiguous and violates the expected propagation pattern for new public parameters.
Concrete fix: add panel: bool (or equivalent index-space metadata) to CallawaySantAnnaResults, populate it in fit(), and switch summary labels/output semantics to observation-based wording when panel=False.

Tech Debt

No separate findings. The new TODO.md changes only resolve prior items; they do not track the P1 issues above, so those remain unmitigated.

Security

No findings.

Documentation/Tests

Severity: P2. diff_diff/staggered.py:L119-L184, docs/api/staggered.rst:L15-L32, tests/test_staggered_rc.py:L48-L351, tests/test_honest_did.py:L1125-L1244.
Impact: the public CallawaySantAnna class docstring still does not describe panel, so the generated API docs do not expose the new parameter contract. The new RCS tests mostly check “finite result” behavior, not methodology alignment, aggregation weights, or SE correctness. The HonestDiD additions test Taylor-style survey inputs, but there is no test for the new replicate-weight covariance branch, which is exactly where the implementation currently diverges from the registry.
Concrete fix: document panel in the estimator docstring/API docs, add a reference-validation test against did::att_gt(panel=FALSE) for at least one aggregation case and one IPW/DR covariate case, and add a replicate-weight HonestDiD test that asserts diagonal fallback until multivariate replicate VCV is implemented.

Path to Approval

Implement cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR analytic inference, or explicitly document and temporarily disable those analytic SEs.
Replace per-cell n_treated weighting with fixed cohort-mass weighting throughout panel=False simple/event-study/bootstrap aggregation.
Make replicate-weight HonestDiD use the documented diagonal event-study covariance fallback instead of a full Psi.T @ Psi matrix.
Propagate panel into CallawaySantAnnaResults and correct result-summary labels/semantics for repeated cross-sections.
Add reference tests for panel=False methodology and a replicate-weight HonestDiD regression test.

Static review only; I was not able to do runtime spot-checks in this environment because the Python runtime here is missing required dependencies such as numpy.

… VCV, panel on results Fix 5 findings from PR #240 CI review: - Add cross-sectional nuisance IF corrections (PS + OR) to _ipw_estimation_rc and _doubly_robust_rc, matching panel path methodology - Use fixed full-sample cohort masses for unweighted RCS aggregation weights (consistency with WIF group-share denominator) - Guard replicate-weight designs from full event-study VCV (diagonal fallback) - Add panel field to CallawaySantAnnaResults, fix summary labels for RCS - Add panel to class docstring, replicate VCV test, RCS IF correction test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T00:15:06Z

/ai-review

github-actions · 2026-03-29T00:23:10Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4bf566d1f94ebca6500b5ffe806c87302aa9bb9a

PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1. The prior HonestDiD replicate-weight fallback and panel propagation issues look resolved, but two repeated-cross-section inference problems remain.

Executive Summary

The remaining blockers are both in the new Callaway-Sant'Anna panel=False path.
Repeated-cross-section DR analytical inference is still internally inconsistent: the nuisance IF corrections reuse the post-period treated denominator for both the post and base-period components.
The earlier RCS weighting-contract fix is only partial: balance_e event-study aggregation and the unweighted bootstrap overall/event-study helpers still revert to per-cell n_treated counts instead of fixed cohort masses.
HonestDiD’s survey-variance changes now appear aligned with the updated registry, including the replicate-weight diagonal fallback.
The new RC tests are mostly smoke/finite-result checks and would not catch either remaining RC inference defect.

Methodology

Affected methods: Callaway-Sant'Anna repeated cross sections (panel=False, Section 4-style DRDID and aggregation/bootstrap). The HonestDiD survey-variance path looks consistent with the updated registry note.

Severity: P1. Impact: In panel=False DR, the point estimator correctly uses separate treated normalizers for the post and base-period pieces (sw_gt_sum vs sw_gs_sum, or n_gt vs n_gs), but the nuisance IF corrections collapse both periods onto a single normalizer = sum(sw_gt) or n_gt. That mis-scales both the PS correction and the base-period OR correction whenever the cohort-g sample size or treated weight sum differs across periods, which is the ordinary repeated-cross-section case. The resulting analytical SE/CIs/p-values are inconsistent with the estimator and with the Section 4 repeated-cross-section decomposition promised in the registry. Concrete fix: use separate normalizer_t and normalizer_s throughout M2_dr, M1_t, and M1_s, matching the denominators used in att_t_aug and att_s_aug; add a regression test with n_gt != n_gs and unequal treated weight sums. References: diff_diff/staggered.py:L3118-L3159 diff_diff/staggered.py:L3184-L3228 docs/methodology/REGISTRY.md:L423-L424
Severity: P1. Impact: The earlier panel=False weighting-contract bug is only partially fixed. Analytical simple/event-study aggregation now uses fixed cohort masses, but unweighted event-study with balance_e and the unweighted bootstrap helpers still fall back to cell-specific n_treated. In repeated cross-sections those cell counts vary by period, so the bootstrap SEs/CIs and balance_e event-study weights no longer correspond to the estimator/WIF denominator used elsewhere in the same results object. Concrete fix: compute one unweighted cohort-mass map from precomputed["unit_cohorts"] and use it everywhere panel=False aggregation weights are formed, including the balance_e branch in _aggregate_event_study(), overall bootstrap weights, and _prepare_event_study_aggregation(). References: diff_diff/staggered_aggregation.py:L76-L100 diff_diff/staggered_aggregation.py:L590-L610 diff_diff/staggered_aggregation.py:L621-L648 diff_diff/staggered_bootstrap.py:L229-L243 diff_diff/staggered_bootstrap.py:L561-L575 diff_diff/staggered_bootstrap.py:L614-L631
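A toy illustration of the weighting-contract point above, assuming a simple weighted average of ATT(g,t) cells. The cell values, counts, and the `aggregate` helper are hypothetical and only show why per-cell counts and fixed cohort masses give different aggregates once cell counts vary by period.

```python
from typing import Callable, Dict, Tuple

Cell = Tuple[int, int]  # (cohort g, period t)


def aggregate(att_gt: Dict[Cell, float], weight: Callable[[Cell], float]) -> float:
    """Weighted average of cell-level effects under a given weight rule."""
    total = sum(weight(cell) for cell in att_gt)
    return sum(weight(cell) * att for cell, att in att_gt.items()) / total


# Hypothetical repeated-cross-section cells: cohort 1 is sampled unevenly over time.
att_gt = {(1, 1): 1.0, (1, 2): 3.0, (2, 2): 5.0}
cell_counts = {(1, 1): 2, (1, 2): 8, (2, 2): 10}   # realized n_treated per cell
cohort_mass = {1: 10, 2: 10}                       # fixed full-sample cohort size

per_cell = aggregate(att_gt, lambda cell: cell_counts[cell])
fixed = aggregate(att_gt, lambda cell: cohort_mass[cell[0]])
```

Here `per_cell` is 3.8 and `fixed` is 3.0, so any path that silently reverts to per-cell counts targets a different aggregate than the one built from fixed cohort masses.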

Code Quality

No findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. Neither remaining P1 issue is tracked in TODO.md:L48-L60, so TODO.md does not mitigate them.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new RC tests are largely finiteness/smoke checks. They do not exercise the two cases that break the current implementation: unequal treated cohort sizes across the two RC periods, and panel=False aggregation/bootstrap weight formation under balance_e or n_bootstrap>0. As written, the suite would pass with both P1 bugs still present. Concrete fix: add deterministic panel=False fixtures where cohort-g counts differ between t and s, assert the corrected DR SE against a hand-built IF/reference calculation, and add bootstrap/event-study tests that verify fixed cohort-mass weights under balance_e and n_bootstrap>0. References: tests/test_staggered_rc.py:L135-L152 tests/test_staggered_rc.py:L258-L300 tests/test_staggered_rc.py:L362-L381
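A fixture of the kind requested above might start from something like the following. It is a deliberately simplified sketch, not the DRDID influence function: it only shows that a base-period term divided by the post-period treated sum is mis-scaled whenever the two sums differ. All names and numbers are hypothetical.

```python
from typing import Sequence


def scaled_sum(values: Sequence[float], weights: Sequence[float], normalizer: float) -> float:
    """Sum of w_i * v_i divided by an explicitly supplied normalizer."""
    return sum(w * v for w, v in zip(weights, values)) / normalizer


# Toy fixture: cohort g has 2 treated observations at t and 4 at s.
post_vals, post_w = [2.0, 4.0], [1.0, 1.0]
base_vals, base_w = [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]

normalizer_t = sum(post_w)  # treated weight sum in the post period
normalizer_s = sum(base_w)  # treated weight sum in the base period

# Each period uses its own treated weight sum.
separate = (scaled_sum(post_vals, post_w, normalizer_t)
            - scaled_sum(base_vals, base_w, normalizer_s))

# Base period wrongly divided by the post-period sum.
shared = (scaled_sum(post_vals, post_w, normalizer_t)
          - scaled_sum(base_vals, base_w, normalizer_t))
```

With equal cohort counts in both periods the two quantities coincide, which is why a fixture with n_gt != n_gs is needed to expose the defect.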

Path to Approval

Fix the repeated-cross-section DR nuisance IF normalization so the post-period and base-period PS/OR corrections use their own treated denominators (sw_gt_sum/sw_gs_sum or n_gt/n_gs).
Finish propagating fixed cohort-mass weighting through all panel=False aggregation paths, especially balance_e event-study aggregation and the unweighted overall/event-study bootstrap helpers.
Add regression tests covering unequal RC cohort counts across periods and panel=False bootstrap/event-study weight consistency.

Static review only: I could not execute the test suite here because this environment lacks pytest and scientific Python dependencies such as numpy.

…equal-count tests - Use separate normalizer_t/normalizer_s in _doubly_robust_rc() IF corrections (post vs base period treated denominators must match the DR estimator) - Precompute rcs_cohort_masses in _precompute_structures_rc() and return cohort mass as n_treated from _compute_att_gt_rc() — fixes all downstream consumers (aggregation, bootstrap, balance_e) at the source instead of per-consumer patches - Remove now-unnecessary rcs_cohort_counts blocks from aggregation - Add unequal cohort count test fixture and regression tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T00:50:59Z

/ai-review

github-actions · 2026-03-29T01:01:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6080f927c1c76a760b7825b59d5f46032da64b1c

PR Review

Overall assessment
⚠️ Needs changes. The highest unmitigated severity is P1. The prior RC normalizer and fixed-cohort-weighting blockers appear resolved, but the new panel=False estimator formulas still do not line up with the repeated-cross-section methods this PR claims to implement.

Executive summary

The earlier re-review blockers look fixed: RC DR no longer reuses one denominator for both periods, and fixed cohort-mass weighting now propagates through analytical aggregation and bootstrap.
The remaining blocker is methodological: the new repeated-cross-section reg and dr paths do not match the DRDID / did::att_gt(panel=FALSE) estimators they are supposed to mirror.
_outcome_regression_rc() uses separate pre/post treated residual averages, but the reference reg_did_rc estimator pools treated weights when averaging the predicted change.
_doubly_robust_rc() is further off: it uses only control-group ORs and normalizes the control augmentation terms by treated-period masses, which does not match either drdid_rc or the simpler AIPW repeated-cross-section formula.
The added RC tests are almost entirely smoke/finite-result checks, so this kind of formula mismatch passes undetected.
The HonestDiD survey-df / event-study-vcov changes look consistent with the new registry note.

Methodology

Severity: P1. Impact: The new repeated-cross-section reg path in _outcome_regression_rc at diff_diff/staggered.py:L2795 computes ATT = mean_t(Y - m_t(X)) - mean_s(Y - m_s(X)) using separate treated averages for the post and base periods (diff_diff/staggered.py:L2843, diff_diff/staggered.py:L2859, diff_diff/staggered.py:L2869). did::att_gt(panel=FALSE, est_method="reg") dispatches to DRDID::reg_did_rc, and that estimator averages the predicted change over the treated group with pooled treated weights rather than separate pre/post treated residual means. That is a different finite-sample estimator whenever treated-sample composition differs across the two cross-sections. The registry note at docs/methodology/REGISTRY.md:L423 documents panel=False support, but not this estimator change. Concrete fix: Rework _outcome_regression_rc() and its IF to match reg_did_rc exactly, or explicitly document and rename a different RC regression estimator if that deviation is intentional.
Severity: P1. Impact: The new repeated-cross-section dr path in _doubly_robust_rc at diff_diff/staggered.py:L3031 does not match the cited DRDID repeated-cross-section estimators. The point estimator uses only control-group ORs (diff_diff/staggered.py:L3059) and divides the control augmentation terms by treated-period masses (diff_diff/staggered.py:L3131, diff_diff/staggered.py:L3153), with the same normalization baked into the IF corrections (diff_diff/staggered.py:L3190, diff_diff/staggered.py:L3211). But did::att_gt(panel=FALSE, est_method="dr") dispatches to DRDID::drdid_rc; that locally efficient estimator includes treated- and control-group outcome-regression pieces in both periods, and even the simpler AIPW repeated-cross-section formula normalizes each treated/control pre/post component by its own weight sum rather than the treated totals. So the current code changes both point estimates and IF-based SEs whenever reweighted control mass differs from treated mass. Concrete fix: Pick one specific Section 4 DR estimator (drdid_rc, aipw_did_rc1, or another named variant), implement its point estimator and IF end-to-end, and update the registry to name that exact estimator.
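The difference between the two regression forms described in the first finding can be seen in a small unweighted example. This is a sketch under stated assumptions: `m_post` and `m_pre` stand in for already-fitted control-group outcome regressions, and the data are made up. The "pooled" form follows the review's description of reg_did_rc (period-specific treated means of Y, with the predicted change averaged over all treated observations); it is not copied from the DRDID source.

```python
from statistics import fmean


def m_post(x: float) -> float:
    """Hypothetical fitted control-group outcome regression, post period."""
    return 1.0 + 2.0 * x


def m_pre(x: float) -> float:
    """Hypothetical fitted control-group outcome regression, base period."""
    return 0.5 + 1.0 * x


# Treated observations as (x, y); covariate composition differs across periods.
treated_post = [(1.0, 6.0), (3.0, 10.0)]
treated_pre = [(0.0, 1.0), (0.0, 2.0)]

# Separate form: residual means taken within each period's treated sample.
separate = (fmean(y - m_post(x) for x, y in treated_post)
            - fmean(y - m_pre(x) for x, y in treated_pre))

# Pooled form: predicted change averaged over all treated observations.
pooled_change = fmean(m_post(x) - m_pre(x) for x, _ in treated_post + treated_pre)
pooled = (fmean(y for _, y in treated_post)
          - fmean(y for _, y in treated_pre)
          - pooled_change)
```

The two forms agree when the treated covariate distribution is the same in both cross-sections. Here it is not, and they give 2.0 and 5.0 respectively.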

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P2. Impact: In the RC path, _compute_att_gt_rc() stores the full cohort mass in n_treated so downstream aggregation can reuse existing weight plumbing (diff_diff/staggered.py:L2710), but the public results contract still documents n_treated as the number of treated observations for that group-time cell (diff_diff/staggered_results.py:L21). That silently turns a reporting field into an aggregation-weight field. Concrete fix: Keep n_treated as the actual cell count and add a separate cohort_mass / agg_weight field for RC aggregation.

Tech Debt

No separate findings. The P1 methodology issue above is not tracked in TODO.md:L48, so it remains unmitigated.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new RC test file is almost entirely smoke coverage: it checks that fits are finite, methods run, and bootstrap returns something, but never compares panel=False reg/dr against DRDID formulas, did outputs, or even hand-built 2×2 RC reference values (tests/test_staggered_rc.py:L48, tests/test_staggered_rc.py:L160, tests/test_staggered_rc.py:L413). That means the formula mismatch above passes cleanly. Concrete fix: Add deterministic repeated-cross-section fixtures where treated covariate composition differs between pre and post, then assert reg/dr ATT and SE outputs against checked-in DRDID / R reference values or hand-verified formulas.

Path to Approval

Re-implement _outcome_regression_rc() to match the repeated-cross-section reg_did_rc formula and influence function used by did::att_gt(panel=FALSE, est_method="reg").
Re-implement _doubly_robust_rc() to match one named DRDID repeated-cross-section estimator exactly, including the correct normalization and nuisance-IF terms.
Add RC reference-value tests that fail under the current formulas, not just finiteness/smoke checks.

…ulas _outcome_regression_rc: Pool all treated obs for OR correction term (was: separate per-period averages). Period-specific treated means for Y. Matches Sant'Anna & Zhao (2020) Eq 2.2 / R reg_did_rc exactly. _doubly_robust_rc: Fit 4 OLS models (control+treated, pre+post) for locally efficient DR estimator (was: 2 control-only). Implements tau_1 (AIPW) + tau_2 (local efficiency adjustment) with full 11-component IF. Matches Sant'Anna & Zhao (2020) Eq 3.3+3.4 / R drdid_rc exactly. Add agg_weight field to group_time_effects for RCS aggregation weight (cohort mass), separate from n_treated (per-cell display count). Aggregation uses data.get("agg_weight", data["n_treated"]) for backward compatibility with panel data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T01:47:57Z

/ai-review

github-actions · 2026-03-29T01:56:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b623deeed20d79ab05bdf255819745c93140371a

Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

The prior repeated-cross-section point-estimator blockers appear addressed: the new reg path now pools the treated OR correction, and the new dr path includes treated-side OR terms plus nuisance IF corrections.
[Newly identified] The non-survey panel=False bootstrap path still reaggregates with realized cell counts n_{g,t} instead of the fixed cohort mass agg_weight/N_g, so bootstrap overall and event-study inference target a different estimator than the analytical path.
That mismatch is not covered by the new methodology note for repeated cross-sections in docs/methodology/REGISTRY.md:L423, and it is not tracked in TODO.md:L48-L60 or TODO.md:L167-L170.
The HonestDiD survey-df / event-study-vcov changes are consistent with the new registry note; no separate methodology defect stood out there.
The new RC bootstrap tests are smoke-only, so this weighting regression currently slips through.

Methodology

Severity: P1 [Newly identified] Impact: In the new repeated-cross-section path, fit() stores a fixed cohort aggregation weight agg_weight for each (g,t) cell, and the analytical overall/event-study aggregations correctly use that fixed cohort mass rather than the per-cell treated count. But the non-survey bootstrap code still falls back to group_time_effects[gt]["n_treated"] for both overall ATT reaggregation and event-study reaggregation. References: diff_diff/staggered.py:L1489-L1515, diff_diff/staggered_aggregation.py:L76-L88, diff_diff/staggered_aggregation.py:L576-L586, diff_diff/staggered_bootstrap.py:L223-L242, diff_diff/staggered_bootstrap.py:L570-L630, docs/methodology/REGISTRY.md:L423. In unequal-cohort RC samples, that silently changes the bootstrap target from fixed-N_g weighting to varying n_{g,t} weighting, so overall/event-study bootstrap SEs, percentile CIs, and p-values are no longer attached to the same estimator reported by the analytical path. Concrete fix: make the non-survey bootstrap read agg_weight (or the precomputed fixed cohort mass) everywhere it currently reads n_treated for RC overall/event-study reaggregation.
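A minimal sketch of the fix above, assuming each group-time cell is a dict that carries `n_treated` and, for repeated cross-sections only, `agg_weight`. The helper names are hypothetical; the fallback expression mirrors the `data.get("agg_weight", data["n_treated"])` pattern mentioned in the earlier commit message.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (cohort g, period t)


def aggregation_weight(cell_data: Dict[str, float]) -> float:
    """Fixed cohort mass when present (RC path), else the per-cell count (panel path)."""
    return cell_data.get("agg_weight", cell_data["n_treated"])


def normalized_weights(group_time_effects: Dict[Cell, Dict[str, float]]) -> List[float]:
    """Weights that both the analytical and bootstrap reaggregation would share."""
    raw = [aggregation_weight(d) for d in group_time_effects.values()]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical RC cells: same cohort, unequal realized counts across periods.
rc_cells = {
    (1, 1): {"effect": 1.0, "n_treated": 2, "agg_weight": 10},
    (1, 2): {"effect": 3.0, "n_treated": 8, "agg_weight": 10},
}
# Hypothetical panel cells: no agg_weight key, so n_treated is used.
panel_cells = {
    (1, 1): {"effect": 1.0, "n_treated": 2},
    (1, 2): {"effect": 3.0, "n_treated": 8},
}

rc_w = normalized_weights(rc_cells)
panel_w = normalized_weights(panel_cells)
```

Routing every consumer through one accessor like this is one way to keep the bootstrap and analytical paths from drifting apart again.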

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3 Impact: event_study_vcov is cached on the estimator instance and later copied into results via getattr(self, "_event_study_vcov", None). If the same CallawaySantAnna object is reused, a fit that does not compute event-study aggregation can inherit stale covariance metadata from a previous fit. References: diff_diff/staggered_aggregation.py:L714-L755, diff_diff/staggered.py:L1772-L1796. Concrete fix: reset self._event_study_vcov = None at the start of fit() and only attach it when the current fit actually computed event_study_effects.
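The stale-cache hazard and the proposed reset can be shown with a stand-in class. `EstimatorSketch` is hypothetical and only mimics the attribute lifecycle described above; the placeholder matrix is not a real covariance.

```python
from typing import Dict, List, Optional


class EstimatorSketch:
    """Hypothetical stand-in for an estimator that caches a covariance on itself."""

    def __init__(self) -> None:
        self._event_study_vcov: Optional[List[List[float]]] = None

    def fit(self, aggregate: Optional[str] = None) -> Dict[str, object]:
        # Clear anything left over from a previous fit on the same object.
        self._event_study_vcov = None
        if aggregate == "event_study":
            # Placeholder value standing in for the computed covariance.
            self._event_study_vcov = [[1.0]]
        # Results copy whatever the current fit attached, possibly None.
        return {"event_study_vcov": getattr(self, "_event_study_vcov", None)}


est = EstimatorSketch()
first = est.fit(aggregate="event_study")
second = est.fit()  # no event-study aggregation requested this time
```

Without the reset on the first line of `fit()`, `second` would inherit the matrix computed for `first`.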

Tech Debt

No separate findings. The P1 bootstrap-weight issue above is not currently tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.

Security

No findings.

Documentation/Tests

Severity: P2 Impact: The new RC bootstrap coverage only checks that bootstrap runs, including on the unequal-cohort fixture, but never asserts that bootstrap aggregation uses the fixed cohort masses introduced for panel=False. References: tests/test_staggered_rc.py:L289-L300, tests/test_staggered_rc.py:L413-L482. Concrete fix: add a deterministic unequal-cohort panel=False test that verifies bootstrap overall/event-study prep uses the same fixed cohort weights as the analytical aggregation path.

Path to Approval

Update CallawaySantAnnaBootstrapMixin._run_multiplier_bootstrap() so non-survey panel=False overall ATT bootstrap weights use fixed cohort mass (agg_weight / precomputed cohort mass), not per-cell n_treated.
Update _prepare_event_study_aggregation() so non-survey panel=False event-study bootstrap weights use the same fixed cohort masses.
Add a regression test on the unequal-cohort RC fixture that fails if bootstrap reaggregation falls back to per-cell n_treated instead of fixed cohort mass.

- Bootstrap overall/event-study reaggregation now uses agg_weight (fixed cohort mass) for panel=False, matching the analytical aggregation path - Reset self._event_study_vcov = None at start of fit() to prevent stale VCV from prior fit leaking into reused estimator objects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T01:59:22Z

/ai-review

github-actions · 2026-03-29T02:09:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3b405b7f9009afb4d8d90397d931b327a5c65825

Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the changed tests in this environment because the default Python interpreter here does not have the project dependencies available.

Executive Summary

The prior repeated-cross-section bootstrap weighting blocker appears addressed: the new agg_weight plumbing is present in both analytical and bootstrap overall/event-study aggregation paths for panel=False. diff_diff/staggered_aggregation.py:L76-L88 diff_diff/staggered_aggregation.py:L576-L585 diff_diff/staggered_bootstrap.py:L223-L248 diff_diff/staggered_bootstrap.py:L576-L584
[Newly identified] Bootstrap-fit CallawaySantAnna results now carry an analytical event_study_vcov, and HonestDiD prefers that matrix over the bootstrap-updated event-study SEs. On n_bootstrap>0 fits, sensitivity analysis therefore uses a different variance path than the main reported event-study results. diff_diff/staggered_aggregation.py:L714-L755 diff_diff/staggered.py:L1709-L1733 diff_diff/staggered.py:L1773-L1799 diff_diff/honest_did.py:L664-L670
The new survey-df threading and replicate-weight diagonal fallback in HonestDiD look consistent with the new registry note; I did not find a separate methodology defect there. diff_diff/honest_did.py:L615-L685 diff_diff/staggered_aggregation.py:L740-L747 docs/methodology/REGISTRY.md:L1637-L1637
The added HonestDiD tests only cover the analytical VCV path and replicate fallback, not the bootstrap-to-HonestDiD path, so the variance-path regression above is currently unguarded. tests/test_honest_did.py:L1158-L1281

Methodology

Severity: P1 [Newly identified]. Impact: CallawaySantAnna.fit() computes and stores event_study_vcov from analytical IF vectors during event-study aggregation (diff_diff/staggered_aggregation.py:L714-L755), then, when n_bootstrap>0, overwrites event_study_effects[*]["se"], CIs, and p-values with bootstrap results while leaving that covariance matrix unchanged on the results object (diff_diff/staggered.py:L1709-L1733, diff_diff/staggered.py:L1773-L1799). HonestDiD now always prefers event_study_vcov when present (diff_diff/honest_did.py:L664-L670), so bootstrap-fit CS results silently feed analytical covariance into sensitivity analysis. That contradicts the Phase 7d intent that HonestDiD respect the same variance structure as the underlying event study (docs/methodology/REGISTRY.md:L1637-L1637). Concrete fix: when bootstrap inference is used for event-study results, either compute and store a bootstrap event-study covariance matrix from the bootstrap draws, or clear/ignore event_study_vcov so HonestDiD falls back to the bootstrap variance path instead of mixing analytical and bootstrap inference.
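The "clear or ignore" option in the fix above can be sketched as two small steps. Both function names are hypothetical and the logic is an assumption about how a consistent variance path could be enforced, not a description of the diff_diff or HonestDiD internals.

```python
from typing import List, Optional

Matrix = List[List[float]]


def vcov_to_store(analytical_vcov: Optional[Matrix], n_bootstrap: int) -> Optional[Matrix]:
    """Drop the analytical covariance whenever bootstrap inference replaced the SEs."""
    return None if n_bootstrap > 0 else analytical_vcov


def sigma_for_sensitivity(ses: List[float], stored_vcov: Optional[Matrix]) -> Matrix:
    """Use the stored covariance if present, else a diagonal built from reported SEs."""
    if stored_vcov is not None:
        return stored_vcov
    k = len(ses)
    return [[ses[i] ** 2 if i == j else 0.0 for j in range(k)] for i in range(k)]


analytical = [[1.0, 0.5], [0.5, 1.0]]
bootstrap_ses = [2.0, 3.0]

kept = vcov_to_store(analytical, n_bootstrap=0)
cleared = vcov_to_store(analytical, n_bootstrap=199)
sigma = sigma_for_sensitivity(bootstrap_ses, cleared)
```

The trade-off of this option is that bootstrapped fits lose the off-diagonal terms entirely. Storing a covariance computed from the bootstrap draws, the other option named in the fix, would keep them.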

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. The P1 above is not tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new HonestDiD tests validate analytical event_study_vcov creation and the replicate-weight diagonal fallback, but they do not cover CallawaySantAnna(..., n_bootstrap>0, aggregate="event_study"). That leaves the new covariance-source mismatch untested (tests/test_honest_did.py:L1158-L1281). Concrete fix: add a regression test that fits a bootstrap event study, runs HonestDiD.fit(), and asserts that the covariance source matches the bootstrap path (or that event_study_vcov is absent/recomputed from bootstrap draws on bootstrapped fits).

Path to Approval

Make bootstrap-fit CallawaySantAnna event-study results expose a variance path that matches the reported bootstrap SEs: either store a bootstrap event-study covariance matrix, or suppress/ignore analytical event_study_vcov on bootstrapped fits.
Add a targeted regression test for CallawaySantAnna(n_bootstrap>0, aggregate="event_study") -> HonestDiD.fit(...) that fails if HonestDiD consumes the analytical IF covariance instead of the bootstrap path.
If the analytical-on-bootstrap behavior is intentional, document that explicitly in docs/methodology/REGISTRY.md; otherwise the current behavior remains an undocumented methodology deviation.
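The first option, storing a bootstrap event-study covariance matrix, amounts to taking the sample covariance of the bootstrap draws across event times. A minimal sketch, assuming the draws are available as an array with one row per bootstrap replication and one column per event time (the array here is simulated, not produced by the library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bootstrap output: rows are draws, columns are event times.
n_draws, n_event_times = 999, 4
draws = rng.normal(size=(n_draws, n_event_times)) * np.array([0.1, 0.2, 0.3, 0.4])

# Full covariance across event times (each column is one variable).
boot_vcov = np.cov(draws, rowvar=False)

# Per-period bootstrap standard deviations, using the same n-1 normalization.
boot_se = draws.std(axis=0, ddof=1)
```

The diagonal of `boot_vcov` matches `boot_se ** 2` only because both use the same estimator. If the reported bootstrap SEs are computed differently (for example from a quantile range), the stored matrix would need to be made consistent with them.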

…dy SEs

Prevents HonestDiD from mixing analytical IF-based VCV with bootstrap SEs on bootstrap-fit CallawaySantAnna results. When n_bootstrap>0, the event_study_vcov is set to None so HonestDiD falls back to diagonal from the bootstrap SEs (consistent variance path). Add regression test: bootstrap CS → HonestDiD asserts vcov is None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T02:17:17Z

/ai-review

github-actions · 2026-03-29T02:33:33Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 53cfd5d5b06dd44f71f5272fe16a5e537543a327

Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the added tests here because pytest is not installed in this environment.

Executive Summary

The prior HonestDiD/bootstrap variance-path blocker appears fixed: bootstrap CS fits now clear analytical event_study_vcov, and there is a targeted regression test for that path. diff_diff/staggered.py:1323 diff_diff/staggered.py:1777 tests/test_honest_did.py:1283
The new repeated-cross-section covariate-adjusted reg path is not source-faithful: _outcome_regression_rc under-scales the control-side OLS estimation-effect term, so panel=False + covariates + estimation_method="reg" reports understated inference. diff_diff/staggered.py:2806 diff_diff/staggered.py:2920 diff_diff/staggered.py:2938 docs/methodology/REGISTRY.md:423
The new repeated-cross-section covariate-adjusted ipw and dr paths likewise under-scale their propensity-score nuisance corrections by averaging moments that are already normalized, which understates analytical SEs and the multiplier bootstrap on those paths. diff_diff/staggered.py:2951 diff_diff/staggered.py:3076 diff_diff/staggered.py:3093 diff_diff/staggered.py:3371 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423
REGISTRY.md documents Phase 7a/7b support, but it does not record either RCS IF-scaling difference as an intentional deviation, so these remain unmitigated methodology defects. docs/methodology/REGISTRY.md:419 docs/methodology/REGISTRY.md:423
The new tests mostly assert finiteness or coarse behavior; they do not compare RCS covariate-adjusted IF/SE against DRDID reference values, so the inference bug is currently unguarded. tests/test_staggered_rc.py:163 tests/test_staggered_rc.py:362 tests/test_survey_phase7a.py:60 tests/test_survey_phase7a.py:199

Methodology

Severity: P3. Impact: The previous HonestDiD/bootstrap covariance-source finding looks resolved. Bootstrapped CS fits now discard analytical event_study_vcov before results are stored, so HonestDiD falls back to the bootstrap-compatible diagonal path, and that exact regression is now tested. Concrete fix: None. diff_diff/staggered.py:1323 diff_diff/staggered.py:1777 tests/test_honest_did.py:1283
Severity: P1. Impact: _outcome_regression_rc says it matches DRDID::reg_did_rc, but its control-side OLS estimation-effect term is divided by the treated-mass denominator twice. M1 is already normalized by sum_w_D, then inf_ct / inf_cs divide inf_cont_2_* by sum_w_D again. That shrinks the nuisance-estimation piece of the influence function, so covariate-adjusted repeated-cross-section reg fits understate per-cell analytical SEs and any bootstrap path built from those IFs. Concrete fix: Keep M1 normalized as written and remove the extra / sum_w_D on inf_cont_2_ct and inf_cont_2_cs, or re-port the reg_did_rc IF algebra directly from the reference implementation. diff_diff/staggered.py:2824 diff_diff/staggered.py:2920 diff_diff/staggered.py:2938 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423
Severity: P1. Impact: The new RCS ipw and dr nuisance corrections are also mis-scaled. _ipw_estimation_rc normalizes w_ct / w_cs to w_*_norm and then forms M2_rc with np.mean(...), adding an extra 1/n_ct or 1/n_cs; _doubly_robust_rc likewise divides its PS moment by n_all after already normalizing by sum_w_ipw_*. In the standardized RC IPW and locally efficient RC DR references, those PS moments are ratio-of-means terms, not extra sample-size-scaled means. That under-scales the PS correction and therefore understates inference for covariate-adjusted repeated-cross-section ipw and dr, including survey-weighted fits. Concrete fix: Rewrite the PS moment construction to match std_ipw_did_rc / drdid_rc exactly: summed normalized-weight moments for standardized IPW, and ratio-of-means PS moments without the extra / n_all for DR. diff_diff/staggered.py:3022 diff_diff/staggered.py:3076 diff_diff/staggered.py:3111 diff_diff/staggered.py:3371 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423
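The arithmetic behind the "ratio-of-means versus extra mean" point can be checked on toy data. This is a sketch only: `w` and `x` are simulated stand-ins for control-side weights and a score-like term, not the arrays used in _ipw_estimation_rc.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
w = rng.uniform(0.5, 2.0, size=n)  # hypothetical control-side weights
x = rng.normal(size=n)             # hypothetical score-like term

w_norm = w / w.sum()               # weights normalized to sum to one

# Ratio-of-means form: mean(w * x) / mean(w).
ratio_of_means = np.mean(w * x) / np.mean(w)

# Summing the normalized-weight products gives the same quantity.
summed_normalized = np.sum(w_norm * x)

# Averaging them instead divides by the group size a second time.
averaged_normalized = np.mean(w_norm * x)
```

Here `summed_normalized` equals `ratio_of_means`, while `averaged_normalized` is smaller by a factor of `n`. Whether that factor is an error in a given estimator depends on how the surrounding influence-function terms are scaled.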

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. These methodology defects are not mitigated by tracking: the nearby CS item in TODO.md is already marked resolved. TODO.md:48 TODO.md:57

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests do not verify the magnitude of the RCS nuisance corrections. They check finiteness, positivity, or that different methods return different SEs, which would still pass with the scaling errors above. Concrete fix: Add fixed-data regression tests for panel=False reg / ipw / dr with covariates that compare at least one ATT(g,t) SE or influence-function vector against DRDID reference output, and assert that bootstrap SEs move with the corrected IF magnitude. tests/test_staggered_rc.py:163 tests/test_staggered_rc.py:362 tests/test_survey_phase7a.py:60 tests/test_survey_phase7a.py:199
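The gap between a finiteness check and a pinned-value check can be illustrated with two small predicates. Everything here is hypothetical: the helper names are invented for the sketch, and the numbers are placeholders, not DRDID output.

```python
import numpy as np

# Placeholder values for illustration only; not DRDID reference output.
reference_se = 0.1250
correct_se = 0.1250
understated_se = 0.1250 / 3.0  # e.g. a correction term that was shrunk


def weak_check(se):
    """Mirrors a finiteness/positivity assertion."""
    return bool(np.isfinite(se) and se > 0)


def pinned_check(se, reference, rtol=1e-6):
    """Mirrors a fixed-data regression assertion against a reference value."""
    return bool(np.isclose(se, reference, rtol=rtol))


weak_passes_on_bug = weak_check(understated_se)
pinned_passes_on_bug = pinned_check(understated_se, reference_se)
pinned_passes_on_fix = pinned_check(correct_se, reference_se)
```

The weak check accepts the understated SE, while the pinned check rejects it and accepts the correct one. In a real test the reference would come from running the reference implementation on a fixed dataset.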

Path to Approval

Fix _outcome_regression_rc so its OLS estimation-effect correction matches reg_did_rc, specifically by removing the extra treated-mass division after M1 has already been normalized.
Fix _ipw_estimation_rc and _doubly_robust_rc so their PS nuisance moments use the same ratio-of-means scaling as the DRDID repeated-cross-section references.
Add fixed-dataset regression tests for panel=False covariate-adjusted reg, ipw, and dr that check both analytical SEs and bootstrap SEs against reference values.
Keep the HonestDiD bootstrap covariance fix and its regression test; that prior blocker looks resolved.

…s sum not mean

- _outcome_regression_rc: M1 denominator changed from sum_w_D to n_all (matching R colMeans convention); inf_cont_2 / sum_w_D then gives correct single normalization by mean_w_D * n_all = sum_w_D
- _ipw_estimation_rc: PS M2 uses np.sum/n_all instead of np.mean (which divided by n_ct/n_cs instead of n_all, under-scaling the correction)
- _doubly_robust_rc: PS M2 already correct (np.sum/n_all), no change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-29T13:06:13Z

/ai-review

github-actions · 2026-03-29T13:15:58Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9ff21a2859e7713ca00a4d0359f855dc272d0b7f

Overall Assessment

Needs changes — highest unmitigated severity: P1.

Static review plus source cross-check only. I could not run pytest here because pytest is not installed in this environment.

Executive Summary

Re-review outcome: the prior HonestDiD/bootstrap covariance finding appears fixed, but both prior repeated-cross-section inference findings remain unresolved.
The new repeated-cross-section reg path still under-scales the control-side OLS nuisance correction, so panel=False + covariates + estimation_method="reg" understates inference.
The new repeated-cross-section ipw and dr paths still under-scale the propensity-score nuisance corrections, so their analytical SEs and multiplier bootstrap remain too small.
REGISTRY.md and TODO.md now mark Phase 7a/7b as resolved, but these remaining variance/IF mismatches are not documented deviations and are not mitigated by tracking.
The new tests mostly check finiteness or coarse behavior; they still do not pin RCS covariate-adjusted IF/SE magnitudes to DRDID reference output.

Methodology

Severity: P3. Impact: the prior HonestDiD/bootstrap covariance blocker looks resolved. fit() now resets stale event_study_vcov state and clears the analytical event-study covariance on bootstrap fits before storing results, with regression coverage in the HonestDiD tests. Concrete fix: none. diff_diff/staggered.py:L1323, diff_diff/staggered.py:L1777, tests/test_honest_did.py:L1283
Severity: P1. Impact: the prior RCS reg scaling finding remains unresolved. In _outcome_regression_rc, the local port still computes M1 = sum(w_D * X) / n_all and then divides the assembled estimation-effect term by sum_w_D, which leaves an extra 1 / n_all shrinkage in the OLS nuisance correction. Relative to DRDID::reg_did_rc, that understates per-cell analytical SEs for panel=False + covariates + estimation_method="reg", and the multiplier bootstrap inherits the same understatement because it perturbs the stored IF vectors directly. Concrete fix: port reg_did_rc literally in the local phi = psi / n convention: either use M1 = sum(w_D * X) with the current / sum_w_D, or keep colMeans(...) and divide by mean_w_D instead of sum_w_D. diff_diff/staggered.py:L2806, diff_diff/staggered.py:L2921, diff_diff/staggered.py:L2939, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57
Severity: P1. Impact: the prior RCS ipw / dr scaling finding also remains unresolved. _ipw_estimation_rc and _doubly_robust_rc normalize the control weights first (w_ct_norm, w_ipw_* / sum_w_ipw_*) and then divide the propensity-score moments by n_all again before multiplying by asy_lin_rep_ps. In DRDID::std_ipw_did_rc and DRDID::drdid_rc, those moments are already normalized by the control-weight means; after translating to the local phi = psi / n convention, there should not be an additional / n_all. The result is understated analytical SEs and understated multiplier-bootstrap dispersion for panel=False covariate-adjusted ipw and dr, including survey-weighted fits. Concrete fix: remove the extra / n_all from the RCS PS moments and port the std_ipw_did_rc / drdid_rc nuisance terms term-for-term. diff_diff/staggered.py:L3071, diff_diff/staggered.py:L3086, diff_diff/staggered.py:L3382, diff_diff/staggered.py:L3397, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57
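The two equivalent normalizations named in the reg finding, and the mixed form it objects to, differ only by scalar factors that are easy to verify numerically. The sketch below uses simulated `X`, `D`, and `w`; it checks the M1 scaling alone and does not reproduce the rest of the influence-function assembly.

```python
import numpy as np

rng = np.random.default_rng(2)
n_all, k = 500, 3
X = np.column_stack([np.ones(n_all), rng.normal(size=(n_all, k - 1))])
D = rng.integers(0, 2, size=n_all)        # hypothetical treatment indicator
w = rng.uniform(0.5, 2.0, size=n_all)     # hypothetical unit weights
w_D = w * D

sum_w_D = w_D.sum()
mean_w_D = w_D.mean()
weighted_X = w_D[:, None] * X

# Form (a): summed moment divided by the summed treated mass.
form_a = weighted_X.sum(axis=0) / sum_w_D
# Form (b): colMeans-style moment divided by the mean treated mass.
form_b = weighted_X.mean(axis=0) / mean_w_D
# Mixed form: colMeans-style moment divided by the summed treated mass.
mixed = weighted_X.mean(axis=0) / sum_w_D
```

Forms (a) and (b) agree, and the mixed form is smaller than both by a factor of `n_all`. Whether the mixed form is wrong in a particular port depends on the scaling convention of the terms it multiplies, which is the point under dispute in this thread.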

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. The unresolved SE/IF scaling issues above are not mitigated by the new tracking text because incorrect variance output is not deferrable under this review rubric. TODO.md:L57, docs/methodology/REGISTRY.md:L423

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new Phase 7a/7b tests still only assert finiteness, positivity, or that different methods return different SEs. They do not compare RCS covariate-adjusted IF/SE magnitudes to DRDID reference output, so both unresolved scaling bugs above can pass green. Concrete fix: add fixed-data regression tests for panel=False reg/ipw/dr with covariates that assert at least one ATT(g,t) analytical SE and one multiplier-bootstrap SE against DRDID reference output, including an unequal cohort-count case. tests/test_staggered_rc.py:L163, tests/test_staggered_rc.py:L362, tests/test_staggered_rc.py:L442, tests/test_survey_phase7a.py:L60, tests/test_survey_phase7a.py:L199, tests/test_survey_phase7a.py:L283
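A multiplier-bootstrap SE is linear in the influence-function vector it perturbs, which is why a bootstrap check cannot catch a scaling error that an analytical check misses. A minimal sketch, assuming Rademacher multipliers and a simple mean-of-products draw (both are illustrative choices, not necessarily what staggered_bootstrap.py does):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_boot = 400, 2000
inf_func = rng.normal(size=n)                    # hypothetical per-unit IF values
xi = rng.choice([-1.0, 1.0], size=(n_boot, n))   # Rademacher multipliers


def multiplier_se(phi):
    """Standard deviation of multiplier-bootstrap draws for one parameter."""
    draws = xi @ phi / n   # one perturbed mean per bootstrap replication
    return draws.std(ddof=1)


se_full = multiplier_se(inf_func)
se_shrunk = multiplier_se(0.5 * inf_func)  # IF under-scaled by one half
```

With the same multipliers, halving the IF halves the bootstrap SE exactly, so both analytical and bootstrap SEs need to be pinned to reference values rather than compared only to each other.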

Path to Approval

Fix _outcome_regression_rc so the control-side OLS nuisance correction matches DRDID::reg_did_rc exactly; the current port still carries an extra / n_all.
Fix _ipw_estimation_rc and _doubly_robust_rc so the PS nuisance moments match DRDID::std_ipw_did_rc and DRDID::drdid_rc exactly; remove the extra / n_all after control-weight normalization.
Add fixed-data regression tests for RCS reg / ipw / dr with covariates that check at least one analytical ATT(g,t) SE and one multiplier-bootstrap SE against DRDID reference output, including an unequal cohort-count case.
