We tested whether the practitioner guide documentation changes the behavior of
AI agents when performing Difference-in-Differences analysis. Each agent was
given documentation and an identical task prompt, then scored against a
correctness-aware rubric derived from Baker et al. (2025).

**Task prompt**: "Estimate the effect of a policy intervention using a staggered
difference-in-differences design using the load_mpdta() dataset."

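For concreteness, a minimal sketch of the task setup. The report names
`load_mpdta()` and the CS (Callaway-Sant'Anna) estimator but not the package,
so the import path, constructor signature, and `summary()` call below are
assumptions; the column names follow the mpdta dataset from the R `did` package.

```python
# Minimal task-setup sketch. Import path, constructor signature, and
# summary() are assumptions; only load_mpdta() and CS are named in this
# report.
from did_package import CS, load_mpdta  # hypothetical import path

data = load_mpdta()  # county-level teen employment panel (mpdta)

cs = CS(
    data,
    yname="lemp",         # log teen employment (outcome)
    tname="year",         # calendar time
    idname="countyreal",  # panel unit
    gname="first.treat",  # first treatment year (0 = never treated)
)
results = cs.fit()
print(results.summary())
```
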
**Before condition** (N=10): Agent receives only the original `docs/llms.txt`
(API reference with estimator list, diagnostics section, tutorial links).

**Final condition** (N=10): Agent receives the final post-review `docs/llms.txt`
(with practitioner workflow header and estimator-specific guidance).

**Model**: 1 Opus + 9 Sonnet (before), 10 Sonnet (final). All agents are fresh
instances with no shared context. The before arm includes one Opus run, which
scored 7/16 (below the Sonnet mean), so the model mix does not inflate the
reported improvement.

## Scoring Rubric: Correctness-Aware (0-2 per step, 16 total)

An earlier version of this evaluation used a rubric that scored workflow
adherence (did the agent do each step?) but not methodological correctness
(did they do it *correctly* for the design?). After iterating with AI code
review, we found that the original rubric inflated "before" scores by giving
full credit for running the wrong diagnostic (e.g., `check_parallel_trends()`
on staggered data) or referencing nonexistent API attributes.

The corrected rubric below scores correctness, not just presence:

| Step | Description | 0 | 1 (present but wrong) | 2 (correct for design) |
|------|-------------|---|---|---|
| S1 | Define target parameter | Not mentioned | Mentions ATT types | Explicitly defines weighted/unweighted, policy question |
| S2 | State assumptions | Not mentioned | Mentions parallel trends | Formally names PT variant (PT-GT-Nev etc.) |
| S3 | Test parallel trends | Not done | Generic `check_parallel_trends` on staggered (invalid) or informal eyeball | Correct test: CS event-study pre-periods for staggered, or `check_parallel_trends` for 2x2 |
| S4 | Choose estimator | Uses naive TWFE | Uses CS but no diagnostic | CS + Bacon diagnostic, explains choice |
| S5 | Estimate (inference) | No discussion | Mentions clustering generically | Prints `n_clusters`, data-driven branch on >= 50 |
| S6 | Sensitivity analysis | Not done | Wrong tool for design (`run_all_placebo_tests` on staggered, HonestDiD on unsupported type) | Correct tool: HonestDiD on CS (with event_study), spec comparisons for staggered |
| S7 | Heterogeneity | Not done | Attempts but wrong API (`aggregate=` on SA, `.att` on staggered) | Correct API for each estimator |
| S8 | Robustness | Not done | Compares estimators but code errors or missing covariates | 3 estimators + with/without covariates + runnable code |

## Results

### Overall Scores

| Condition | Mean | SD | SE | Min | Max |
|-----------|------|----|----|-----|-----|
| **Before** | **8.4** | 0.84 | 0.27 | 7 | 10 |
| **Final** | **15.55** | 0.47 | 0.15 | 15 | 16 |
| **Difference** | **+7.15** | | | | |

**Improvement: +85%** (Welch's t-test, p < 0.0001)

### Per-Step Comparison

| Step | Before | Final | Change | Interpretation |
|------|--------|-------|--------|----------------|
| S1: Target parameter | 1.0 | 2.0 | +1.0 | Now explicitly define weighted/unweighted target |
| S2: Assumptions | 1.0 | 2.0 | +1.0 | Now formally name PT variant |
| S3: Test PT | **1.0** | **2.0** | **+1.0** | Before: eyeballed event study (wrong for staggered). After: deliberate CS pre-period inspection |
| S4: Choose estimator | 2.0 | 2.0 | 0.0 | Already correct |
| S5: Inference | **1.0** | **2.0** | **+1.0** | Before: generic clustering. After: data-driven count check |
| S6: Sensitivity | **0.1** | **2.0** | **+1.9** | Biggest gap — 0/10 ran any before; 10/10 run correct tool after |
| S7: Heterogeneity | 1.4 | 2.0 | +0.6 | Before: some API mismatches. After: correct per-estimator APIs |
| S8: Robustness | 0.8 | 1.55 | +0.75 | Some final runs dropped BJS (3rd estimator) |

### Why the Correctness Rubric Matters

Under the original (adherence-only) rubric, the before condition scored 9.4/16.
Under the correctness rubric, it scores 8.4/16 — a 1.0 point drop because:

- **S3 dropped from 2.0 to 1.0**: Before agents were scored 2/2 for running
  `check_parallel_trends()` or eyeballing event study pre-periods. But on a
  staggered dataset, generic PT tests are methodologically invalid — they use
  a binary treatment split that lets early-cohort post-treatment observations
  contaminate the pre-trend comparison (the toy example after these bullets
  makes this concrete). The correctness rubric scores this as 1 (attempted
  but wrong).

|
74 | | -3. **Robustness (Step 8) nearly doubled**: Before, agents compared at most 2 |
75 | | - estimators. After, all agents compare 3 (CS, SA, BJS), and ~50% include |
76 | | - explicit with/without covariates comparisons. |
- **S5 stayed at 1.0, but the meaning changed**: the old rubric gave 1 for
  "mentions clustering", while the new rubric gives 1 for "clusters but
  doesn't check count" (same score, stricter interpretation of what 2
  requires).

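To make the contamination concrete, here is a toy pandas example; it does not
depend on the DiD library, and all names are illustrative:

```python
import pandas as pd

# Toy staggered panel: cohort "a" treated in 2004, cohort "b" in 2006,
# unit "n" never treated.
panel = pd.DataFrame({
    "unit": ["a"] * 4 + ["b"] * 4 + ["n"] * 4,
    "year": [2003, 2004, 2005, 2006] * 3,
    "first_treat": [2004] * 4 + [2006] * 4 + [0] * 4,
})
panel["ever_treated"] = panel["first_treat"] > 0
panel["post"] = panel["ever_treated"] & (panel["year"] >= panel["first_treat"])

# A generic 2x2 check splits units into ever-treated vs never-treated and
# compares their outcome trends over a common "pre" window (here: < 2006).
pre_window = panel[panel["year"] < 2006]
contaminated = pre_window[pre_window["ever_treated"] & pre_window["post"]]
print(contaminated[["unit", "year"]])
# -> unit "a" in 2004 and 2005: already post-treatment, yet counted as
#    "pre-trend" evidence for the pooled treated group.
```
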
The final condition scores remain high (15.55) because the review-driven
corrections ensured the guidance is estimator-specific and API-accurate.

### Key Findings

1. **Sensitivity analysis (Step 6) showed the largest improvement**: 0.1 to 2.0
   (+1.9 points). Before, 0/10 agents ran HonestDiD or any sensitivity tool.
   After, 10/10 ran HonestDiD with the required `aggregate='event_study'`
   aggregation and/or specification-based falsification (control group and
   anticipation comparisons) — the staggered-appropriate approach. See the
   first sketch after this list.

2. **Pre-trends testing (Step 3) improved in correctness, not just presence**:
   Before, agents all eyeballed event study coefficients informally. After,
   agents deliberately use CS event-study pre-period inspection as a distinct
   Step 3 diagnostic and correctly note that generic `check_parallel_trends()`
   is invalid for staggered designs (second sketch below).

3. **Inference became data-driven (Step 5)**: Before, all agents mentioned
   clustering but none checked the actual cluster count. After, 10/10 print
   `n_clusters = data[cluster_col].nunique()` and branch on >= 50 (third
   sketch below).

4. **The remaining gap is Step 8**: 6/10 final runs used only CS + SA, without
   BJS as a third estimator, and the with/without covariates comparison was
   present in only 7/10 runs. This correlates with prompt length — shorter
   doc prompts that don't list BJS prominently lead agents to skip it (fourth
   sketch below).

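The four sketches below illustrate the post-guide behaviors. None of them is
the library's documented API: this report names the functions and estimators
(CS, SA, BJS, `compute_honest_did`) but not the package, so import paths,
keyword names, and result attributes are assumptions. First, the Finding 1
pattern: HonestDiD-style sensitivity on event-study-aggregated CS results.

```python
# Finding 1 sketch: sensitivity analysis. The signature of
# compute_honest_did() is assumed; the report states it requires
# aggregate='event_study' on CS results.
from did_package import CS, compute_honest_did, load_mpdta  # hypothetical path

data = load_mpdta()
results = CS(data, yname="lemp", tname="year",
             idname="countyreal", gname="first.treat").fit()

# Relative-magnitudes question: how large a post-treatment parallel-trends
# violation (as a multiple m_bar of the largest pre-period violation) can
# be tolerated before the conclusion flips? The m_bar grid is illustrative.
for m_bar in (0.5, 1.0, 1.5, 2.0):
    honest = compute_honest_did(results, aggregate="event_study", m_bar=m_bar)
    print(m_bar, honest)
```
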
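Second, the Finding 2 pattern: a deliberate pre-period inspection on the CS
event study rather than a generic 2x2 test. The aggregation call and the
column names of the returned frame are assumptions.

```python
# Finding 2 sketch: reuses `data` and `results` from the sketch above.
es = results.aggregate("event_study").to_dataframe()  # assumed aggregation API

pre = es[es["event_time"] < 0]  # leads (pre-treatment periods) only
flagged = pre[(pre["conf_low"] > 0) | (pre["conf_high"] < 0)]
if len(flagged):
    print("Pre-period CIs excluding zero:\n", flagged)
else:
    print("No pre-period coefficient distinguishable from zero.")
```
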
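Third, the Finding 3 pattern: the cluster-count check. The `nunique()` line is
quoted in the report; the `vcov` options are assumptions about the estimator's
inference interface.

```python
# Finding 3 sketch: data-driven inference choice.
cluster_col = "countyreal"
n_clusters = data[cluster_col].nunique()  # quoted in the report

if n_clusters >= 50:
    # Enough clusters for conventional cluster-robust standard errors.
    inference = {"cluster": cluster_col}
else:
    # Few clusters: cluster-robust asymptotics are unreliable, so fall
    # back to a wild cluster bootstrap.
    inference = "wild_bootstrap"

results = CS(data, yname="lemp", tname="year", idname="countyreal",
             gname="first.treat").fit(vcov=inference)  # vcov keyword assumed
print(f"{n_clusters} clusters -> vcov={inference!r}")
```
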
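Fourth, the Finding 4 pattern: three estimators crossed with and without
covariates. The `xformla` keyword mirrors the R `did` convention and is an
assumption, as are the SA/BJS constructors and the overall aggregation.

```python
# Finding 4 sketch: robustness grid. SA and BJS are assumed to mirror the
# CS constructor; lpop (log county population) is an mpdta covariate.
from did_package import BJS, CS, SA  # hypothetical import path

kw = dict(yname="lemp", tname="year", idname="countyreal", gname="first.treat")
for name, Est in [("CS", CS), ("SA", SA), ("BJS", BJS)]:
    for xformla in (None, "~ lpop"):
        res = Est(data, xformla=xformla, **kw).fit()
        overall = res.aggregate("overall")  # assumed scalar summary
        print(f"{name:3s} covariates={xformla!r}: {overall}")
```
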
## Qualitative Observations

**Before agents** consistently:
- Called CS.fit(), printed summary, stopped
- Mentioned HonestDiD in prose but never ran it
- Eyeballed event study pre-periods informally (not a deliberate PT step)
- Referenced `.att` on staggered results (would throw AttributeError)
- Compared at most CS vs SA
- Never checked cluster count

**Final agents** consistently:
- Structured code around all 8 Baker steps explicitly
- Used CS event-study pre-periods as a deliberate Step 3 diagnostic
- Noted generic PT tests are invalid for staggered designs
- Ran `compute_honest_did()` with the required `aggregate='event_study'` aggregation
- Used specification-based falsification instead of `run_all_placebo_tests()`
- Checked cluster count and branched on >= 50
- Used `never_treated` as the control group (matching the library default)
- Called `practitioner_next_steps(results)`
- Used correct per-estimator APIs (`to_dataframe(level='cohort')` for SA)

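A sketch of the per-estimator heterogeneity pattern (rubric step S7) that the
final agents converged on. Only `to_dataframe(level='cohort')` and
`practitioner_next_steps()` are quoted in the report; everything else is an
assumption.

```python
# S7 sketch: heterogeneity via the correct per-estimator API.
from did_package import CS, SA, load_mpdta, practitioner_next_steps  # hypothetical

data = load_mpdta()
kw = dict(yname="lemp", tname="year", idname="countyreal", gname="first.treat")

cs_res = CS(data, **kw).fit()
print(cs_res.aggregate("group"))        # cohort-level ATTs (name assumed)
print(cs_res.aggregate("event_study"))  # dynamic effects (name assumed)
# Per the report, `.att` on the disaggregated staggered result would raise
# AttributeError; aggregate first, then read the scalar.

sa_res = SA(data, **kw).fit()
print(sa_res.to_dataframe(level="cohort"))  # quoted in the report

practitioner_next_steps(cs_res)  # checklist helper named in the report
```
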
## Conclusion

The practitioner guide increased analysis quality from **8.4/16 to 15.55/16
(+85%)** under a correctness-aware rubric. The improvement is larger than
originally reported (+85% vs +65%) because the corrected rubric reveals that
before-agents were producing methodologically incorrect analyses that the
original rubric failed to penalize.

Key results:
- **Sensitivity analysis** went from 0.1 to 2.0 (the biggest single-step gain)
- **Pre-trends testing** went from wrong-but-present (1.0) to correct (2.0)
- **Inference** went from generic to data-driven
- **Variance fell** from SD=0.84 to SD=0.47
- **Documentation changes were the primary intervention**, refined through
  15 rounds of AI code review to ensure estimator-specific accuracy.

Note: the before arm used a mixed model allocation (1 Opus + 9 Sonnet)
vs 10 Sonnet in the final arm, so the improvement is not isolated purely
to documentation; however, the Opus run scored below the Sonnet mean.