Clean, reproducible research pipeline for an applied micro / health & labour project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, fits HC3 OLS regressions with full diagnostics, and emits publication-ready tables and figures that the paper and slide deck consume directly.
The repo is managed with uv:

    uv sync

This creates .venv/, pins Python to >=3.11, and installs every dependency
listed in pyproject.toml against the locked uv.lock.
UKHLS microdata is restricted and gitignored:
| Path | Contents | Source |
|---|---|---|
| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence |
| `data/raw/frailty_long_panel.parquet` | Per-(pidp, wave) frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` |
| `data/raw/ukhls_demographic_panel.parquet` | Per-(pidp, wave) sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` |
| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage |
| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` |
The .dta directory can be deleted once ukhls_demographic_panel.parquet
exists; the pipeline reads only the parquet artefacts.
If data/raw/ukhls_demographic_panel.parquet doesn't exist yet, build it from
the UKHLS Stata files:

    uv run python -m src.pipeline.ingest_ukhls \
        --ukhls-dir data/ukhls \
        --output data/raw/ukhls_demographic_panel.parquet

This is a one-shot conversion (~30s on first run, 5 MB output). Once it's
done, you can `rm -rf data/ukhls/`.

    ./run.sh
    # equivalent to
    uv run python -m src.cli --config configs/config.yaml

Pipeline steps (defined in src/cli.py):

- `ingest`: reads the frailty and demographic parquets, merges them on (pidp, wave), renames keys, converts wave letters to integers, and computes lagged frailty.
- `cluster`: K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default), with labels mapped back to the full long panel.
- `estimate`: OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs.
- `report`: generates 5 tables (CSV + booktabs .tex) and 11 figures, all driven by src/analysis/figures.py and src/analysis/tables.py.
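
Two of the `ingest` transformations, wave-letter conversion and lagged frailty, can be illustrated with a small standard-library sketch. It is not the code in src/cli.py: the column names `frailty` and `frailty_lag` are assumptions, and so is the choice to define the lag as the value from the immediately preceding wave rather than the previous observed wave.

```python
from collections import defaultdict
from string import ascii_lowercase


def wave_to_int(letter: str) -> int:
    """Map a UKHLS wave letter to its ordinal: 'a' -> 1, 'b' -> 2, ..."""
    if len(letter) != 1 or letter.lower() not in ascii_lowercase:
        raise ValueError(f"not a wave letter: {letter!r}")
    return ascii_lowercase.index(letter.lower()) + 1


def add_lagged_frailty(rows: list[dict]) -> list[dict]:
    """Return copies of rows with integer waves and the previous wave's frailty.

    frailty_lag is None when the individual was not observed in wave - 1
    (an assumed convention, see the note above).
    """
    # Index frailty by person, then by integer wave, for constant-time lookups.
    by_person: dict[int, dict[int, float]] = defaultdict(dict)
    for row in rows:
        by_person[row["pidp"]][wave_to_int(row["wave"])] = row["frailty"]

    out = []
    for row in rows:
        wave = wave_to_int(row["wave"])
        lag = by_person[row["pidp"]].get(wave - 1)
        out.append({**row, "wave": wave, "frailty_lag": lag})
    return out


# Toy rows for illustration only; these are not UKHLS values.
panel = [
    {"pidp": 1, "wave": "a", "frailty": 0.10},
    {"pidp": 1, "wave": "b", "frailty": 0.15},
    {"pidp": 1, "wave": "d", "frailty": 0.20},  # wave c missing, so no lag
    {"pidp": 2, "wave": "b", "frailty": 0.05},
]
lagged = add_lagged_frailty(panel)
```

With pandas, the same idea is usually expressed as a sort on (pidp, wave) followed by a grouped shift, with extra care needed when waves are skipped.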
Tables (output/tables/ — both .csv and .tex):
| File | Contents |
|---|---|
| `tab01_summary_stats` | Counts, means, share-of-sample by health type |
| `tab02_frailty_by_wave` | Mean frailty by wave × type |
| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table |
| `tab04_employment_by_type` | Employment / unemployment / inactivity shares |
| `tab05_education_by_type` | Educational attainment shares (ages 50–60) |
Figures (output/figures/, all rendered with the custom paper style in
src/analysis/_style.py):
| File | Section | Contents |
|---|---|---|
| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave |
| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions |
| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis |
| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) |
| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type |
| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) |
| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel |
| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type |
| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification |
| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage |
| `fig11_mortality_by_age.png` | §2 | Mortality (frailty=1) by age |
Metrics: output/metrics/metrics.json — appended per-run with run-id, git
commit, cluster counts, regression diagnostics.
Log: output/logs/pipeline.log.
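
One way such a per-run metrics record could be appended is sketched below. The on-disk layout (a JSON list of run records) and the field names are assumptions; the actual schema of metrics.json is not documented here.

```python
import json
import subprocess
import uuid
from pathlib import Path


def current_commit() -> str:
    """Return the short git commit hash, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def append_run(path: Path, cluster_counts: dict, diagnostics: dict) -> dict:
    """Append one run record to the JSON list stored at path and return it."""
    # Assumed layout: the file holds a JSON list with one object per run.
    runs = json.loads(path.read_text()) if path.exists() else []
    record = {
        "run_id": uuid.uuid4().hex,       # hypothetical field names throughout
        "git_commit": current_commit(),
        "cluster_counts": cluster_counts,
        "diagnostics": diagnostics,
    }
    runs.append(record)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(runs, indent=2))
    return record
```

Rewriting the whole list on each run keeps the file valid JSON at all times, at the cost of reading it back first; a JSON Lines file would avoid the read but would no longer be a single JSON document.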
- LaTeX sources: paper/tex/ — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, lettered appendices A–C.
- Build: `./paper/build.sh` → paper/final/final.pdf (~21 pages, 1.5 MB). Auto-runs `./run.sh` first if any required figure/table is missing.
- Slide deck: paper/slides/seminar.tex, a 24-frame Beamer deck for a ~30-min seminar talk. Built via `./paper/slides/build.sh` → paper/final/slides/seminar.pdf. Uses the `primary-blue`/`primary-gold` palette from Preambles/header.tex.

Both build scripts run the full pipeline first if any required artefact is missing, so a fresh clone with UKHLS data in place is one command away from a compiled PDF.

    uv run pytest -q

The test suite (tests/test_pipeline.py) takes a 200-individual subsample of the frailty panel and runs the pipeline end-to-end against it, asserting schema, no-duplicate keys, valid frailty range, and expected outputs. When the UKHLS data isn't present (e.g. in CI on a public runner), the suite cleanly skips every test instead of erroring, so the build stays green.
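
The invariants listed above can be sketched as a plain function, with a helper that detects whether the restricted data is present. This is an illustration rather than the contents of tests/test_pipeline.py: the required column names and the [0, 1] bounds on the frailty index are assumptions.

```python
from pathlib import Path

FRAILTY_PANEL = Path("data/raw/frailty_long_panel.parquet")
REQUIRED_COLUMNS = {"pidp", "wave", "frailty"}  # assumed column names


def data_available(path: Path = FRAILTY_PANEL) -> bool:
    """True when the restricted frailty panel exists on this machine."""
    return path.exists()


def check_panel(rows: list[dict]) -> None:
    """Raise AssertionError if the panel breaks any of the three invariants."""
    for row in rows:
        # Schema: every row carries the required columns.
        missing = REQUIRED_COLUMNS - set(row)
        assert not missing, f"missing columns: {sorted(missing)}"
        # Range: frailty is assumed to be an index bounded by 0 and 1.
        assert 0.0 <= row["frailty"] <= 1.0, f"frailty out of range: {row}"
    # Keys: each (pidp, wave) pair appears at most once.
    keys = [(row["pidp"], row["wave"]) for row in rows]
    assert len(keys) == len(set(keys)), "duplicate (pidp, wave) keys"
```

In a pytest module, a guard of this kind can be applied to every test at once with `pytestmark = pytest.mark.skipif(not data_available(), reason="UKHLS data not available")`, which reports the tests as skipped rather than failed.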
.github/workflows/ci.yml runs a minimal smoke check on every push and PR, against Python 3.11 and 3.12:

- `uv sync`: catches dependency drift / lock-file inconsistencies.
- Import smoke: imports every module under src/ to catch syntax errors and broken cross-module references.
- `pytest -q`: collects the suite. Tests skip in CI because UKHLS data isn't available; what's verified is that pytest collection succeeds and the test code itself parses.

The CI is honest about what it can and can't check given a restricted-data project: dependency / import / syntax health, not full reproduction.
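
An import smoke check of the kind described above could look roughly like the following. The function name `import_all` is hypothetical, and the actual CI step may be implemented differently (for example as a shell one-liner).

```python
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    """Import a package and every submodule beneath it; return the names imported.

    A SyntaxError or ImportError raised by any module propagates to the caller,
    which is what makes the check fail.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # walk_packages recurses into subpackages and yields fully qualified names.
    for info in pkgutil.walk_packages(package.__path__, prefix=package.__name__ + "."):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported
```

Run from the repo root, `import_all("src")` would exercise every module under src/ without needing any UKHLS data, provided no module reads the data at import time.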
- Notebooks are archived in notebooks/archive/ for provenance only; the pipeline does not depend on them.
- All figures are generated with the custom matplotlib rcParams in src/analysis/_style.py; no `plt.style.use(...)` of any built-in theme is applied.
- The regression LaTeX table is rendered via Stargazer in src/pipeline/estimate.py; descriptive tables use `pandas.to_latex()` wrapped in booktabs in src/analysis/tables.py.