# predoc-coding-sample

Clean, reproducible research pipeline for an applied micro / health & labour project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, fits HC3 OLS regressions with full diagnostics, and emits publication-ready tables and figures that the paper and slide deck consume directly.


## Setup

The repo is managed with `uv`:

```sh
uv sync
```

This creates `.venv/`, pins Python to `>=3.11`, and installs every dependency listed in `pyproject.toml` against the locked `uv.lock`.

## Data

UKHLS microdata is restricted and gitignored:

| Path | Contents | Source |
| --- | --- | --- |
| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence |
| `data/raw/frailty_long_panel.parquet` | Per-`(pidp, wave)` frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` |
| `data/raw/ukhls_demographic_panel.parquet` | Per-`(pidp, wave)` sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` |
| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage |
| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` |

The `.dta` directory can be deleted once `ukhls_demographic_panel.parquet` exists; the pipeline reads only the parquet artefacts.

## One-time setup: build the demographic panel

If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from the UKHLS Stata files:

```sh
uv run python -m src.pipeline.ingest_ukhls \
  --ukhls-dir data/ukhls \
  --output data/raw/ukhls_demographic_panel.parquet
```

This is a one-shot conversion (~30 s on first run, 5 MB output). Once it's done, you can `rm -rf data/ukhls/`.
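Under the hood this is essentially a read-Stata, stack, write-parquet conversion. A minimal sketch of the idea — function name, column handling, and wave tagging are all hypothetical; the real logic lives in `src/pipeline/ingest_ukhls.py`:

```python
# Hypothetical sketch of the .dta -> parquet conversion: read each wave's
# *_indresp.dta file, tag it with its wave letter, and stack into one panel.
from pathlib import Path

import pandas as pd


def load_demographic_panel(ukhls_dir: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path(ukhls_dir).glob("*_indresp.dta")):
        wave = path.name.split("_")[0]  # 'a'..'m' wave prefix
        df = pd.read_stata(path, convert_categoricals=False)
        df["wave"] = wave
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# One-shot write; the .dta files are no longer needed afterwards:
# load_demographic_panel("data/ukhls").to_parquet(
#     "data/raw/ukhls_demographic_panel.parquet")
```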

## Run the pipeline

```sh
./run.sh
# equivalent to
uv run python -m src.cli --config configs/config.yaml
```

Pipeline steps (defined in `src/cli.py`):

1. **ingest** — reads the frailty and demographic parquets, merges on `(pidp, wave)`, renames keys, converts wave letters to integers, and computes lagged frailty.
2. **cluster** — K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default); labels are mapped back to the full long panel.
3. **estimate** — OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs.
4. **report** — generates 5 tables (CSV + booktabs `.tex`) and 11 figures, all driven by `src/analysis/figures.py` and `src/analysis/tables.py`.
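The ingest transforms can be sketched in a few lines of pandas — toy data and illustrative column names, assuming the conventions described in the Data table above:

```python
# Toy sketch of the ingest step: merge the two panels on (pidp, wave),
# convert wave letters to integers, and compute lagged frailty.
import pandas as pd

frailty = pd.DataFrame({
    "pidp":    [1, 1, 2, 2],
    "wave":    ["a", "b", "a", "b"],
    "frailty": [0.10, 0.15, 0.20, 0.22],
})
demo = pd.DataFrame({
    "pidp": [1, 1, 2, 2],
    "wave": ["a", "b", "a", "b"],
    "sex":  ["f", "f", "m", "m"],
})

panel = frailty.merge(demo, on=["pidp", "wave"], how="inner")
panel["wave"] = panel["wave"].map(lambda w: ord(w) - ord("a") + 1)  # 'a' -> 1
panel = panel.sort_values(["pidp", "wave"])
# Lagged frailty within each individual (NaN at the first observed wave).
panel["frailty_lag"] = panel.groupby("pidp")["frailty"].shift(1)
```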

## Outputs

Tables (`output/tables/` — both `.csv` and `.tex`):

| File | Contents |
| --- | --- |
| `tab01_summary_stats` | Counts, means, share of sample by health type |
| `tab02_frailty_by_wave` | Mean frailty by wave × type |
| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table |
| `tab04_employment_by_type` | Employment / unemployment / inactivity shares |
| `tab05_education_by_type` | Educational attainment shares (ages 50–60) |

Figures (`output/figures/`, all rendered with the custom paper style in `src/analysis/_style.py`):

| File | Section | Contents |
| --- | --- | --- |
| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave |
| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions |
| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis |
| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) |
| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type |
| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) |
| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel |
| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type |
| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification |
| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage |
| `fig11_mortality_by_age.png` | §2 | Mortality (frailty = 1) by age |

Metrics: `output/metrics/metrics.json` — appended per run with run ID, git commit, cluster counts, and regression diagnostics.
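The append-per-run pattern for `metrics.json` can be sketched like this — the record fields and helper name are illustrative, not the pipeline's actual schema:

```python
# Sketch: keep metrics.json as a list of run records and append one per run.
import json
from pathlib import Path


def append_metrics(path: str, record: dict) -> None:
    p = Path(path)
    runs = json.loads(p.read_text()) if p.exists() else []
    runs.append(record)
    p.write_text(json.dumps(runs, indent=2))
```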

Log: `output/logs/pipeline.log`.

## Paper

- LaTeX sources: `paper/tex/` — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, and lettered appendices A–C.
- Build: `./paper/build.sh` → `paper/final/final.pdf` (~21 pages, 1.5 MB). Auto-runs `./run.sh` first if any required figure or table is missing.
- Slide deck: `paper/slides/seminar.tex` — a 24-frame Beamer deck for a ~30-minute seminar talk, built via `./paper/slides/build.sh` → `paper/final/slides/seminar.pdf`. Uses the primary-blue/primary-gold palette from `Preambles/header.tex`.

Both build scripts run the full pipeline first if any required artefact is missing, so a fresh clone with UKHLS data in place is one command away from a compiled PDF.

## Tests

```sh
uv run pytest -q
```

The test suite (`tests/test_pipeline.py`) takes a 200-individual subsample of the frailty panel and runs the pipeline end-to-end against it, asserting the schema, no duplicate keys, a valid frailty range, and expected outputs. When the UKHLS data isn't present (e.g. in CI on a public runner) the suite cleanly skips every test instead of erroring, so the build stays green.
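The clean-skip behaviour is the standard `pytest.mark.skipif` gate applied module-wide; a minimal sketch — the path, test name, and assertion are illustrative, not the suite's actual checks:

```python
# Sketch of a data-gated test module: every test skips when the
# restricted parquet is absent, so CI on a public runner stays green.
from pathlib import Path

import pytest

DATA = Path("data/raw/frailty_long_panel.parquet")

# Module-level mark: the skip condition applies to every test below.
pytestmark = pytest.mark.skipif(
    not DATA.exists(), reason="UKHLS data not available (restricted)"
)


def test_frailty_in_unit_interval():
    import pandas as pd
    panel = pd.read_parquet(DATA)
    assert panel["frailty"].between(0, 1).all()
```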

## Continuous integration

`.github/workflows/ci.yml` runs a minimal smoke check on every push and PR, against Python 3.11 and 3.12:

1. `uv sync` — catches dependency drift and lock-file inconsistencies.
2. Import smoke — imports every module under `src/` to catch syntax errors and broken cross-module references.
3. `pytest -q` — collects the suite. Tests skip in CI because UKHLS data isn't available; what's verified is that pytest collection succeeds and the test code itself parses.
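The import-smoke step is a few lines with `pkgutil`; a self-contained sketch of the pattern, exercised here on the stdlib `json` package rather than `src/`:

```python
# Import every module in a package tree, surfacing syntax errors and
# broken cross-module imports immediately.
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    pkg = importlib.import_module(package_name)
    imported = [pkg.__name__]
    for info in pkgutil.walk_packages(pkg.__path__, prefix=pkg.__name__ + "."):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported
```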

The CI is honest about what it can and can't check given a restricted-data project: dependency / import / syntax health, not full reproduction.
