# predoc-coding-sample

Clean, reproducible research pipeline for an applied micro / health & labour project on UKHLS data. The pipeline identifies latent health types via K-Means clustering on frailty trajectories, fits HC3 OLS regressions with full diagnostics, and emits publication-ready tables and figures that the paper and slide deck consume directly.


## Setup

The repo is managed with `uv`:

```sh
uv sync
```

This creates `.venv/`, pins Python to `>=3.11`, and installs every dependency listed in `pyproject.toml` against the locked `uv.lock`.

## Data

UKHLS microdata is restricted and gitignored:

| Path | Contents | Source |
| --- | --- | --- |
| `data/ukhls/{a..m}_indresp.dta` | Raw UKHLS Stata files | UK Data Service licence |
| `data/raw/frailty_long_panel.parquet` | Per-`(pidp, wave)` frailty index, age, death, raw `hcond*` flags | `notebooks/archive/frailty_main.ipynb` |
| `data/raw/ukhls_demographic_panel.parquet` | Per-`(pidp, wave)` sex, education, labour-force status, hourly pay | `python -m src.pipeline.ingest_ukhls` |
| `data/raw/death_data_long_panel.parquet` | Death records | UKHLS death linkage |
| `data/derived/` | Intermediate pipeline outputs (regenerated each run) | `./run.sh` |

The `.dta` directory can be deleted once `ukhls_demographic_panel.parquet` exists; the pipeline reads only the parquet artefacts.

## One-time setup: build the demographic panel

If `data/raw/ukhls_demographic_panel.parquet` doesn't exist yet, build it from the UKHLS Stata files:

```sh
uv run python -m src.pipeline.ingest_ukhls \
  --ukhls-dir data/ukhls \
  --output data/raw/ukhls_demographic_panel.parquet
```

This is a one-shot conversion (~30 s on first run, 5 MB output). Once it's done, you can `rm -rf data/ukhls/`.
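Under the hood this is essentially a read-Stata, stack, write-parquet conversion. A minimal sketch of the idea — function name, column handling, and wave tagging are all hypothetical; the real logic lives in `src/pipeline/ingest_ukhls.py`:

```python
# Hypothetical sketch of the .dta -> parquet conversion: read each wave's
# *_indresp.dta file, tag it with its wave letter, and stack into one panel.
from pathlib import Path

import pandas as pd


def load_demographic_panel(ukhls_dir: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path(ukhls_dir).glob("*_indresp.dta")):
        wave = path.name.split("_")[0]  # 'a'..'m' wave prefix
        df = pd.read_stata(path, convert_categoricals=False)
        df["wave"] = wave
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# One-shot write; the .dta files are no longer needed afterwards:
# load_demographic_panel("data/ukhls").to_parquet(
#     "data/raw/ukhls_demographic_panel.parquet")
```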

## Run the pipeline

```sh
./run.sh
# equivalent to
uv run python -m src.cli --config configs/config.yaml
```

Pipeline steps (defined in `src/cli.py`):

1. **ingest** — reads the frailty and demographic parquets, merges on `(pidp, wave)`, renames keys, converts wave letters to integers, and computes lagged frailty.
2. **cluster** — K-Means on per-individual frailty trajectories at ages 50–60 (k=3 by default); labels are mapped back to the full long panel.
3. **estimate** — OLS of frailty on demographic controls, with cluster dummies. HC3 SEs, joint Wald F-test, Breusch–Pagan, Durbin–Watson, VIFs.
4. **report** — generates 5 tables (CSV + booktabs `.tex`) and 11 figures, all driven by `src/analysis/figures.py` and `src/analysis/tables.py`.
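The ingest transforms can be sketched in a few lines of pandas — toy data and illustrative column names, assuming the conventions described in the Data table above:

```python
# Toy sketch of the ingest step: merge the two panels on (pidp, wave),
# convert wave letters to integers, and compute lagged frailty.
import pandas as pd

frailty = pd.DataFrame({
    "pidp":    [1, 1, 2, 2],
    "wave":    ["a", "b", "a", "b"],
    "frailty": [0.10, 0.15, 0.20, 0.22],
})
demo = pd.DataFrame({
    "pidp": [1, 1, 2, 2],
    "wave": ["a", "b", "a", "b"],
    "sex":  ["f", "f", "m", "m"],
})

panel = frailty.merge(demo, on=["pidp", "wave"], how="inner")
panel["wave"] = panel["wave"].map(lambda w: ord(w) - ord("a") + 1)  # 'a' -> 1
panel = panel.sort_values(["pidp", "wave"])
# Lagged frailty within each individual (NaN at the first observed wave).
panel["frailty_lag"] = panel.groupby("pidp")["frailty"].shift(1)
```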

## Outputs

Tables (`output/tables/` — both `.csv` and `.tex`):

| File | Contents |
| --- | --- |
| `tab01_summary_stats` | Counts, means, share of sample by health type |
| `tab02_frailty_by_wave` | Mean frailty by wave × type |
| `tab03_main_regression` | Stargazer-rendered basic vs. full OLS table |
| `tab04_employment_by_type` | Employment / unemployment / inactivity shares |
| `tab05_education_by_type` | Educational attainment shares (ages 50–60) |

Figures (`output/figures/`, all rendered with the custom paper style in `src/analysis/_style.py`):

| File | Section | Contents |
| --- | --- | --- |
| `fig01_frailty_trajectories.png` | Appx B | Frailty trajectories by wave |
| `fig02_frailty_distribution.png` | Appx B | Within-cluster frailty distributions |
| `fig03_cluster_diagnostics.png` | §3 | Elbow + silhouette twin-axis |
| `fig04_frailty_by_age.png` | §4.1 | Binned mean frailty by age × type (headline) |
| `fig05_employment_by_type.png` | §4.2 | Employment rate by age × type |
| `fig06_earnings_by_type.png` | §4.2 | Mean hourly pay by age × type (95% CIs) |
| `fig07_frailty_by_age_scatter.png` | §2 | Binned mean frailty by age, full panel |
| `fig08_education_by_type.png` | §4.3 | Stacked-bar education shares by type |
| `fig09_pay_by_education.png` | §4.3 | Mean hourly pay by qualification |
| `fig10_healthcond_variables_by_wave.png` | §2 | UKHLS variable-family wave coverage |
| `fig11_mortality_by_age.png` | §2 | Mortality (frailty = 1) by age |

Metrics: `output/metrics/metrics.json` — appended per run with run ID, git commit, cluster counts, and regression diagnostics.
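The append-per-run pattern for `metrics.json` can be sketched like this — the record fields and helper name are illustrative, not the pipeline's actual schema:

```python
# Sketch: keep metrics.json as a list of run records and append one per run.
import json
from pathlib import Path


def append_metrics(path: str, record: dict) -> None:
    p = Path(path)
    runs = json.loads(p.read_text()) if p.exists() else []
    runs.append(record)
    p.write_text(json.dumps(runs, indent=2))
```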

Log: `output/logs/pipeline.log`.

## Paper

- LaTeX sources: `paper/tex/` — working-paper layout with title page (JEL J14, J21, J31, I12, C38; keywords), abstract, body sections 1–7, and lettered appendices A–C.
- Build: `./paper/build.sh` → `paper/final/final.pdf` (~21 pages, 1.5 MB). Auto-runs `./run.sh` first if any required figure or table is missing.
- Slide deck: `paper/slides/seminar.tex` — a 24-frame Beamer deck for a ~30-minute seminar talk, built via `./paper/slides/build.sh` → `paper/final/slides/seminar.pdf`. Uses the primary-blue/primary-gold palette from `Preambles/header.tex`.

Both build scripts run the full pipeline first if any required artefact is missing, so a fresh clone with UKHLS data in place is one command away from a compiled PDF.

## Tests

```sh
uv run pytest -q
```

The test suite (`tests/test_pipeline.py`) takes a 200-individual subsample of the frailty panel and runs the pipeline end-to-end against it, asserting the schema, no duplicate keys, a valid frailty range, and expected outputs. When the UKHLS data isn't present (e.g. in CI on a public runner) the suite cleanly skips every test instead of erroring, so the build stays green.
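The clean-skip behaviour is the standard `pytest.mark.skipif` gate applied module-wide; a minimal sketch — the path, test name, and assertion are illustrative, not the suite's actual checks:

```python
# Sketch of a data-gated test module: every test skips when the
# restricted parquet is absent, so CI on a public runner stays green.
from pathlib import Path

import pytest

DATA = Path("data/raw/frailty_long_panel.parquet")

# Module-level mark: the skip condition applies to every test below.
pytestmark = pytest.mark.skipif(
    not DATA.exists(), reason="UKHLS data not available (restricted)"
)


def test_frailty_in_unit_interval():
    import pandas as pd
    panel = pd.read_parquet(DATA)
    assert panel["frailty"].between(0, 1).all()
```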

## Continuous integration

`.github/workflows/ci.yml` runs a minimal smoke check on every push and PR, against Python 3.11 and 3.12:

1. `uv sync` — catches dependency drift and lock-file inconsistencies.
2. Import smoke — imports every module under `src/` to catch syntax errors and broken cross-module references.
3. `pytest -q` — collects the suite. Tests skip in CI because UKHLS data isn't available; what's verified is that pytest collection succeeds and the test code itself parses.
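The import-smoke step is a few lines with `pkgutil`; a self-contained sketch of the pattern, exercised here on the stdlib `json` package rather than `src/`:

```python
# Import every module in a package tree, surfacing syntax errors and
# broken cross-module imports immediately.
import importlib
import pkgutil


def import_all(package_name: str) -> list[str]:
    pkg = importlib.import_module(package_name)
    imported = [pkg.__name__]
    for info in pkgutil.walk_packages(pkg.__path__, prefix=pkg.__name__ + "."):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported
```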

The CI is honest about what it can and can't check given a restricted-data project: dependency / import / syntax health, not full reproduction.
