Build a harmonized DuckDB database from 40+ years of the Home Mortgage Disclosure Act (HMDA) data in a single command. Modeled after Paul Goldsmith-Pinkham's IPEDS database pipeline.
```sh
uv sync
uv run python build_database.py --fresh
```

The result is hmda.duckdb (~27 GB) containing 685 million rows across four
era-specific LAR tables — one aggregate table (1981–1989, bank/census-tract
level) and three application-level tables (1990–2006, 2007–2017, 2018–2024) —
plus two lender tables.
Following the approach demonstrated in Paul's Substack post on Large Datasets and Structured Databases: Claude Code for Economists and his Markus Academy videos (where he showcases this workflow with HMDA data), the pipeline uses DuckDB-native CSV reading for the three large eras (1990+), streaming data directly from CSV into the database without loading it into memory. This keeps RAM usage low even for years with 40M+ rows, in contrast to the high peak RAM required to hold the same data in pandas DataFrames. Pandas is used only for the small files (1981-1989 aggregates and lender Excel files).
Getting Started
Database Schema
- Tables — era-specific tables, row counts, schema differences, column renames, loan amount units
Usage
- Common Sample Filters — the 3 CFPB filter levels with SQL for each era
- Exporting to Stata — Parquet, CSV, aggregated export
Reference
- Data Sources — source URLs
- Known Issues — bad ZIPs, lender linkage, CFPB vintages
- Adding New Years — how to update for new CFPB releases
Operations
- Project Structure
- Building and Resuming — fresh/resume, build times, quality checks, troubleshooting
- Requirements
Prerequisites: Python 3.11+ and uv.
```sh
uv sync
uv run python build_database.py --fresh
```

The CFPB and Philadelphia Fed data is auto-downloaded and cached in data/raw/.
ICPSR data (1981-2006) must be pre-downloaded manually (requires ICPSR login)
and placed in data/raw/icpsr/. See
ICPSR project 151921.
HMDA data spans 40+ years with fundamentally different schemas across eras —
different columns, different lender identifiers, and different variable coding.
Rather than forcing everything into one table with hundreds of mostly-NULL
columns, the pipeline stores each era separately and provides a cross-era
lar view over the common columns.
| Table | Description | Lender ID | Years | ~Rows/yr | Total |
|---|---|---|---|---|---|
| lar_1981_1989 | ICPSR LAR (bank/tract aggregates) | respondent_id + agency_code | 1981-1989 | 0.4-0.6M | 4.6M |
| lar_1990_2006 | ICPSR LAR (application-level) | respondent_id + agency_code | 1990-2006 | 7-42M | 368.6M |
| lar_2007_2017 | CFPB legacy LAR (codes only) | respondent_id + agency_code | 2007-2017 | 12-27M | 187.5M |
| lar_2018_onward | CFPB Snapshot LAR | LEI | 2018-2024 | 11-26M | 124.1M |
| lender_legacy | Philly Fed lender file (Avery File) | HMPRID + CODE | 1990-2017 | -- | 230K |
| lender_current | Philly Fed lender file | LEI | 2018-present | -- | 35K |
| lar (view) | Cross-era union of 26 common columns (application-level only) | -- | 1990-present | -- | 680M |
| _metadata | Table inventory | -- | -- | -- | -- |
Every LAR table has activity_year for filtering by year. Lender tables
use year. Variable definitions for the lender file are in the
Philly Fed documentation.
| Feature | lar_1981_1989 | lar_1990_2006 | lar_2007_2017 | lar_2018_onward |
|---|---|---|---|---|
| Granularity | Bank x census tract | Application | Application | Application |
| Lender ID | respondent_id + agency_code | respondent_id + agency_code | respondent_id + agency_code | LEI |
| Key columns | num_govt, vol_conv, etc. | action_taken, loan_type, etc. | Same as 1990-2006 | derived_race, derived_sex, etc. |
| Dwelling type filter | N/A | property_type (integer, 2004+ only) | property_type (integer) | derived_dwelling_category (STRING) |
| Occupancy filter | N/A | owner_occupancy (integer) | owner_occupancy (integer) | occupancy_type (integer) |
| Race/ethnicity | N/A | applicant_race_1 (code) | applicant_race_1 (code) | derived_race (string) |
| Columns | ~25 | ~23 (1990-2003); ~36 (2004-2006) | ~25 | ~99 |
The pipeline harmonizes column names that changed between years:
| Era | Original column | Renamed to | Notes |
|---|---|---|---|
| 1990-2003 (ICPSR) | occupancy_type | owner_occupancy | Name standardized |
| 2004-2006 (ICPSR) | occupancy | owner_occupancy | Same concept, shorter name in source |
| 1990-2006 (ICPSR) | loan_amount | loan_amount_000s | Values are in thousands; renamed to match unit |
| 2007-2017 (CFPB) | as_of_year | activity_year | Standardized across eras |
| 2018 (CFPB) | loan_to_value_ratio | combined_loan_to_value_ratio | See note below |
Note on combined_loan_to_value_ratio: The 2018 CSV shipped with a
shortened column name (loan_to_value_ratio); 2019+ uses the official name per
CFPB LAR data fields.
We standardize to the official name.
Loan amount units differ across eras:
| Table | Column | Unit |
|---|---|---|
| lar_1990_2006 | loan_amount_000s | Thousands of dollars |
| lar_2007_2017 | loan_amount_000s | Thousands of dollars |
| lar_2018_onward | loan_amount | Dollars |
When querying across eras via the lar view, multiply loan_amount_000s * 1000
to get dollars, or divide loan_amount / 1000 to get thousands.
The CFPB CSV files for 2007-2017 include both code columns (e.g., action_taken = 1)
and label columns (e.g., action_taken_name = "Loan originated"). All *_name and
*_abbr columns are dropped -- they are verified 1-to-1 with the integer codes
per the CFPB data dictionary.
Exception: county_name is kept because the county_name / county_code
mapping is not 1-to-1.
CFPB Snapshot data uses the strings "NA" and "Exempt" for missing/not-applicable
values. These are converted to NULL in the database.
These match the three filter levels published by CFPB on their
historic data page
(verified exact-match in --quality step 1 for 2007-2017).
All applications, denials, originations, and institutional purchases.
No WHERE clause needed -- use the full table.
```sql
-- All eras
WHERE action_taken = 1
```

```sql
-- 1990-2003 (lar_1990_2006): property_type and lien_status not available
WHERE action_taken = 1 AND owner_occupancy = 1

-- 2004-2006 (lar_1990_2006)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2007-2017 (lar_2007_2017)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2018+ (lar_2018_onward): note STRING filter for dwelling category
WHERE action_taken = 1
  AND derived_dwelling_category = 'Single Family (1-4 Units):Site-Built'
  AND lien_status = 1 AND occupancy_type = 1
```

Note: CFPB labels this filter "Mortgages for first lien, owner-occupied,
1-4 family" but it implicitly requires action_taken = 1 (originations only).
```sql
COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
      AND state_code = 36
) TO 'lar_2015_ny.parquet' (FORMAT PARQUET);
```

Then in Stata (requires the pq package -- ssc install pq):

```stata
pq use "lar_2015_ny.parquet", clear
```

```sql
COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
) TO 'lar_2015.csv' (HEADER, DELIMITER ',');
```

```stata
import delimited "lar_2015.csv", clear
```

```sql
COPY (
    SELECT respondent_id, agency_code, county_code, activity_year,
           SUM(CASE WHEN action_taken = 1 THEN loan_amount_000s ELSE 0 END) AS amt_orig,
           COUNT(CASE WHEN action_taken = 1 THEN 1 END) AS num_orig,
           COUNT(*) AS apps
    FROM lar_2007_2017
    WHERE property_type = 1
      AND lien_status = 1
      AND owner_occupancy = 1
      AND action_taken NOT IN (4, 5)
    GROUP BY respondent_id, agency_code, county_code, activity_year
) TO 'hmda_agg.parquet' (FORMAT PARQUET);
```

| Source | URL | Coverage |
|---|---|---|
| ICPSR (Andrew Forrester) | https://www.openicpsr.org/openicpsr/project/151921/version/V1/view | LAR 1981-2006 |
| NARA (official flat files) | https://catalog.archives.gov/id/2456161 | LAR 1981-2014 |
| CFPB historic data | https://www.consumerfinance.gov/data-research/hmda/historic-data/ | LAR 2007-2017 |
| CFPB Snapshot (used by pipeline) | https://ffiec.cfpb.gov/data-browser/ | LAR 2018+ (Snapshot vintage) |
| CFPB static data FAQ | https://ffiec.cfpb.gov/documentation/faq/static-dataset-faq | Explains Snapshot vs. One-Year vs. Three-Year vintages |
| CFPB LAR data fields | https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields | Variable dictionary |
| Philadelphia Fed lender file | https://www.philadelphiafed.org/surveys-and-data/consumer-finance-data/home-mortgage-disclosure-act-lender-file | Lender 1990-present |
- 2002-2006 ICPSR ZIPs: Have bad CRC headers. The build script falls back to
  system unzip (which is more lenient) when Python's zipfile module fails.
- Lender file linkage: The two lender tables (legacy and current) are not yet
  linked across the 2018 respondent_id-to-LEI transition. Use the
  ARID2017-to-LEI reference table from the CFPB Snapshot for bridging.
- CFPB data vintages (2018 onwards): CFPB publishes three dataset vintages
  (Snapshot, One-Year, Three-Year) that differ by when corrections are
  captured. The pipeline downloads from the Snapshot endpoint using a uniform
  filename ({year}_public_lar_csv.zip). For older years, the Snapshot file may
  have been replaced by the more complete Three-Year or One-Year version at
  the same URL. The pipeline does not currently support choosing between
  vintages. If a specific vintage is needed, manually download the ZIP and
  place it in data/raw/cfpb_snapshot/{year}/ -- the pipeline will skip the
  download and use the cached file.
When CFPB publishes a new year's Snapshot (typically ~1 month after the April 1
filing deadline), update two constants in build_database.py:
```python
CFPB_SNAPSHOT_YEARS = list(range(2018, 2026))  # bump end year
CFPB_SNAPSHOT_ZIPS: dict[int, str] = {
    y: f"{y}_public_lar_csv.zip" for y in range(2018, 2026)  # same
}
```

Also update EXPECTED["lar_2018_onward"]["years"] to extend the expected range.
Then rebuild just that era:
```sh
caffeinate -i uv run python build_database.py 2018_onward
uv run python build_database.py --validate
```

For the lender file, the Philadelphia Fed updates both Excel files periodically.
Delete the cached files in data/lender/ and rebuild:

```sh
rm data/lender/*.xlsx
caffeinate -i uv run python build_database.py lender
```

hmda_database/                     # GitHub repo
├── build_database.py # Python ETL pipeline
├── pyproject.toml # Python dependencies
├── QUALITY_CHECKS.md # Data quality check documentation
├── LICENSE # MIT
└── README.md # This file
data/ # Data directory (gitignored)
├── raw/
│ ├── icpsr/ # Pre-downloaded: ICPSR ZIPs (1981-2006)
│ ├── cfpb_legacy/ # Auto-downloaded: 2007-2017 LAR ZIPs
│ └── cfpb_snapshot/ # Auto-downloaded: 2018+ LAR ZIPs
├── lender/ # Auto-downloaded: Philly Fed Excel files
├── logs/ # Timestamped build and validation logs
hmda.duckdb # Output database (~27 GB)
The build is resumable: it only drops the tables you specify, preserving
completed ones. Use --fresh to delete the entire database and start over.
On macOS, prefix with caffeinate -i to prevent idle sleep during long builds.
```sh
uv run python build_database.py --status                 # check progress
caffeinate -i uv run python build_database.py --fresh    # full fresh build
caffeinate -i uv run python build_database.py 2007_2017  # rebuild one era
uv run python build_database.py --validate               # check year coverage
uv run python build_database.py --quality                # detailed quality checks
```

Build times (Apple M4 Pro / 48 GB RAM, cached downloads):
| Era | Time |
|---|---|
| lender | ~10 sec |
| 1981_1989 | ~15 sec |
| 1990_2006 | ~3 min |
| 2007_2017 | ~5 min (~30 min first download) |
| 2018_onward | ~11 min (~20 min first download) |
| Full rebuild | ~20 min |
Storage by era:
| Era | Raw downloads | DB rows | Notes |
|---|---|---|---|
| 1981_1989 + 1990_2006 | 29 GB | 373M | ICPSR; pre-downloaded, not auto-fetched |
| 2007_2017 | 151 GB | 187M | ZIPs + extracted CSVs cached |
| 2018_onward | 51 GB | 124M | ZIPs + extracted CSVs cached |
| lender | 81 MB | 265K | Two Excel files |
| Total | 231 GB | 685M | DB: 27 GB |
Logs are saved to data/logs/ (build_*.log, validate_*.log, quality_*.log).
Quality checks: Run uv run python build_database.py --quality for
detailed data validation (row counts benchmarked against CFPB published counts,
column coverage, categorical distributions, loan amount units, lender join rates,
cross-era boundary consistency). See QUALITY_CHECKS.md
for documentation of each check step.
Troubleshooting: Build one era at a time to isolate issues. Check the log
for the specific error. Common causes: CFPB/Philly Fed URL changes (404),
ICPSR ZIP filenames not matching HMDA_LAR_YYYY.zip, or lender Excel column
renames.
- Python 3.11+
- uv (or pip: pip install duckdb pandas requests openpyxl)
- ICPSR data pre-downloaded (requires login)
- ~260 GB disk space (231 GB raw ZIPs + 27 GB database)
- Internet connection for CFPB and Philly Fed downloads
Code: MIT. Data: see individual source licenses above. HMDA data is collected under federal regulation and publicly available from CFPB. ICPSR data requires an account but is free for academic use.