HMDA Database

Build a harmonized DuckDB database from 40+ years of Home Mortgage Disclosure Act (HMDA) data in a single command. Modeled after Paul Goldsmith-Pinkham's IPEDS database pipeline.

uv sync
uv run python build_database.py --fresh

The result is hmda.duckdb (~27 GB) containing 685 million rows across four era-specific LAR tables — one aggregate table (1981–1989, bank/census-tract level) and three application-level tables (1990–2006, 2007–2017, 2018–2024) — plus two lender tables.

Following the approach demonstrated in Paul's Substack post (Large Datasets and Structured Databases: Claude Code for Economists) and Markus Academy videos, where he showcases this workflow with HMDA data, the pipeline uses DuckDB-native CSV reading for the three large eras (1990+), streaming rows directly from CSV into the database without loading them into memory. This keeps RAM usage low even for years with 40M+ rows, in contrast to the high peak RAM required to hold the same data in pandas DataFrames. Pandas is used only for the small files: the 1981-1989 aggregates and the lender Excel files.

Table of Contents

  • Quick Start
  • Database Schema
      • Tables — era-specific tables, row counts, schema differences, column renames, loan amount units
  • Usage
  • Reference
  • Operations


Quick Start

Prerequisites: Python 3.11+ and uv.

uv sync
uv run python build_database.py --fresh

CFPB and Philadelphia Fed data are auto-downloaded and cached in data/raw/. ICPSR data (1981-2006) must be downloaded manually beforehand (it requires an ICPSR login) and placed in data/raw/icpsr/. See ICPSR project 151921.


Database Schema

Tables

HMDA data spans 40+ years with fundamentally different schemas across eras — different columns, different lender identifiers, and different variable coding. Rather than forcing everything into one table with hundreds of mostly-NULL columns, the pipeline stores each era separately and provides a cross-era lar view over the common columns.

| Table | Description | Lender ID | Years | ~Rows/yr | Total |
|---|---|---|---|---|---|
| lar_1981_1989 | ICPSR LAR (bank/tract aggregates) | respondent_id + agency_code | 1981-1989 | 0.4-0.6M | 4.6M |
| lar_1990_2006 | ICPSR LAR (application-level) | respondent_id + agency_code | 1990-2006 | 7-42M | 368.6M |
| lar_2007_2017 | CFPB legacy LAR (codes only) | respondent_id + agency_code | 2007-2017 | 12-27M | 187.5M |
| lar_2018_onward | CFPB Snapshot LAR | LEI | 2018-2024 | 11-26M | 124.1M |
| lender_legacy | Philly Fed lender file (Avery File) | HMPRID + CODE | 1990-2017 | -- | 230K |
| lender_current | Philly Fed lender file | LEI | 2018-present | -- | 35K |
| lar (view) | Cross-era union of 26 common columns (application-level only) | -- | 1990-present | -- | 680M |
| _metadata | Table inventory | -- | -- | -- | -- |
Every LAR table has activity_year for filtering by year. Lender tables use year. Variable definitions for the lender file: Philly Fed documentation.

Schema differences across eras

| Feature | lar_1981_1989 | lar_1990_2006 | lar_2007_2017 | lar_2018_onward |
|---|---|---|---|---|
| Granularity | Bank x census tract | Application | Application | Application |
| Lender ID | respondent_id + agency_code | respondent_id + agency_code | respondent_id + agency_code | LEI |
| Key columns | num_govt, vol_conv, etc. | action_taken, loan_type, etc. | Same as 1990-2006 | derived_race, derived_sex, etc. |
| Dwelling type filter | N/A | property_type (integer, 2004+ only) | property_type (integer) | derived_dwelling_category (STRING) |
| Occupancy filter | N/A | owner_occupancy (integer) | owner_occupancy (integer) | occupancy_type (integer) |
| Race/ethnicity | N/A | applicant_race_1 (code) | applicant_race_1 (code) | derived_race (string) |
| Columns | ~25 | ~23 (1990-2003); ~36 (2004-2006) | ~25 | ~99 |

Column renames across years

The pipeline harmonizes column names that changed between years:

| Era | Original column | Renamed to | Notes |
|---|---|---|---|
| 1990-2003 (ICPSR) | occupancy_type | owner_occupancy | Name standardized |
| 2004-2006 (ICPSR) | occupancy | owner_occupancy | Same concept, shorter name in source |
| 1990-2006 (ICPSR) | loan_amount | loan_amount_000s | Values are in thousands; renamed to match unit |
| 2007-2017 (CFPB) | as_of_year | activity_year | Standardized across eras |
| 2018 (CFPB) | loan_to_value_ratio | combined_loan_to_value_ratio | See note below |

Note on combined_loan_to_value_ratio: The 2018 CSV shipped with a shortened column name (loan_to_value_ratio); 2019+ uses the official name per CFPB LAR data fields. We standardize to the official name.

Loan amount units differ across eras:

| Table | Column | Unit |
|---|---|---|
| lar_1990_2006 | loan_amount_000s | Thousands of dollars |
| lar_2007_2017 | loan_amount_000s | Thousands of dollars |
| lar_2018_onward | loan_amount | Dollars |

When querying across eras via the lar view, multiply loan_amount_000s by 1000 to get dollars, or divide loan_amount by 1000 to get thousands.

Label column handling (2007-2017)

The CFPB CSV files for 2007-2017 include both code columns (e.g., action_taken = 1) and label columns (e.g., action_taken_name = "Loan originated"). All *_name and *_abbr columns are dropped -- they are verified 1-to-1 with the integer codes per the CFPB data dictionary.

Exception: county_name is kept because the county_name / county_code mapping is not 1-to-1.

Missing values (2018+)

CFPB Snapshot data uses the strings "NA" and "Exempt" for missing/not-applicable values. These are converted to NULL in the database.


Usage

Common Sample Filters

These match the three filter levels published by CFPB on their historic data page (verified exact-match in --quality step 1 for 2007-2017).

All records (no filter)

All applications, denials, originations, and institutional purchases. No WHERE clause needed -- use the full table.

All originations

-- All eras
WHERE action_taken = 1

Originations for first-lien, owner-occupied, single-family homes

-- 1990-2003 (lar_1990_2006): property_type and lien_status not available
WHERE action_taken = 1 AND owner_occupancy = 1

-- 2004-2006 (lar_1990_2006)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2007-2017 (lar_2007_2017)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2018+ (lar_2018_onward): note STRING filter for dwelling category
WHERE action_taken = 1
  AND derived_dwelling_category = 'Single Family (1-4 Units):Site-Built'
  AND lien_status = 1 AND occupancy_type = 1

Note: CFPB labels this filter "Mortgages for first lien, owner-occupied, 1-4 family" but it implicitly requires action_taken = 1 (originations only).

Exporting to Stata

Method A: via Parquet (recommended)

COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
      AND state_code = 36
) TO 'lar_2015_ny.parquet' (FORMAT PARQUET);

Then in Stata (requires the pq package -- ssc install pq):

pq use "lar_2015_ny.parquet", clear

Method B: via CSV (fallback)

COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
) TO 'lar_2015.csv' (HEADER, DELIMITER ',');
import delimited "lar_2015.csv", clear

Method C: Aggregated export

COPY (
    SELECT respondent_id, agency_code, county_code, activity_year,
           SUM(CASE WHEN action_taken = 1 THEN loan_amount_000s ELSE 0 END) AS amt_orig,
           COUNT(CASE WHEN action_taken = 1 THEN 1 END) AS num_orig,
           COUNT(*) AS apps
    FROM   lar_2007_2017
    WHERE  property_type = 1
      AND  lien_status = 1
      AND  owner_occupancy = 1
      AND  action_taken NOT IN (4, 5)
    GROUP  BY respondent_id, agency_code, county_code, activity_year
) TO 'hmda_agg.parquet' (FORMAT PARQUET);

Reference

Data Sources

| Source | URL | Coverage |
|---|---|---|
| ICPSR (Andrew Forrester) | https://www.openicpsr.org/openicpsr/project/151921/version/V1/view | LAR 1981-2006 |
| NARA (official flat files) | https://catalog.archives.gov/id/2456161 | LAR 1981-2014 |
| CFPB historic data | https://www.consumerfinance.gov/data-research/hmda/historic-data/ | LAR 2007-2017 |
| CFPB Snapshot (used by pipeline) | https://ffiec.cfpb.gov/data-browser/ | LAR 2018+ (Snapshot vintage) |
| CFPB static data FAQ | https://ffiec.cfpb.gov/documentation/faq/static-dataset-faq | Explains Snapshot vs. One-Year vs. Three-Year vintages |
| CFPB LAR data fields | https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields | Variable dictionary |
| Philadelphia Fed lender file | https://www.philadelphiafed.org/surveys-and-data/consumer-finance-data/home-mortgage-disclosure-act-lender-file | Lender 1990-present |

Known Issues

  • 2002-2006 ICPSR ZIPs: Have bad CRC headers. The build script falls back to system unzip (which is more lenient) when Python's zipfile module fails.
  • Lender file linkage: The two lender tables (legacy and current) are not yet linked across the 2018 respondent_id-to-LEI transition. Use the ARID2017-to-LEI reference table from the CFPB Snapshot for bridging.
  • CFPB data vintages (2018 onwards): CFPB publishes three dataset vintages (Snapshot, One-Year, Three-Year) that differ by when corrections are captured. The pipeline downloads from the Snapshot endpoint using a uniform filename ({year}_public_lar_csv.zip). For older years, the Snapshot file may have been replaced by the more complete Three-Year or One-Year version at the same URL. The pipeline does not currently support choosing between vintages. If a specific vintage is needed, manually download the ZIP and place it in data/raw/cfpb_snapshot/{year}/ -- the pipeline will skip the download and use the cached file.

Adding New Years

When CFPB publishes a new year's Snapshot (typically ~1 month after the April 1 filing deadline), update two constants in build_database.py:

CFPB_SNAPSHOT_YEARS   = list(range(2018, 2026))  # bump end year
CFPB_SNAPSHOT_ZIPS: dict[int, str] = {
    y: f"{y}_public_lar_csv.zip" for y in range(2018, 2026)  # same
}

Also update EXPECTED["lar_2018_onward"]["years"] to extend the expected range. Then rebuild just that era:

caffeinate -i uv run python build_database.py 2018_onward
uv run python build_database.py --validate

For the lender file, the Philadelphia Fed updates both Excel files periodically. Delete the cached files in data/lender/ and rebuild:

rm data/lender/*.xlsx
caffeinate -i uv run python build_database.py lender

Operations

Project Structure

hmda_database/                # GitHub repo
├── build_database.py         # Python ETL pipeline
├── pyproject.toml            # Python dependencies
├── QUALITY_CHECKS.md         # Data quality check documentation
├── LICENSE                   # MIT
└── README.md                 # This file

data/                         # Data directory (gitignored)
├── raw/
│   ├── icpsr/                # Pre-downloaded: ICPSR ZIPs (1981-2006)
│   ├── cfpb_legacy/          # Auto-downloaded: 2007-2017 LAR ZIPs
│   └── cfpb_snapshot/        # Auto-downloaded: 2018+ LAR ZIPs
├── lender/                   # Auto-downloaded: Philly Fed Excel files
└── logs/                    # Timestamped build and validation logs
hmda.duckdb                   # Output database (~27 GB)

Building and Resuming

The build is resumable: it only drops the tables you specify, preserving completed ones. Use --fresh to delete the entire database and start over.

On macOS, prefix with caffeinate -i to prevent idle sleep during long builds.

uv run python build_database.py --status                   # check progress
caffeinate -i uv run python build_database.py --fresh      # full fresh build
caffeinate -i uv run python build_database.py 2007_2017    # rebuild one era
uv run python build_database.py --validate                 # check year coverage
uv run python build_database.py --quality                  # detailed quality checks

Build times (Apple M4 Pro / 48 GB RAM, cached downloads):

| Era | Time |
|---|---|
| lender | ~10 sec |
| 1981_1989 | ~15 sec |
| 1990_2006 | ~3 min |
| 2007_2017 | ~5 min (~30 min first download) |
| 2018_onward | ~11 min (~20 min first download) |
| Full rebuild | ~20 min |

Storage by era:

| Era | Raw downloads | DB rows | Notes |
|---|---|---|---|
| 1981_1989 + 1990_2006 | 29 GB | 373M | ICPSR; pre-downloaded, not auto-fetched |
| 2007_2017 | 151 GB | 187M | ZIPs + extracted CSVs cached |
| 2018_onward | 51 GB | 124M | ZIPs + extracted CSVs cached |
| lender | 81 MB | 265K | Two Excel files |
| Total | 231 GB | 685M | DB: 27 GB |

Logs are saved to data/logs/ (build_*.log, validate_*.log, quality_*.log).

Quality checks: Run uv run python build_database.py --quality for detailed data validation (row counts benchmarked against CFPB published counts, column coverage, categorical distributions, loan amount units, lender join rates, cross-era boundary consistency). See QUALITY_CHECKS.md for documentation of each check step.

Troubleshooting: Build one era at a time to isolate issues. Check the log for the specific error. Common causes: CFPB/Philly Fed URL changes (404), ICPSR ZIP filenames not matching HMDA_LAR_YYYY.zip, or lender Excel column renames.

Requirements

  • Python 3.11+
  • uv (or pip: pip install duckdb pandas requests openpyxl)
  • ICPSR data pre-downloaded (requires login)
  • ~260 GB disk space (231 GB raw ZIPs + 27 GB database)
  • Internet connection for CFPB and Philly Fed downloads

License

Code: MIT. Data: see individual source licenses above. HMDA data is collected under federal regulation and publicly available from CFPB. ICPSR data requires an account but is free for academic use.
