HMDA Database

Build a harmonized DuckDB database from 40+ years of Home Mortgage Disclosure Act (HMDA) data in a single command. Modeled after Paul Goldsmith-Pinkham's IPEDS database pipeline.

uv sync
uv run python build_database.py --fresh

The result is hmda.duckdb (~27 GB) containing 685 million rows across four era-specific LAR tables — one aggregate table (1981–1989, bank/census-tract level) and three application-level tables (1990–2006, 2007–2017, 2018–2024) — plus two lender tables.

Following the approach demonstrated in Paul's Substack post (Large Datasets and Structured Databases: Claude Code for Economists) and Markus Academy videos, where he showcases this workflow with HMDA data, the pipeline uses DuckDB-native CSV reading for the three large eras (1990+), streaming rows directly from CSV into the database without loading them into memory. This keeps RAM usage low even for years with 40M+ rows, in contrast to the high peak RAM required to hold the same data in pandas DataFrames. Pandas is used only for the small files: the 1981-1989 aggregates and the lender Excel files.

Table of Contents

  • Quick Start
  • Database Schema
      • Tables — era-specific tables, row counts, schema differences, column renames, loan amount units
  • Usage
  • Reference
  • Operations


Quick Start

Prerequisites: Python 3.11+ and uv.

uv sync
uv run python build_database.py --fresh

CFPB and Philadelphia Fed data are auto-downloaded and cached in data/raw/. ICPSR data (1981-2006) must be downloaded manually beforehand (it requires an ICPSR login) and placed in data/raw/icpsr/. See ICPSR project 151921.


Database Schema

Tables

HMDA data spans 40+ years with fundamentally different schemas across eras — different columns, different lender identifiers, and different variable coding. Rather than forcing everything into one table with hundreds of mostly-NULL columns, the pipeline stores each era separately and provides a cross-era lar view over the common columns.

| Table | Description | Lender ID | Years | ~Rows/yr | Total |
|---|---|---|---|---|---|
| lar_1981_1989 | ICPSR LAR (bank/tract aggregates) | respondent_id + agency_code | 1981-1989 | 0.4-0.6M | 4.6M |
| lar_1990_2006 | ICPSR LAR (application-level) | respondent_id + agency_code | 1990-2006 | 7-42M | 368.6M |
| lar_2007_2017 | CFPB legacy LAR (codes only) | respondent_id + agency_code | 2007-2017 | 12-27M | 187.5M |
| lar_2018_onward | CFPB Snapshot LAR | LEI | 2018-2024 | 11-26M | 124.1M |
| lender_legacy | Philly Fed lender file (Avery File) | HMPRID + CODE | 1990-2017 | -- | 230K |
| lender_current | Philly Fed lender file | LEI | 2018-present | -- | 35K |
| lar (view) | Cross-era union of 26 common columns (application-level only) | -- | 1990-present | -- | 680M |
| _metadata | Table inventory | -- | -- | -- | -- |
Every LAR table has activity_year for filtering by year. Lender tables use year. Variable definitions for the lender file: Philly Fed documentation.

Schema differences across eras

| Feature | lar_1981_1989 | lar_1990_2006 | lar_2007_2017 | lar_2018_onward |
|---|---|---|---|---|
| Granularity | Bank x census tract | Application | Application | Application |
| Lender ID | respondent_id + agency_code | respondent_id + agency_code | respondent_id + agency_code | LEI |
| Key columns | num_govt, vol_conv, etc. | action_taken, loan_type, etc. | Same as 1990-2006 | derived_race, derived_sex, etc. |
| Dwelling type filter | N/A | property_type (integer, 2004+ only) | property_type (integer) | derived_dwelling_category (STRING) |
| Occupancy filter | N/A | owner_occupancy (integer) | owner_occupancy (integer) | occupancy_type (integer) |
| Race/ethnicity | N/A | applicant_race_1 (code) | applicant_race_1 (code) | derived_race (string) |
| Columns | ~25 | ~23 (1990-2003); ~36 (2004-2006) | ~25 | ~99 |

Column renames across years

The pipeline harmonizes column names that changed between years:

| Era | Original column | Renamed to | Notes |
|---|---|---|---|
| 1990-2003 (ICPSR) | occupancy_type | owner_occupancy | Name standardized |
| 2004-2006 (ICPSR) | occupancy | owner_occupancy | Same concept, shorter name in source |
| 1990-2006 (ICPSR) | loan_amount | loan_amount_000s | Values are in thousands; renamed to match unit |
| 2007-2017 (CFPB) | as_of_year | activity_year | Standardized across eras |
| 2018 (CFPB) | loan_to_value_ratio | combined_loan_to_value_ratio | See note below |

Note on combined_loan_to_value_ratio: The 2018 CSV shipped with a shortened column name (loan_to_value_ratio); 2019+ uses the official name per CFPB LAR data fields. We standardize to the official name.

Loan amount units differ across eras:

| Table | Column | Unit |
|---|---|---|
| lar_1990_2006 | loan_amount_000s | Thousands of dollars |
| lar_2007_2017 | loan_amount_000s | Thousands of dollars |
| lar_2018_onward | loan_amount | Dollars |

When querying across eras via the lar view, multiply loan_amount_000s by 1000 to get dollars, or divide loan_amount by 1000 to get thousands.

Label column handling (2007-2017)

The CFPB CSV files for 2007-2017 include both code columns (e.g., action_taken = 1) and label columns (e.g., action_taken_name = "Loan originated"). All *_name and *_abbr columns are dropped -- they are verified 1-to-1 with the integer codes per the CFPB data dictionary.

Exception: county_name is kept because the county_name / county_code mapping is not 1-to-1.

Missing values (2018+)

CFPB Snapshot data uses the strings "NA" and "Exempt" for missing/not-applicable values. These are converted to NULL in the database.


Usage

Common Sample Filters

These match the three filter levels published by CFPB on their historic data page (verified exact-match in --quality step 1 for 2007-2017).

All records (no filter)

All applications, denials, originations, and institutional purchases. No WHERE clause needed -- use the full table.

All originations

-- All eras
WHERE action_taken = 1

Originations for first-lien, owner-occupied, single-family homes

-- 1990-2003 (lar_1990_2006): property_type and lien_status not available
WHERE action_taken = 1 AND owner_occupancy = 1

-- 2004-2006 (lar_1990_2006)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2007-2017 (lar_2007_2017)
WHERE action_taken = 1 AND property_type = 1
  AND lien_status = 1 AND owner_occupancy = 1

-- 2018+ (lar_2018_onward): note STRING filter for dwelling category
WHERE action_taken = 1
  AND derived_dwelling_category = 'Single Family (1-4 Units):Site-Built'
  AND lien_status = 1 AND occupancy_type = 1

Note: CFPB labels this filter "Mortgages for first lien, owner-occupied, 1-4 family" but it implicitly requires action_taken = 1 (originations only).

Exporting to Stata

Method A: via Parquet (recommended)

COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
      AND state_code = 36
) TO 'lar_2015_ny.parquet' (FORMAT PARQUET);

Then in Stata (requires the pq package -- ssc install pq):

pq use "lar_2015_ny.parquet", clear

Method B: via CSV (fallback)

COPY (
    SELECT * FROM lar_2007_2017
    WHERE activity_year = 2015
) TO 'lar_2015.csv' (HEADER, DELIMITER ',');
import delimited "lar_2015.csv", clear

Method C: Aggregated export

COPY (
    SELECT respondent_id, agency_code, county_code, activity_year,
           SUM(CASE WHEN action_taken = 1 THEN loan_amount_000s ELSE 0 END) AS amt_orig,
           COUNT(CASE WHEN action_taken = 1 THEN 1 END) AS num_orig,
           COUNT(*) AS apps
    FROM   lar_2007_2017
    WHERE  property_type = 1
      AND  lien_status = 1
      AND  owner_occupancy = 1
      AND  action_taken NOT IN (4, 5)
    GROUP  BY respondent_id, agency_code, county_code, activity_year
) TO 'hmda_agg.parquet' (FORMAT PARQUET);

Reference

Data Sources

| Source | URL | Coverage |
|---|---|---|
| ICPSR (Andrew Forrester) | https://www.openicpsr.org/openicpsr/project/151921/version/V1/view | LAR 1981-2006 |
| NARA (official flat files) | https://catalog.archives.gov/id/2456161 | LAR 1981-2014 |
| CFPB historic data | https://www.consumerfinance.gov/data-research/hmda/historic-data/ | LAR 2007-2017 |
| CFPB Snapshot (used by pipeline) | https://ffiec.cfpb.gov/data-browser/ | LAR 2018+ (Snapshot vintage) |
| CFPB static data FAQ | https://ffiec.cfpb.gov/documentation/faq/static-dataset-faq | Explains Snapshot vs. One-Year vs. Three-Year vintages |
| CFPB LAR data fields | https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields | Variable dictionary |
| Philadelphia Fed lender file | https://www.philadelphiafed.org/surveys-and-data/consumer-finance-data/home-mortgage-disclosure-act-lender-file | Lender 1990-present |

Known Issues

  • 2002-2006 ICPSR ZIPs: Have bad CRC headers. The build script falls back to system unzip (which is more lenient) when Python's zipfile module fails.
  • Lender file linkage: The two lender tables (legacy and current) are not yet linked across the 2018 respondent_id-to-LEI transition. Use the ARID2017-to-LEI reference table from the CFPB Snapshot for bridging.
  • CFPB data vintages (2018 onwards): CFPB publishes three dataset vintages (Snapshot, One-Year, Three-Year) that differ by when corrections are captured. The pipeline downloads from the Snapshot endpoint using a uniform filename ({year}_public_lar_csv.zip). For older years, the Snapshot file may have been replaced by the more complete Three-Year or One-Year version at the same URL. The pipeline does not currently support choosing between vintages. If a specific vintage is needed, manually download the ZIP and place it in data/raw/cfpb_snapshot/{year}/ -- the pipeline will skip the download and use the cached file.

Adding New Years

When CFPB publishes a new year's Snapshot (typically ~1 month after the April 1 filing deadline), update two constants in build_database.py:

CFPB_SNAPSHOT_YEARS   = list(range(2018, 2026))  # bump end year
CFPB_SNAPSHOT_ZIPS: dict[int, str] = {
    y: f"{y}_public_lar_csv.zip" for y in range(2018, 2026)  # same
}

Also update EXPECTED["lar_2018_onward"]["years"] to extend the expected range. Then rebuild just that era:

caffeinate -i uv run python build_database.py 2018_onward
uv run python build_database.py --validate

For the lender file, the Philadelphia Fed updates both Excel files periodically. Delete the cached files in data/lender/ and rebuild:

rm data/lender/*.xlsx
caffeinate -i uv run python build_database.py lender

Operations

Project Structure

hmda_database/                # GitHub repo
├── build_database.py         # Python ETL pipeline
├── pyproject.toml            # Python dependencies
├── QUALITY_CHECKS.md         # Data quality check documentation
├── LICENSE                   # MIT
└── README.md                 # This file

data/                         # Data directory (gitignored)
├── raw/
│   ├── icpsr/                # Pre-downloaded: ICPSR ZIPs (1981-2006)
│   ├── cfpb_legacy/          # Auto-downloaded: 2007-2017 LAR ZIPs
│   └── cfpb_snapshot/        # Auto-downloaded: 2018+ LAR ZIPs
├── lender/                   # Auto-downloaded: Philly Fed Excel files
└── logs/                    # Timestamped build and validation logs
hmda.duckdb                   # Output database (~27 GB)

Building and Resuming

The build is resumable: it only drops the tables you specify, preserving completed ones. Use --fresh to delete the entire database and start over.

On macOS, prefix with caffeinate -i to prevent idle sleep during long builds.

uv run python build_database.py --status                   # check progress
caffeinate -i uv run python build_database.py --fresh      # full fresh build
caffeinate -i uv run python build_database.py 2007_2017    # rebuild one era
uv run python build_database.py --validate                 # check year coverage
uv run python build_database.py --quality                  # detailed quality checks

Build times (Apple M4 Pro / 48 GB RAM, cached downloads):

| Era | Time |
|---|---|
| lender | ~10 sec |
| 1981_1989 | ~15 sec |
| 1990_2006 | ~3 min |
| 2007_2017 | ~5 min (~30 min first download) |
| 2018_onward | ~11 min (~20 min first download) |
| Full rebuild | ~20 min |

Storage by era:

| Era | Raw downloads | DB rows | Notes |
|---|---|---|---|
| 1981_1989 + 1990_2006 | 29 GB | 373M | ICPSR; pre-downloaded, not auto-fetched |
| 2007_2017 | 151 GB | 187M | ZIPs + extracted CSVs cached |
| 2018_onward | 51 GB | 124M | ZIPs + extracted CSVs cached |
| lender | 81 MB | 265K | Two Excel files |
| Total | 231 GB | 685M | DB: 27 GB |

Logs are saved to data/logs/ (build_*.log, validate_*.log, quality_*.log).

Quality checks: Run uv run python build_database.py --quality for detailed data validation (row counts benchmarked against CFPB published counts, column coverage, categorical distributions, loan amount units, lender join rates, cross-era boundary consistency). See QUALITY_CHECKS.md for documentation of each check step.

Troubleshooting: Build one era at a time to isolate issues. Check the log for the specific error. Common causes: CFPB/Philly Fed URL changes (404), ICPSR ZIP filenames not matching HMDA_LAR_YYYY.zip, or lender Excel column renames.

Requirements

  • Python 3.11+
  • uv (or pip: pip install duckdb pandas requests openpyxl)
  • ICPSR data pre-downloaded (requires login)
  • ~260 GB disk space (231 GB raw ZIPs + 27 GB database)
  • Internet connection for CFPB and Philly Fed downloads

License

Code: MIT. Data: see individual source licenses above. HMDA data is collected under federal regulation and publicly available from CFPB. ICPSR data requires an account but is free for academic use.
