This repository contains the analytical pipelines used to link clinical patient cohorts with geographic Non-Medical Health Factor (NMHF) datasets. NMHF indicators describe social, economic, and environmental conditions that influence health outcomes and population health equity. Examples include variables derived from datasets such as the CDC Social Vulnerability Index (SVI), Census socioeconomic indicators, and other publicly available geospatial datasets.
The scripts provided in this repository demonstrate how de-identified clinical cohorts can be linked to NMHF indicators using geographic identifiers such as Census tract Federal Information Processing Standards (FIPS) codes.
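As a minimal illustration of this kind of tract-level linkage, the sketch below joins a toy cohort to a toy NMHF table on an 11-digit FIPS key using pandas. All column names and values here are hypothetical and do not reflect the repository's actual schema:

```python
import pandas as pd

# Hypothetical de-identified cohort: patient IDs with 11-digit tract FIPS codes.
cohort = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "FIPS11": ["36061000100", "36061000200", "36061999999"],
})

# Hypothetical NMHF baseline keyed by the same tract identifier.
nmhf = pd.DataFrame({
    "FIPS11": ["36061000100", "36061000200"],
    "EP_POV150": [18.2, 9.7],   # % below 150% of poverty line (SVI)
    "EP_UNEMP": [6.1, 4.3],     # % unemployed (SVI)
})

# A left join preserves the full cohort; tracts absent from the NMHF
# baseline yield NULL (NaN) indicator values for those patients.
linked = cohort.merge(nmhf, on="FIPS11", how="left")
```

A left join (rather than an inner join) is the natural choice here because it keeps unlinked patients visible for downstream attrition reporting.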
To support transparency and reproducibility, this repository provides:
- Python and R analytical pipelines used for NMHF linkage
- example SQL queries used in Google BigQuery
- data quality (QC) metrics and validation outputs
- documentation describing NMHF datasets and indicators
These materials accompany the NMHF data infrastructure described in the associated research manuscript.
```
NMHF/
│
├── Analysis scripts/
│   ├── HIV_Lung_Cancer_FIPS_Analysis_(all_ages).ipynb
│   ├── Elderly_Cancer_Study_FIPS_Analysis.ipynb
│   └── Palliative_Care_NMHF.Rmd
│
├── Documentation/
│   └── NMHF Data Descriptions Short.pdf
│
├── QC/
│   └── NMHF QC and metrics.pdf
│
├── SQL/
│   ├── Data Quality Summary.sql
│   ├── Geospatial Linkage.sql
│   └── Population-Weighted Geographic Crosswalking.sql
│
└── README.md
```
File: HIV_Lung_Cancer_FIPS_Analysis_(all_ages).ipynb
Description
This pipeline manages spatial linkage for the HIV and Lung Cancer clinical cohort. It merges the de-identified patient dataset with NMHF baseline data using the 11-digit Census Tract identifier (FIPS11).
Quality Control
The pipeline includes an automated data quality audit that:
- validates geographic conformance (ensuring valid FIPS codes are present)
- verifies that NMHF variables (e.g., SVI metrics) are correctly parsed as numeric values
- flags records containing suppression codes or implausible values
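The three checks above could be sketched as a single audit function. This is an illustrative outline only; the column names (`FIPS11`, the NMHF variable list) and the choice of flags are assumptions, not the notebook's actual implementation:

```python
import pandas as pd

def qc_audit(df: pd.DataFrame, nmhf_cols: list) -> pd.DataFrame:
    """Annotate a linked frame with illustrative data-quality checks."""
    out = df.copy()
    # Geographic conformance: a valid Census tract FIPS is exactly 11 digits.
    out["valid_fips"] = out["FIPS11"].astype(str).str.fullmatch(r"\d{11}")
    # Schema check: coerce NMHF variables to numeric; bad entries become NaN.
    for col in nmhf_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Plausibility: flag negative suppression sentinels or out-of-range values.
    out["flagged"] = ((out[nmhf_cols] < 0) | (out[nmhf_cols] > 100)).any(axis=1)
    return out
```

Annotating rather than dropping rows at this stage keeps the audit non-destructive, so exclusion decisions can be reported separately.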
File: Elderly_Cancer_Study_FIPS_Analysis.ipynb
Description
This script executes NMHF integration for the Aging Population Cancer study. It performs a deterministic spatial join linking patient residence locations to Census-tract-level NMHF indicators.
Quality Control
The pipeline includes standardized QC functions that:
- measure row-level attrition during cohort linkage
- calculate the proportion of the final cohort with complete NMHF indicators
- validate core socioeconomic indicators such as:
  - poverty rate (EP_POV150)
  - unemployment (EP_UNEMP)
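Attrition and completeness metrics of this kind could be computed along the following lines. This is a sketch; the function name and the reporting conventions are assumptions:

```python
import pandas as pd

def linkage_metrics(before: pd.DataFrame, after: pd.DataFrame, nmhf_cols):
    """Summarize row-level attrition and NMHF completeness after linkage."""
    rows_before, rows_after = len(before), len(after)
    # Share of final-cohort rows with every NMHF indicator populated.
    complete = after[nmhf_cols].notna().all(axis=1).mean() if rows_after else 0.0
    return {
        "rows_before": rows_before,
        "rows_after": rows_after,
        "attrition_pct": round(100 * (rows_before - rows_after) / rows_before, 1),
        "pct_complete_nmhf": round(100 * float(complete), 1),
    }
```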
File: Palliative_Care_NMHF.Rmd
Description
This R-based workflow manages NMHF integration for the Palliative Care cohort.
This pipeline performs geographic harmonization by converting 12-digit Census Block Group identifiers (FIPS12) to 11-digit Census Tract identifiers (FIPS11) to ensure spatial compatibility with the NMHF baseline dataset.
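Because a Census block group nests inside its tract, this conversion amounts to keeping the first 11 digits of the 12-digit code (state 2 + county 3 + tract 6; the 12th digit identifies the block group). A minimal Python sketch, assuming string identifiers:

```python
def fips12_to_fips11(fips12) -> str:
    """Return the parent 11-digit tract FIPS for a 12-digit block-group FIPS."""
    # Restore a leading zero that numeric parsing may have stripped.
    code = str(fips12).zfill(12)
    if len(code) != 12 or not code.isdigit():
        raise ValueError(f"not a valid FIPS12: {fips12!r}")
    return code[:11]
```

Zero-padding before truncation matters in practice: FIPS codes beginning with 0 (e.g., California's state code 06) silently lose their leading digit when read as integers.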
Quality Control
The R pipeline replicates the Python QC framework and includes:
- validation of geographic identifiers prior to linkage
- post-merge completeness checks
- filtering of suppressed or invalid NMHF indicators
Descriptions of NMHF datasets and indicators used within the infrastructure are provided in the Documentation/ folder. These materials summarize the domains covered by the integrated datasets, including socioeconomic indicators, neighborhood vulnerability metrics, healthcare resource availability, and population demographics.
During deterministic linkage between clinical cohorts and NMHF datasets, the final analytical dataset may contain fewer records than the original clinical cohort. This difference between the reported Rows Before and Rows After counts is expected and results from the application of standardized quality control procedures.
Records may be excluded for the following reasons:
- Clinical records without valid residential information cannot be geocoded to a Census geographic identifier (FIPS). These records cannot be linked to NMHF datasets and therefore result in NULL NMHF values.
- Quality control scripts enforce strict bounds on continuous variables. For example, socioeconomic indicators such as poverty or unemployment must fall within valid percentage ranges (0–100). Records containing mathematically implausible values are excluded to preserve analytical validity.
- Some public datasets include suppression codes or placeholder values for privacy protection. These values are detected and filtered during QC procedures.
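The bound and suppression checks described above might be implemented as a single filter. In this sketch, the column names are hypothetical; only the 0–100 range is carried over from the text:

```python
import pandas as pd

def drop_implausible(df: pd.DataFrame, pct_cols) -> pd.DataFrame:
    """Keep only rows whose percentage indicators are numeric and in [0, 100].

    Negative suppression sentinels (e.g., -999) and non-numeric placeholders
    both fail the range test, so one mask removes them alongside implausible
    values.
    """
    mask = pd.Series(True, index=df.index)
    for col in pct_cols:
        vals = pd.to_numeric(df[col], errors="coerce")  # placeholders -> NaN
        mask &= vals.between(0, 100)                    # NaN fails the test
    return df[mask].reset_index(drop=True)
```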
By applying these standardized QC procedures, downstream analyses—including statistical modeling and machine learning workflows—are performed using datasets that are geographically valid, complete, and analytically reliable.
Example QC metrics included in this repository evaluate:
- geographic conformance (valid FIPS identifiers)
- schema validation (expected variable types)
- completeness (percentage of records with complete NMHF indicators)
- plausibility checks (detection of out-of-range values)
These QC summaries are available in the QC/ directory.
The SQL/ directory contains example SQL queries demonstrating common analytical workflows supported by the NMHF infrastructure. These examples illustrate how researchers can interact with NMHF datasets stored in Google BigQuery for data quality validation, cohort linkage, and geographic harmonization.
The harmonized NMHF datasets used in the associated research project are hosted in Google BigQuery.
Researchers interested in accessing the NMHF data infrastructure may submit an access request through the project request form:
https://docs.google.com/forms/d/e/1FAIpQLSdPCBh2IwB4wJ80VKuCNIs9dfLUGnwiyzuxAM83q_6DxQm2Dw/viewform
This repository provides analytical scripts and documentation for research and educational use. Clinical datasets referenced in example pipelines are not included and remain subject to institutional data governance.