This repository contains the analytical pipelines used to link clinical patient cohorts with geographic Non-Medical Health Factor (NMHF) datasets. NMHF indicators describe social, economic, and environmental conditions that influence health outcomes and population health equity. Examples include variables derived from datasets such as the CDC Social Vulnerability Index (SVI), Census socioeconomic indicators, and other publicly available geospatial datasets.
The scripts provided in this repository demonstrate how de-identified clinical cohorts can be linked to NMHF indicators using geographic identifiers such as Census tract Federal Information Processing Standards (FIPS) codes.
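As a minimal illustration of this kind of tract-level linkage, the sketch below joins a toy cohort to a toy NMHF table on an 11-digit FIPS key using pandas. All column names and values here are hypothetical and do not reflect the repository's actual schema:

```python
import pandas as pd

# Hypothetical de-identified cohort: patient IDs with 11-digit tract FIPS codes.
cohort = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "FIPS11": ["36061000100", "36061000200", "36061999999"],
})

# Hypothetical NMHF baseline keyed by the same tract identifier.
nmhf = pd.DataFrame({
    "FIPS11": ["36061000100", "36061000200"],
    "EP_POV150": [18.2, 9.7],   # % below 150% of poverty line (SVI)
    "EP_UNEMP": [6.1, 4.3],     # % unemployed (SVI)
})

# A left join preserves the full cohort; tracts absent from the NMHF
# baseline yield NULL (NaN) indicator values for those patients.
linked = cohort.merge(nmhf, on="FIPS11", how="left")
```

A left join (rather than an inner join) is the natural choice here because it keeps unlinked patients visible for downstream attrition reporting.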
To support transparency and reproducibility, this repository provides:
- Python and R analytical pipelines used for NMHF linkage
- example SQL queries used in Google BigQuery
- data quality (QC) metrics and validation outputs
- documentation describing NMHF datasets and indicators
These materials accompany the NMHF data infrastructure described in the associated research manuscript.
```
NMHF/
│
├── Analysis scripts/
│   ├── HIV_Lung_Cancer_FIPS_Analysis_(all_ages).ipynb
│   ├── Elderly_Cancer_Study_FIPS_Analysis.ipynb
│   └── Palliative_Care_NMHF.Rmd
│
├── Documentation/
│   └── NMHF Data Descriptions Short.pdf
│
├── QC/
│   └── NMHF QC and metrics.pdf
│
├── SQL/
│   ├── Data Quality Summary.sql
│   ├── Geospatial Linkage.sql
│   └── Population-Weighted Geographic Crosswalking.sql
│
└── README.md
```
File: HIV_Lung_Cancer_FIPS_Analysis_(all_ages).ipynb
Description
This pipeline manages spatial linkage for the HIV and Lung Cancer clinical cohort. It merges the de-identified patient dataset with NMHF baseline data using the 11-digit Census Tract identifier (FIPS11).
Quality Control
The pipeline includes an automated data quality audit that:
- validates geographic conformance (ensuring valid FIPS codes are present)
- verifies that NMHF variables (e.g., SVI metrics) are correctly parsed as numeric values
- flags records containing suppression codes or implausible values
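The three checks above could be sketched as a single audit function. This is an illustrative outline only; the column names (`FIPS11`, the NMHF variable list) and the choice of flags are assumptions, not the notebook's actual implementation:

```python
import pandas as pd

def qc_audit(df: pd.DataFrame, nmhf_cols: list) -> pd.DataFrame:
    """Annotate a linked frame with illustrative data-quality checks."""
    out = df.copy()
    # Geographic conformance: a valid Census tract FIPS is exactly 11 digits.
    out["valid_fips"] = out["FIPS11"].astype(str).str.fullmatch(r"\d{11}")
    # Schema check: coerce NMHF variables to numeric; bad entries become NaN.
    for col in nmhf_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Plausibility: flag negative suppression sentinels or out-of-range values.
    out["flagged"] = ((out[nmhf_cols] < 0) | (out[nmhf_cols] > 100)).any(axis=1)
    return out
```

Annotating rather than dropping rows at this stage keeps the audit non-destructive, so exclusion decisions can be reported separately.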
File: Elderly_Cancer_Study_FIPS_Analysis.ipynb
Description
This script executes NMHF integration for the Aging Population Cancer study. It performs a deterministic spatial join linking patient residence locations to Census-tract-level NMHF indicators.
Quality Control
The pipeline includes standardized QC functions that:
- measure row-level attrition during cohort linkage
- calculate the proportion of the final cohort with complete NMHF indicators
- validate core socioeconomic indicators such as:
  - poverty rate (EP_POV150)
  - unemployment (EP_UNEMP)
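Attrition and completeness metrics of this kind could be computed along the following lines. This is a sketch; the function name and the reporting conventions are assumptions:

```python
import pandas as pd

def linkage_metrics(before: pd.DataFrame, after: pd.DataFrame, nmhf_cols):
    """Summarize row-level attrition and NMHF completeness after linkage."""
    rows_before, rows_after = len(before), len(after)
    # Share of final-cohort rows with every NMHF indicator populated.
    complete = after[nmhf_cols].notna().all(axis=1).mean() if rows_after else 0.0
    return {
        "rows_before": rows_before,
        "rows_after": rows_after,
        "attrition_pct": round(100 * (rows_before - rows_after) / rows_before, 1),
        "pct_complete_nmhf": round(100 * float(complete), 1),
    }
```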
File: Palliative_Care_NMHF.Rmd
Description
This R-based workflow manages NMHF integration for the Palliative Care cohort.
This pipeline performs geographic harmonization by converting 12-digit Census Block Group identifiers (FIPS12) to 11-digit Census Tract identifiers (FIPS11) to ensure spatial compatibility with the NMHF baseline dataset.
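Because a Census block group nests inside its tract, this conversion amounts to keeping the first 11 digits of the 12-digit code (state 2 + county 3 + tract 6; the 12th digit identifies the block group). A minimal Python sketch, assuming string identifiers:

```python
def fips12_to_fips11(fips12) -> str:
    """Return the parent 11-digit tract FIPS for a 12-digit block-group FIPS."""
    # Restore a leading zero that numeric parsing may have stripped.
    code = str(fips12).zfill(12)
    if len(code) != 12 or not code.isdigit():
        raise ValueError(f"not a valid FIPS12: {fips12!r}")
    return code[:11]
```

Zero-padding before truncation matters in practice: FIPS codes beginning with 0 (e.g., California's state code 06) silently lose their leading digit when read as integers.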
Quality Control
The R pipeline replicates the Python QC framework and includes:
- validation of geographic identifiers prior to linkage
- post-merge completeness checks
- filtering of suppressed or invalid NMHF indicators
Descriptions of NMHF datasets and indicators used within the infrastructure are provided in the Documentation/ folder. These materials summarize the domains covered by the integrated datasets, including socioeconomic indicators, neighborhood vulnerability metrics, healthcare resource availability, and population demographics.
During deterministic linkage between clinical cohorts and NMHF datasets, the final analytical dataset may contain fewer records than the original clinical cohort. This difference between the reported Rows Before and Rows After counts is expected and results from the application of standardized quality control procedures.
Records may be excluded for the following reasons:
- Clinical records without valid residential information cannot be geocoded to a Census geographic identifier (FIPS). These records cannot be linked to NMHF datasets and therefore result in NULL NMHF values.
- Quality control scripts enforce strict bounds on continuous variables. For example, socioeconomic indicators such as poverty or unemployment must fall within valid percentage ranges (0–100). Records containing mathematically implausible values are excluded to preserve analytical validity.
- Some public datasets include suppression codes or placeholder values for privacy protection. These values are detected and filtered during QC procedures.
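The bound and suppression checks described above might be implemented as a single filter. In this sketch, the column names are hypothetical; only the 0–100 range is carried over from the text:

```python
import pandas as pd

def drop_implausible(df: pd.DataFrame, pct_cols) -> pd.DataFrame:
    """Keep only rows whose percentage indicators are numeric and in [0, 100].

    Negative suppression sentinels (e.g., -999) and non-numeric placeholders
    both fail the range test, so one mask removes them alongside implausible
    values.
    """
    mask = pd.Series(True, index=df.index)
    for col in pct_cols:
        vals = pd.to_numeric(df[col], errors="coerce")  # placeholders -> NaN
        mask &= vals.between(0, 100)                    # NaN fails the test
    return df[mask].reset_index(drop=True)
```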
By applying these standardized QC procedures, downstream analyses—including statistical modeling and machine learning workflows—are performed using datasets that are geographically valid, complete, and analytically reliable.
Example QC metrics included in this repository evaluate:
- geographic conformance (valid FIPS identifiers)
- schema validation (expected variable types)
- completeness (percentage of records with complete NMHF indicators)
- plausibility checks (detection of out-of-range values)
These QC summaries are available in the QC/ directory.
The SQL/ directory contains example SQL queries demonstrating common analytical workflows supported by the NMHF infrastructure. These examples illustrate how researchers can interact with NMHF datasets stored in Google BigQuery for data quality validation, cohort linkage, and geographic harmonization.
The harmonized NMHF datasets used in the associated research project are hosted in Google BigQuery.
Researchers interested in accessing the NMHF data infrastructure may submit an access request through the project request form:
https://docs.google.com/forms/d/e/1FAIpQLSdPCBh2IwB4wJ80VKuCNIs9dfLUGnwiyzuxAM83q_6DxQm2Dw/viewform
This repository provides analytical scripts and documentation for research and educational use. Clinical datasets referenced in example pipelines are not included and remain subject to institutional data governance.