# Welcome to the RCM Healthcare Data Pipeline Project!
This repository contains an end-to-end data pipeline for processing Revenue Cycle Management (RCM) data from two fictional hospitals, transforming raw hospital records into actionable financial insights. Built using Microsoft Azure tools—Data Factory, Databricks, ADLS Gen2, SQL Database, Key Vault, and Unity Catalog—this project follows a Medallion architecture (Bronze → Silver → Gold) to automate data ingestion, cleaning, historization, and reporting.
Whether you are a Backend Software Engineer, a Data Engineer exploring Azure-based data pipelines, or a Healthcare Analyst working with RCM data, this project offers a comprehensive framework that can be leveraged for learning and adaptation.
## Table of Contents

- Overview
- Key Features
- Achievements & Improvements
- Data Sources
- Pipeline Steps & Flow
- End-User Access & Consumption
- Key Learnings
- Project Structure
- Contributing
- License
- Contact
## Overview

The pipeline consists of two main stages:
1. **Ingestion to Bronze (Raw Data Layer)**
   - Extracts raw Electronic Medical Records (EMR) data from Azure SQL Databases
   - Loads it into the Azure Data Lake Storage Gen2 (ADLS Gen2) Bronze layer as Parquet files
   - Uses metadata-driven Azure Data Factory (ADF) pipelines for automation
2. **Transformation to Silver & Gold (Processed & Business-Ready Data)**
   - Cleans and historizes data in Databricks (Silver layer)
   - Creates business-ready Fact and Dimension tables in Delta format (Gold layer)
The outcome is a scalable, secure, and fully automated data pipeline that transforms raw hospital records into clean, historized tables optimized for financial reporting, all documented within this repository.
## Key Features

- **Metadata-Driven Pipelines** – A single configuration file (`load_config.csv`) orchestrates data ingestion for 10 tables, supporting:
  - Full loads for static data (e.g., `departments`).
  - Incremental loads for dynamic data (e.g., `encounters`).
- **Medallion Architecture** – Implements a layered data flow:
  - **Bronze** – Raw data storage in ADLS Gen2.
  - **Silver** – Cleansed and historized data.
  - **Gold** – Business-ready analytics layer.
- **Dual Transformation Logic** – Supports two data update strategies:
  - **Full Refresh** – Replaces static Silver tables (e.g., `departments`) entirely.
  - **Slowly Changing Dimension Type 2 (SCD2)** – Tracks history in dynamic tables like `patients`, preserving past records (e.g., old addresses) with `is_current` flags.
- **Delta Tables** – Silver and Gold layers use Delta format, enabling:
  - ACID transactions for data integrity.
  - Versioning & rollback capabilities for auditing and recovery.
- **Enhanced Security with Key Vault** – Stores sensitive credentials (e.g., SQL passwords) securely in Azure Key Vault instead of hardcoding them.
- **Centralized Governance with Unity Catalog** – Manages all Databricks tables in one organized, governed catalog.
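The metadata-driven pattern can be sketched in plain Python. The config columns used below (`database`, `table`, `load_type`, `watermark_column`, `is_active`) are assumptions for illustration; the repository's actual `load_config.csv` schema may differ, and in the real pipeline ADF Lookup and ForEach activities perform this dispatch.

```python
import csv
import io

# Hypothetical load_config.csv contents; the real column names may differ.
CONFIG_CSV = """\
database,table,load_type,watermark_column,is_active
hospital_a,departments,full,,1
hospital_a,encounters,incremental,modified_date,1
hospital_b,transactions,incremental,modified_date,0
"""

def build_ingestion_plan(config_text: str) -> list:
    """Return one ingestion task per active table in the config."""
    plan = []
    for row in csv.DictReader(io.StringIO(config_text)):
        if row["is_active"] != "1":   # skip disabled tables without code changes
            continue
        if row["load_type"] == "incremental":
            # Incremental: pull only rows newer than the stored watermark.
            query = (f"SELECT * FROM {row['table']} "
                     f"WHERE {row['watermark_column']} > @last_watermark")
        else:
            query = f"SELECT * FROM {row['table']}"  # full refresh
        plan.append({"table": row["table"], "query": query})
    return plan

plan = build_ingestion_plan(CONFIG_CSV)
for task in plan:
    print(task["table"], "->", task["query"])
```

Note how deactivating a table (`is_active = 0`) or adding a new one changes only the config file, never the code, which is what makes a single pipeline scale across all 10 tables.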
## Achievements & Improvements

- **Scalability** – A single pipeline processes 10 tables driven by `load_config.csv`; adding a new table requires only a config update, with no code changes.
- **Performance Optimization** – Parallelized `pl_emr_src_to_landing` (batch count: 5), reducing ingestion time for large datasets.
- **Enhanced Flexibility** – Introduced an `is_active` flag in the config file, allowing tables to be skipped without modifying the pipeline code.
- **Security Upgrade** – Eliminated hardcoded credentials, migrating to Azure Key Vault for a production-ready, secure implementation.
- **Improved Efficiency** – Mounted ADLS containers in Databricks, enabling faster access and cleaner code.
These enhancements transformed a prototype into a scalable, secure, and efficient system, ready to handle real-world RCM challenges.
## Data Sources

This pipeline processes data from three primary sources, each ingested into the system based on its format and update frequency.

### EMR Data
- Source: SQL Databases
- Data: Patient demographics, encounters, transactions, departments, insurance providers
- Update Frequency: Continuous
- Storage: Directly loaded into the Bronze Layer (Parquet format)
### Claims Data

- Source: Flat files (CSV)
- Data: Monthly insurance claims submitted by hospitals
- Update Frequency: Monthly
- Storage: Initially stored in the Landing Layer (CSV format) before ingestion into Bronze
### NPI & ICD Codes

- Source: Public APIs
- Data:
- National Provider Identifier (NPI): Unique IDs assigned to healthcare providers
- ICD Codes: Standardized diagnostic codes
- Update Frequency: As per API availability
- Storage: Directly loaded into the Bronze Layer (Parquet format)
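For illustration, here is how an NPI registry response might be flattened into rows for the Bronze layer. The payload below is a made-up sample in the general shape of the public NPPES registry API; the real response has many more fields, so check the API documentation before relying on this schema.

```python
import json

# Illustrative NPPES-style payload; the actual registry response is richer.
SAMPLE_RESPONSE = json.dumps({
    "result_count": 2,
    "results": [
        {"number": "1234567890", "basic": {"first_name": "ADA", "last_name": "LOVELACE"}},
        {"number": "1098765432", "basic": {"first_name": "ALAN", "last_name": "TURING"}},
    ],
})

def extract_providers(payload: str) -> list:
    """Flatten an NPI registry response into rows ready for the Bronze layer."""
    doc = json.loads(payload)
    return [
        {
            "npi": r["number"],
            "first_name": r.get("basic", {}).get("first_name"),
            "last_name": r.get("basic", {}).get("last_name"),
        }
        for r in doc.get("results", [])
    ]

rows = extract_providers(SAMPLE_RESPONSE)
print(rows)
```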
## Pipeline Steps & Flow

This pipeline follows a Medallion Architecture (Landing → Bronze → Silver → Gold), progressively refining raw hospital RCM data into business-ready insights. Each stage builds upon the last, handling multiple data sources (EMR databases, claims CSVs, APIs) through ingestion, transformation, and aggregation. The architecture ensures scalability, data quality, historization, and optimized performance for downstream analytics.
### Landing & Bronze Layers (Raw Data Ingestion)

- Format: Parquet
- Objective: Standardize raw data ingestion for structured storage.
1. **EMR Data Extraction**
   - Extracts Electronic Medical Records (EMR) from hospital databases.
   - Supports full (entire dataset) and incremental (only new/modified records) loads.
   - Data is stored in the Bronze layer (Parquet format), maintaining the raw structure for historical tracking.
2. **Claims Data Ingestion**
   - Insurance claims data arrives as CSV files in the Landing layer.
   - Converted to Parquet format in the Bronze layer for efficient storage and querying.
3. **NPI & ICD Code Fetching**
   - Public APIs provide provider (NPI) and diagnosis (ICD) codes.
   - Data is fetched programmatically and stored in the Bronze layer for downstream enrichment.
**Why Parquet?**

- Columnar storage improves query speed.
- Compression (Snappy, Gzip) reduces the storage footprint.
- Schema evolution support ensures flexibility.
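The incremental loads above hinge on a watermark: only rows modified since the last successful run are extracted, and the watermark then advances to the newest modification time seen. A minimal pure-Python sketch of that mechanic (field and variable names are illustrative; in the pipeline this logic lives in ADF queries and the config file):

```python
from datetime import datetime

def incremental_slice(records, last_watermark):
    """Return rows modified after the last watermark, plus the next watermark."""
    fresh = [r for r in records if r["modified_date"] > last_watermark]
    # If nothing changed, the watermark simply stays where it was.
    next_watermark = max((r["modified_date"] for r in fresh), default=last_watermark)
    return fresh, next_watermark

rows = [
    {"encounter_id": 1, "modified_date": datetime(2024, 1, 1)},
    {"encounter_id": 2, "modified_date": datetime(2024, 1, 5)},
    {"encounter_id": 3, "modified_date": datetime(2024, 1, 9)},
]
fresh, wm = incremental_slice(rows, datetime(2024, 1, 3))
print(len(fresh), wm)  # two rows are newer than the watermark
```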
### Silver Layer (Cleaned & Historized Data)

- Format: Delta Tables
- Objective: Clean, unify, and historize data while maintaining integrity.
1. **Data Cleaning & Standardization**
   - Schema unification (e.g., `PatientID` from different hospitals mapped to `SRC_PatientID`).
   - Deduplication and null handling (e.g., flagging missing critical fields).
2. **Full Refresh for Static Tables**
   - Example: department listings.
   - Merges data from multiple sources, ensuring a current, de-duplicated view.
3. **Slowly Changing Dimension Type 2 (SCD2) for Dynamic Tables**
   - Example: patient data.
   - Tracks historical changes (e.g., address updates).
   - Keeps previous records available while marking the latest as `is_current = true`.
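The cleaning step can be sketched in plain Python. The column mappings and critical-field list below are hypothetical; in the project they live in Databricks notebooks operating on Spark DataFrames.

```python
# Illustrative per-source column mappings; the real mappings are defined
# in the Silver-layer notebooks, not here.
COLUMN_MAP = {
    "hospital_a": {"PatientID": "SRC_PatientID", "FName": "FirstName"},
    "hospital_b": {"ID": "SRC_PatientID", "F_Name": "FirstName"},
}
CRITICAL_FIELDS = ("SRC_PatientID", "FirstName")

def standardize(row: dict, source: str) -> dict:
    """Rename source-specific columns and quarantine rows missing critical fields."""
    out = {COLUMN_MAP[source].get(col, col): val for col, val in row.items()}
    out["datasource"] = source
    # Rows with empty critical fields are flagged, not dropped, so they can
    # be filtered out later with is_quarantined = false.
    out["is_quarantined"] = any(not out.get(field) for field in CRITICAL_FIELDS)
    return out

print(standardize({"PatientID": "P001", "FName": "Ada"}, "hospital_a"))
print(standardize({"ID": "P002", "F_Name": None}, "hospital_b"))
```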
**Why Delta Tables?**

- ACID transactions ensure consistency in merges and updates.
- Versioning allows rollback to previous states if needed.
- Optimized for high-performance transformations in a structured lakehouse environment.
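The SCD2 pattern above can be illustrated in pure Python. In the project this is a Delta Lake `MERGE` running in Databricks; the row shape, column names, and change-detection rule (address only) below are assumptions made for the sketch.

```python
from datetime import date

def scd2_merge(dim_rows, incoming, key="patient_id", load_date=date(2024, 1, 1)):
    """Expire changed current rows and append new versions (SCD Type 2)."""
    current = {r[key]: r for r in dim_rows if r["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and old["address"] == new["address"]:
            continue                   # no change: keep the current version
        if old is not None:
            old["is_current"] = False  # close out the old version
            old["end_date"] = load_date
        dim_rows.append({**new, "is_current": True,
                         "start_date": load_date, "end_date": None})
    return dim_rows

dim = [{"patient_id": "P001", "address": "12 Elm St", "is_current": True,
        "start_date": date(2023, 6, 1), "end_date": None}]
dim = scd2_merge(dim, [{"patient_id": "P001", "address": "99 Oak Ave"}])
for row in dim:
    print(row["address"], row["is_current"])
```

The old address survives as a closed-out row while the new one becomes current, which is exactly what lets downstream queries reconstruct a patient's history.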
### Gold Layer (Business-Ready Analytics)

- Format: Delta Tables
- Objective: Create business-ready, analytics-friendly data.
1. **Fact Table Creation (`gld_transactions`)**
   - Aggregates claims and payment data.
   - Ensures only valid and active records contribute to KPIs.
2. **Dimension Table Creation (`gld_patients`)**
   - Stores historized patient details, ensuring only the latest, valid data is included.
3. **Filtering & Quality Checks**
   - `is_current = true` ensures only the latest active records are used.
   - `is_quarantined = false` removes incomplete or invalid data.
**Why the Gold Layer?**

- Business-friendly format for executives, analysts, and dashboards.
- Aggregated KPIs such as Days in Accounts Receivable (AR).
- Simplifies reporting and visualization without the complexity of raw transactional data.
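As an example of a Gold-layer KPI, Days in AR divides the outstanding accounts-receivable balance by average daily charges over the period. The function name and figures below are illustrative, not taken from the repository's queries.

```python
def days_in_ar(total_ar_balance: float, total_charges: float, period_days: int) -> float:
    """Days in Accounts Receivable = AR balance / average daily charges."""
    average_daily_charges = total_charges / period_days
    return total_ar_balance / average_daily_charges

# $400,000 outstanding against $1,200,000 charged over a 90-day period:
print(round(days_in_ar(400_000, 1_200_000, 90), 1))  # → 30.0
```

A lower number means cash is collected faster, which is why this metric anchors most RCM financial dashboards.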
Below is a high-level overview of the pipeline’s data movement:
```
SQL DBs / APIs / CSVs
         │
         ▼
Landing Layer  (raw claims data – CSV)
         │
         ▼
Bronze Layer   (Parquet – raw, source of truth)
         │
         ▼
Silver Layer   (Delta tables – cleaned, standardized, historized)
         │
         ▼
Gold Layer     (Delta tables – aggregated, business-ready)
```
## End-User Access & Consumption

| Layer | Primary Users | Use Cases |
|---|---|---|
| Bronze | Data Engineers | Raw data ingestion, debugging, lineage tracking |
| Silver | Data Scientists, ML Experts, Analysts | Machine learning models, advanced analytics, trends |
| Gold | Business Users | Financial reporting, operational analytics, dashboards |
## Key Learnings

- **Metadata Magic** – `load_config.csv` made the pipelines scalable: 10 tables, one setup.
- **Full vs. SCD2** – Full Refresh is simple for static data; SCD2's complexity pays off when history matters (e.g., patient moves).
- **Delta Power** – Delta tables brought reliability: ACID transactions and versioning beat Parquet's static limits.
- **Security Matters** – Key Vault turned a prototype into something secure, essential for production data engineering.
- **Troubleshooting** – Fixed real errors (e.g., SQL timeouts); data engineering is half coding, half detective work.
- **Real-World Ready** – Parallel runs, mounts, and flags prepared this pipeline for scale. Learned by doing!
## Project Structure

This repository is organized into the following directories and files:
- **`/1. Set up/`**
  - Description: Contains resources and scripts necessary for the initial setup of the project environment.
  - Purpose: Facilitates the configuration of infrastructure and services required for the data pipeline.
- **`/2. API extracts/`**
  - Description: Includes scripts and configurations for extracting data from various APIs.
  - Purpose: Automates the retrieval of external data sources to be integrated into the data pipeline.
- **`/3. Silver/`**
  - Description: Holds scripts and processes related to the Silver layer of the Medallion architecture.
  - Purpose: Focuses on cleaning, standardizing, and transforming raw data into a more refined format.
- **`/4. Gold/`**
  - Description: Contains scripts and queries for the Gold layer.
  - Purpose: Aggregates and prepares data for analytics and reporting, ensuring it is business-ready.
- **`/DDL/`**
  - Description: Stores Data Definition Language scripts.
  - Purpose: Defines the structure of databases, tables, and other schema-related components.
- **`/datasets/`**
  - Description: Contains sample datasets used for development and testing.
  - Purpose: Provides example data to simulate real-world scenarios and validate pipeline processes.
- **`Gold Queries.sql`**
  - Description: SQL script containing queries for the Gold layer.
  - Purpose: Executes data aggregation and transformation to produce final datasets for analysis.
- **`Lookup_file_table_mapping.json`**
  - Description: JSON file mapping source files to their corresponding database tables.
  - Purpose: Guides the ingestion process by specifying how data files correspond to database structures.
- **`/images/`**
  - Description: Includes visual assets such as diagrams and flowcharts.
  - Purpose: Supports documentation by providing visual representations of the project's architecture and workflows.
- **`README.md`**
  - Description: The main documentation file for the repository.
  - Purpose: Provides an overview of the project, including its purpose, structure, and usage instructions.
This structured layout ensures that all components of the data pipeline are organized and easily accessible, facilitating efficient development, collaboration, and maintenance.
## Contributing

Interested in enhancing this project? We welcome contributions! To get started, fork the repository, implement your improvements, and submit a pull request for review. Below are key areas where you can make an impact:
- Optimize SCD2 Performance: Improve the efficiency of Slowly Changing Dimension Type 2 processes to handle larger datasets or reduce processing time.
- Add More RCM KPIs: Expand the pipeline’s analytics capabilities by incorporating additional Revenue Cycle Management metrics, such as claim denial rate.
- Enhance Error Logging: Strengthen debugging and monitoring by improving the granularity and structure of error logs.
Your expertise can help elevate this pipeline—thank you for considering a contribution!
## License

This project is licensed under the MIT License; see LICENSE for details.
## Contact

Questions or ideas? Reach me at:
- GitHub: jitendrabhamare
- Email: jitendra@example.com
Happy engineering!

