This project implements a fully serverless, end-to-end data pipeline on AWS that ingests real-world healthcare staffing data from Google Drive, processes it through a structured multi-layer data lake, enforces data quality rules, and exposes analytics-ready datasets for querying and visualization.
Unlike simple ETL projects, this pipeline is designed to reflect real production systems, with a strong focus on reliability, traceability, and data integrity.
Key capabilities include:
- Automated ingestion of large external datasets (~214 MB) using AWS Lambda
- Structured data lake architecture (raw → refined → validation → curated)
- Explicit schema enforcement and data standardization
- Data quality validation with quarantine handling
- Partitioned Parquet datasets for efficient querying
- End-to-end orchestration using AWS Step Functions
- Monitoring, alerting, and failure recovery using CloudWatch, SNS, and SQS
In real-world data engineering, pipelines must do more than just move data — they must ensure that data is:
- Reliable → no silent failures or partial loads
- Traceable → every dataset can be tied back to its origin
- Reproducible → reruns do not corrupt or duplicate data
- Observable → failures are detectable and debuggable
This project was designed with those principles in mind, incorporating production-grade patterns that are often missing from portfolio projects.
A critical component of this pipeline is the use of manifest files during ingestion.
After each ingestion run, the pipeline generates a manifest file that records:
- File name
- Source location
- Ingestion timestamp
- S3 destination path
- Run identifier
Manifests enable several important capabilities:
- Idempotent reruns → the pipeline can safely rerun without duplicating or reprocessing the same files
- Lineage → every dataset in the data lake can be traced back to its exact ingestion event
- Targeted replay → if a downstream failure occurs, the system can identify exactly which files were processed and replay only those datasets
- Auditability → a clear record of what data was ingested, when, and from where, which is a requirement in many real-world data systems
- Performance and consistency → downstream jobs (Glue, Athena, etc.) can rely on manifests instead of scanning entire S3 directories
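As a sketch of how a manifest supports safe reruns (field names here are illustrative, not the pipeline's exact schema):

```python
from datetime import datetime, timezone

def build_manifest(run_id, source, files, bucket, prefix):
    """Record what was ingested, when, and from where (field names illustrative)."""
    return {
        "run_id": run_id,
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_location": source,
        "files": [
            {"file_name": f, "s3_destination": f"s3://{bucket}/{prefix}/{f}"}
            for f in files
        ],
    }

def already_ingested(file_name, manifests):
    """Idempotency check: skip any file that appears in a prior run's manifest."""
    return any(
        entry["file_name"] == file_name
        for m in manifests
        for entry in m["files"]
    )
```

In the pipeline, a manifest like this would be serialized to JSON and written alongside the raw data in S3, so downstream jobs can read it instead of listing the bucket.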
High-level architecture:
- AWS Lambda retrieves data from Google Drive
- Raw data is stored in Amazon S3
- AWS Step Functions orchestrates the pipeline
- AWS Glue processes data through multiple layers:
- Refined
- Validation
- Curated
- Amazon Athena provides querying capabilities
- Amazon QuickSight delivers dashboards
- CloudWatch + SNS handle monitoring and alerts
- SQS captures failed executions (dead-letter queue)
Technology stack:
- AWS Lambda
- Amazon S3
- AWS Step Functions
- AWS Glue
- Amazon Athena
- Amazon QuickSight
- Amazon CloudWatch
- Amazon SNS
- Amazon SQS
- AWS Secrets Manager
- Python
Staffing dataset:
- One row per provider per day
- Includes staffing hours and patient census
Provider dataset:
- Provider metadata
- Includes:
- Provider ID (CCN)
- Bed count
- Hospital affiliation
- Rating
Raw layer:
- Stores original CSV files
- Immutable
- Partitioned by ingestion date
Refined layer:
- Standardized column names
- Explicit data typing
- Preserved leading zeroes for 'provnum' field
- Added metadata fields
Curated layer:
- Analytics-ready datasets
- Stored as Parquet
- Partitioned by:
- work_year
- work_month
Pipeline flow:
- Invoke ingestion Lambda
- Run refine Glue job
- Run validation Glue job
- Run curated Glue job
Reliability features:
- Retry logic
- Error handling (Catch)
- CloudWatch logging
- SQS dead-letter queue integration
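A minimal Step Functions state fragment showing the retry/catch pattern described above (job names, topic ARN, and retry settings are placeholders, not the pipeline's actual definition):

```json
{
  "StartAt": "RunRefineJob",
  "States": {
    "RunRefineJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "refine-job" },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }
      ],
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "Message.$": "$.Error"
      },
      "End": true
    }
  }
}
```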
Ingestion Lambda:
- Uses Google Drive API (OAuth)
- Credentials stored in Secrets Manager
- Streams large files (~214 MB)
- Optimized for memory and speed
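The memory-friendly streaming can be sketched as a part-buffering loop. In the real Lambda the chunk iterator would come from the Google Drive API and `upload_part` would wrap an S3 multipart upload; both are stubbed here, and the 8 MB part size is an assumption:

```python
import io

PART_SIZE = 8 * 1024 * 1024  # assumed part size; keeps Lambda memory flat

def stream_to_s3(chunks, upload_part):
    """Feed an iterable of byte chunks (e.g. a Drive media download) to an
    uploader callback (e.g. S3 multipart UploadPart), buffering at most one
    part at a time. Returns (parts_uploaded, total_bytes)."""
    buf = io.BytesIO()
    parts = 0
    total = 0
    for chunk in chunks:
        buf.write(chunk)
        total += len(chunk)
        if buf.tell() >= PART_SIZE:
            upload_part(parts + 1, buf.getvalue())
            parts += 1
            buf = io.BytesIO()
    if buf.tell():  # flush the final partial part
        upload_part(parts + 1, buf.getvalue())
        parts += 1
    return parts, total
```

Because at most one part is buffered at a time, a ~214 MB file never has to fit in Lambda memory at once.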
Refine Glue job:
- Standardizes schema
- Casts data types
- Preserves identifiers
- Adds metadata columns
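A pandas sketch of the same enforcement ideas (the actual job runs on AWS Glue, and source column names like `PROVNUM` and `Hrs_RN` are illustrative):

```python
import io
import pandas as pd

# Explicit dtypes: 'PROVNUM' is read as a string so CCNs such as
# '015009' keep their leading zero instead of being inferred as an int.
SCHEMA = {"PROVNUM": "string", "Hrs_RN": "float64", "MDScensus": "int64"}

def refine(csv_text, run_id):
    df = pd.read_csv(io.StringIO(csv_text), dtype=SCHEMA)
    # Standardize column names.
    df.columns = [c.lower() for c in df.columns]
    df = df.rename(columns={"hrs_rn": "hours_rn", "mdscensus": "patient_census"})
    # Add lineage metadata.
    df["ingestion_run_id"] = run_id
    return df
```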
Validation Glue job:
- Referential integrity checks
- Null checks
- Schema validation
Validation surfaced a real data quality issue: 1,547 rows were missing provider references.
Instead of dropping data, the pipeline:
- Created inferred provider records
- Added flags:
  - provider_reference_gap_flag
  - provider_reference_source
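This handling might look like the following pandas sketch (flag names are taken from the list above; flag values and everything else are illustrative):

```python
import pandas as pd

def patch_provider_gaps(fact, dim):
    """Referential-integrity repair: every fact 'provnum' must exist in the
    provider dimension. Instead of dropping orphan rows, synthesize inferred
    provider records and flag them for transparency."""
    dim = dim.copy()
    dim["provider_reference_gap_flag"] = False
    dim["provider_reference_source"] = "source_file"
    # Providers referenced by the fact table but absent from the dimension.
    missing = sorted(set(fact["provnum"]) - set(dim["provnum"]))
    inferred = pd.DataFrame({
        "provnum": missing,
        "provider_reference_gap_flag": True,
        "provider_reference_source": "inferred_from_fact",
    })
    return pd.concat([dim, inferred], ignore_index=True)
```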
Curated Glue job:
- Creates fact and dimension tables
- Derives partition columns
- Writes Parquet output
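Deriving the partition columns before the Parquet write could look like this (the `workdate` source column name is an assumption):

```python
import pandas as pd

def add_partitions(df, date_col="workdate"):
    """Derive the work_year/work_month columns the curated layer is
    partitioned by ('workdate' is an assumed source column name)."""
    d = pd.to_datetime(df[date_col])
    df = df.copy()
    df["work_year"] = d.dt.year
    df["work_month"] = d.dt.month
    return df

# The write itself then partitions on those columns (requires pyarrow;
# bucket path is a placeholder, not run here):
# df.to_parquet("s3://lake-bucket/curated/fact_staffing/",
#               partition_cols=["work_year", "work_month"])
```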
Fact table:
- Grain: provider per day
- Contains staffing and census data
Dimension table:
- Provider attributes
- Includes ratings, bed counts, hospital flags
Athena:
- Tables created via manual DDL
- Partitioned by year and month
- Optimized for query performance
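The manual DDL for a curated table might resemble the following; database, column, and bucket names are placeholders, not the project's exact schema:

```sql
-- Illustrative Athena DDL for the curated fact table.
CREATE EXTERNAL TABLE IF NOT EXISTS curated.fact_staffing (
  provnum        string,
  hours_rn       double,
  patient_census int
)
PARTITIONED BY (work_year int, work_month int)
STORED AS PARQUET
LOCATION 's3://lake-bucket/curated/fact_staffing/';

-- Register newly written partitions after each run:
MSCK REPAIR TABLE curated.fact_staffing;
```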
QuickSight dashboards include:
- Hours Reported by Nurse Type
- Employee vs Contractor Hours
- Average Overall Rating by Providers Located In/Out of Hospitals
- Count of Providers Located In/Out of Hospitals
- Total Hours Reported by Provider
- CloudWatch logs for all services
- SNS alerts for failures
SQS dead-letter queue:
- Captures failed pipeline executions
- Stores error context for debugging
- Enables replay and recovery
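Replay tooling can start from the dead-letter messages. This sketch assumes the message body carries a Step Functions-style failure payload (`executionArn`/`error`/`input`), which is an assumption about this pipeline's DLQ format:

```python
import json

def parse_dlq_message(body):
    """Extract replay context from a dead-letter message body.
    The payload shape mirrors a Step Functions failure event (assumed)."""
    msg = json.loads(body)
    return {
        "execution": msg["executionArn"],
        "error": msg.get("error", "unknown"),
        # The original execution input, used to replay the failed run.
        "replay_input": json.loads(msg["input"]) if "input" in msg else None,
    }
```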
Step Functions was chosen for:
- Better error handling
- Native retries
- Cross-service orchestration
Explicit schema enforcement was chosen for:
- Strict schema control
- Predictability
- Avoiding inference errors
Some provider information was missing in the source data, so instead of dropping rows, the pipeline:
- Created inferred dimension records
- Flagged them for transparency
Scalability:
- Lambda optimized for large file ingestion
- Glue scales horizontally
- Step Functions supports high concurrency
This pipeline successfully:
- Ingests external data securely
- Enforces data quality
- Handles edge cases
- Produces analytics-ready datasets
- Powers business dashboards and ad hoc analytics
This project demonstrates:
- End-to-end data pipeline design
- AWS serverless architecture expertise
- Data modeling and warehousing
- Real-world data quality handling
- Observability and reliability patterns
Planned improvements:
- Add CI/CD pipeline (GitHub Actions)
- Implement data cataloging (Glue Data Catalog enhancements)
- Add automated data quality dashboards
- Introduce streaming ingestion for real-time processing
Built by Johnathon Smith, a data engineer focused on designing production-ready, scalable data systems.








