Title: FHIR2CDISC-Pilot: A Metadata-Driven Clinical Programming Pipeline
Duration: 12 Weeks (≈ 3 Months)
Goal: Rebuild executional fluency in SAS, R, and Python, and prove the ability to automate and validate an end-to-end clinical programming pipeline that transforms SDTM → ADaM → TLF, all driven by metadata.
Outcome: A public GitHub repository and a whitepaper demonstrating that you can design, code, and publish a standards-driven automation system.
⚙️ Environment & Prerequisites
- Software Setup
Stack | Tool | Purpose
SAS | SAS OnDemand for Academics | Clinical programming, SDTM → ADaM derivations, ODS outputs
R | R + RStudio | Analytical and reporting equivalence; visualization
Python | Python 3.11+, JupyterLab, VS Code | Automation engine, CLI, metadata scripts
DevOps | Git + GitHub | Version control and publication
Optional | Docker | Reproducibility & deployment container
- Python Libraries
pip install pandas numpy tabulate plotly jinja2 pytest lxml pyyaml dsjson scipy
- R Packages
install.packages(c("dplyr","gtsummary","gt","ggplot2","xml2","testthat","readr"))
- Folder Structure
On your local machine (and in the GitHub repo):
FHIR2CDISC-Pilot/
│
├── data/
│   ├── raw/
│   ├── sdtm/
│   └── adam/
├── scripts/
│   ├── sas/
│   ├── r/
│   └── python/
├── outputs/
├── tests/
└── docs/
- Data Setup
Download the CDISC Pilot SDTM/ADaM datasets (XPT files): DM.xpt, AE.xpt, LB.xpt, EX.xpt, ADSL.xpt, ADAE.xpt
From: https://github.com/cdisc-org/sdtm-adam-pilot-project
Place them into data/raw/.
📅 12-Week Detailed Roadmap
PHASE 1: Reactivation (Weeks 1–3)
Objective: Rebuild syntax memory, analytical reflexes, and formatted output fluency.
Week 1 – Diagnostic & Environment Setup
Goal: Verify all tools, packages, and folders are operational.
Tasks:
Import DM.xpt in SAS, R, Python.
Run descriptive statistics for AGE by treatment arm (ARM in DM; TRT01P is the corresponding ADSL variable).
Verify outputs are equivalent across all stacks.
Document your ease/difficulty in Reflection.md.
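For the Python side of the import task, one possible starting point is the sketch below. It assumes pandas is installed and that the file sits under data/raw/; the helper name load_xpt and the utf-8 default are illustrative choices, not part of the plan. Passing an encoding matters because pandas otherwise returns character columns as raw bytes, which makes comparison with SAS and R output awkward.

```python
from pathlib import Path


def load_xpt(path, encoding="utf-8"):
    """Read a SAS transport (XPT) file into a pandas DataFrame.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"XPT file not found: {path}")
    # Imported here so the path check above works even before pandas is installed.
    import pandas as pd

    # format="xport" selects the transport-file reader; encoding decodes
    # character columns to str instead of leaving them as bytes.
    return pd.read_sas(path, format="xport", encoding=encoding)
```

Typical use would be `dm = load_xpt("data/raw/DM.xpt")` followed by `dm.head()` to confirm the columns match what SAS and R report.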
Example Outputs:
TRT01P | N | Mean Age | SD | Min | Max
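The summary layout above can be produced with a few lines of standard-library Python, which is handy as a reference when checking the SAS and R numbers. This is a minimal sketch: the function name, the record layout, and the toy subjects are made up for illustration and are not pilot data. The grouping variable is a parameter because DM carries ARM while ADSL carries TRT01P.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_age(records, group_var):
    """Return N, mean, SD, min and max of AGE for each level of group_var."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_var]].append(rec["AGE"])
    summary = {}
    for level, ages in sorted(groups.items()):
        summary[level] = {
            "N": len(ages),
            "Mean": round(mean(ages), 1),
            # Sample SD (n - 1 denominator), matching the PROC MEANS and R sd() defaults.
            "SD": round(stdev(ages), 2) if len(ages) > 1 else None,
            "Min": min(ages),
            "Max": max(ages),
        }
    return summary


# Toy records, invented for illustration only.
toy_dm = [
    {"USUBJID": "S-001", "ARM": "Placebo", "AGE": 60},
    {"USUBJID": "S-002", "ARM": "Placebo", "AGE": 70},
    {"USUBJID": "S-003", "ARM": "Active", "AGE": 65},
    {"USUBJID": "S-004", "ARM": "Active", "AGE": 75},
    {"USUBJID": "S-005", "ARM": "Active", "AGE": 70},
]
age_summary = summarize_age(toy_dm, group_var="ARM")
```

Rounding is applied only at the end so that small differences in display precision between stacks do not show up as mismatches.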
Deliverables:
3 scripts (SAS, R, Python)
Commit: Week1_Diagnostics
Weeks 2–3 – TLF Reactivation
Goal: Produce formatted clinical outputs (Table, Listing, Figure) across stacks.
Outputs:
- Table 1 – Demographics Summary
AGE, SEX, RACE by TRT01P
- Listing 1 – Adverse Events
USUBJID, AETERM, AESEV, AESER, AESTDTC, AEENDTC
- Figure 1 – AE Severity by Treatment (Bar Chart)
SAS:
Use PROC MEANS, PROC REPORT, ODS RTF.
R:
dplyr + gtsummary + gt for publication-ready tables.
Python:
pandas + plotly for figure output.
tabulate for text-style tables.
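Before plotting Figure 1, it can help to build the counts behind the bars explicitly, so the same numbers can be compared against SAS and R. The sketch below uses only the standard library; the function name, the output keys, and the toy rows are assumptions for illustration, and the treatment variable is passed in because its name depends on the dataset used.

```python
from collections import Counter

# Display order for AESEV values.
SEVERITY_ORDER = ["MILD", "MODERATE", "SEVERE"]


def severity_table(ae_rows, trt_var, sev_var="AESEV"):
    """Count AE records per treatment and severity, including zero cells."""
    counts = Counter((row[trt_var], row[sev_var]) for row in ae_rows)
    treatments = sorted({trt for trt, _ in counts})
    return [
        {"treatment": trt, "severity": sev, "events": counts[(trt, sev)]}
        for trt in treatments
        for sev in SEVERITY_ORDER
    ]


# Toy rows, invented for illustration only.
toy_ae = [
    {"TRTA": "Placebo", "AESEV": "MILD"},
    {"TRTA": "Placebo", "AESEV": "MILD"},
    {"TRTA": "Active", "AESEV": "SEVERE"},
    {"TRTA": "Active", "AESEV": "MILD"},
]
fig1_data = severity_table(toy_ae, trt_var="TRTA")
```

The resulting rows can be loaded into a pandas DataFrame and passed to plotly.express.bar with x, y, and color arguments to draw the grouped bar chart.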
Analytical Refresh:
t-test / ANOVA for AGE
Chi-square for SEX vs TRT01P
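To see what the chi-square test computes before relying on library output, the Pearson statistic can be worked out by hand. The sketch below is standard-library only and uses an invented 2 × 2 table; the function name is illustrative. One point worth checking when comparing stacks: scipy.stats.chi2_contingency and R's chisq.test both apply a continuity correction to 2 × 2 tables by default, while the value computed here is the uncorrected statistic, so the defaults need to be aligned before the numbers will match.

```python
def pearson_chi_square(table):
    """Uncorrected Pearson chi-square statistic and degrees of freedom.

    table is a list of rows of observed counts, e.g. SEX (rows) by treatment (columns).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented counts for illustration: rows are F and M, columns are two arms.
toy_counts = [[10, 20], [20, 10]]
chi2_stat, chi2_dof = pearson_chi_square(toy_counts)
```

With every expected count equal to 15, each cell contributes 25/15, so the statistic is 100/15 ≈ 6.67 on 1 degree of freedom.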
Deliverables:
Matching outputs across stacks
Folder: outputs/week2_3/
Commit: Week3_TLF_Reactivation_Complete
PHASE 2: Integration (Weeks 4–6)
Objective: Derive ADaM datasets from SDTM and validate transformations across stacks.
Weeks 4–5 – SDTM to ADaM Derivation
Goal: Demonstrate dataset lineage and variable consistency.
Tasks:
Convert pilot SDTM datasets (DM, AE) into structured CSV via Python.
Create ADSL (subject-level) and ADAE (event-level) in SAS and R.
Develop config.json defining derivation metadata (e.g., source variable mappings).
Write a Python script to read this config and auto-generate ADaM skeletons.
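One way to picture the config-driven step is a small mapping file plus a function that applies it. Everything about the layout below is an assumption made for illustration: the keys "source", "variables", and "from", the function name, and the toy DM row are hypothetical, and a real config would need derivation rules beyond straight copies.

```python
import json

# Hypothetical config.json content: each ADaM variable names its SDTM source.
CONFIG_JSON = """
{
  "ADSL": {
    "source": "DM",
    "variables": {
      "STUDYID": {"from": "STUDYID"},
      "USUBJID": {"from": "USUBJID"},
      "AGE":     {"from": "AGE"},
      "SEX":     {"from": "SEX"},
      "TRT01P":  {"from": "ARM"}
    }
  }
}
"""


def build_skeleton(config, dataset, source_rows):
    """Create skeleton records for `dataset` by copying mapped source variables."""
    mapping = config[dataset]["variables"]
    return [
        {target: row.get(spec["from"]) for target, spec in mapping.items()}
        for row in source_rows
    ]


config = json.loads(CONFIG_JSON)
# Toy DM row, invented for illustration only.
toy_dm_rows = [
    {"STUDYID": "PILOT01", "USUBJID": "S-001", "AGE": 60, "SEX": "F", "ARM": "Placebo"}
]
adsl_skeleton = build_skeleton(config, "ADSL", toy_dm_rows)
```

Keeping the mapping in data rather than code means the same file can later drive the SAS and R derivations and the define.xml metadata.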
Deliverables:
SDTM and ADaM datasets under data/sdtm/ and data/adam/
config.json metadata file
Validation report in docs/validation_log.md
Commit: Week5_SDTM_ADaM_Integration
Week 6 – Analytical Validation
Goal: Implement automated validation between ADaM and SDTM outputs.
Tasks:
Compute AE incidence rates (% subjects with AE).
Compare results across stacks (SAS vs R vs Python).
Build unit tests:
Python: pytest
R: testthat
SAS: simple %assert macro
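A sketch of the Python piece of this week follows: an incidence function and a pytest-style test for it. The function and test names and the toy records are illustrative assumptions. The test targets the most common mistake in this calculation, counting events instead of subjects.

```python
def ae_incidence(adsl, adae, trt_var="TRT01P"):
    """Percent of subjects with at least one AE, per treatment group.

    Denominators come from ADSL; numerators count distinct subjects in ADAE.
    """
    subject_trt = {row["USUBJID"]: row[trt_var] for row in adsl}
    denominators = {}
    for trt in subject_trt.values():
        denominators[trt] = denominators.get(trt, 0) + 1
    subjects_with_ae = {trt: set() for trt in denominators}
    for event in adae:
        trt = subject_trt.get(event["USUBJID"])
        if trt is not None:  # events for subjects missing from ADSL are skipped
            subjects_with_ae[trt].add(event["USUBJID"])
    return {
        trt: {
            "n": len(subjects_with_ae[trt]),
            "N": n_total,
            "pct": round(100 * len(subjects_with_ae[trt]) / n_total, 1),
        }
        for trt, n_total in denominators.items()
    }


def test_incidence_counts_subjects_not_events():
    # Toy records, invented for illustration only.
    adsl = [
        {"USUBJID": "S-001", "TRT01P": "Placebo"},
        {"USUBJID": "S-002", "TRT01P": "Placebo"},
        {"USUBJID": "S-003", "TRT01P": "Active"},
    ]
    adae = [
        {"USUBJID": "S-001", "AETERM": "HEADACHE"},
        {"USUBJID": "S-001", "AETERM": "NAUSEA"},  # second event, same subject
        {"USUBJID": "S-003", "AETERM": "RASH"},
    ]
    result = ae_incidence(adsl, adae)
    assert result["Placebo"] == {"n": 1, "N": 2, "pct": 50.0}
    assert result["Active"] == {"n": 1, "N": 1, "pct": 100.0}
```

Saved under tests/ in a file whose name starts with test_, the test function is picked up automatically by pytest.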
Deliverables:
AE summaries in /outputs/
Validation logs
Passing unit tests
Commit: Week6_Validation_Complete
PHASE 3: Automation (Weeks 7–9)
Objective: Transform scripts into a reusable automation system.
Week 7 – Metadata-Driven Architecture
Goal: Make code fully configuration-driven.
Tasks:
Create a Python driver script that reads metadata (config.yaml or config.json).
Build templates using jinja2 for generating dataset code dynamically.
Enable command-line execution:
python -m fhir2cdisc --input data/raw --output data/sdtm
Implement error handling and log messages.
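The command-line entry point with basic error handling and logging might look like the sketch below. Only the --input and --output options come from the command shown above; the function names, the log format, the exit code 2 for a missing input folder, and the placeholder body are assumptions. The actual processing step is left out.

```python
import argparse
import logging
from pathlib import Path

log = logging.getLogger("fhir2cdisc")


def build_parser():
    parser = argparse.ArgumentParser(
        prog="fhir2cdisc", description="Metadata-driven clinical programming pipeline"
    )
    parser.add_argument("--input", required=True, type=Path, help="folder with source XPT files")
    parser.add_argument("--output", required=True, type=Path, help="folder for generated datasets")
    return parser


def main(argv=None):
    """Run the pipeline; returns a process exit code (0 on success)."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

    if not args.input.is_dir():
        log.error("Input folder not found: %s", args.input)
        return 2  # assumed convention: non-zero signals failure to the caller

    args.output.mkdir(parents=True, exist_ok=True)
    xpt_files = sorted(args.input.glob("*.xpt"))
    log.info("Found %d XPT file(s) in %s", len(xpt_files), args.input)
    # Placeholder: the metadata-driven processing of each file would go here.
    return 0
```

For `python -m fhir2cdisc` to work, the package would also need a `__main__.py` that calls `sys.exit(main())`. Returning an exit code rather than calling `sys.exit` inside `main` keeps the function easy to call from tests.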
Deliverables:
CLI fhir2cdisc functional
Config-driven output generation
Week 8 – Continuous Integration
Goal: Add reproducibility checks.
Tasks:
Create .github/workflows/ci.yml:
Runs Python unit tests (pytest)
Runs R checks (testthat)
Validate outputs automatically on commit.
Capture test logs as build artifacts.
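For the automatic output validation, a small comparison script that CI can run is one option. The sketch below compares two summary tables exported as CSV text, treating numeric cells as equal within a relative tolerance so that harmless floating-point differences between stacks do not fail the build. The function name, the tolerance, and the toy tables are assumptions for illustration.

```python
import csv
import io
import math


def compare_tables(csv_a, csv_b, key, rel_tol=1e-6):
    """Return a list of differences between two CSV tables keyed on `key`."""

    def load(text):
        return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

    a, b = load(csv_a), load(csv_b)
    diffs = []
    for k in sorted(set(a) | set(b)):
        if k not in a or k not in b:
            diffs.append(f"{k}: row missing from one table")
            continue
        for col, va in a[k].items():
            if col == key:
                continue
            vb = b[k].get(col)
            try:
                same = math.isclose(float(va), float(vb), rel_tol=rel_tol)
            except (TypeError, ValueError):
                same = va == vb  # non-numeric or missing cells: exact match only
            if not same:
                diffs.append(f"{k}.{col}: {va} != {vb}")
    return diffs


# Toy summaries, invented for illustration only.
first_stack = "TRT,N,MEAN\nA,2,65.0\nB,3,70.0\n"
second_stack = "TRT,N,MEAN\nA,2,65.0000001\nB,3,71.0\n"
differences = compare_tables(first_stack, second_stack, key="TRT")
```

A CI step can fail the build when the returned list is non-empty and write the list to a log file that is uploaded as an artifact.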
Deliverables:
Passing CI pipeline on GitHub Actions
Commit: Week8_CI_Workflow_Built
Week 9 – Containerization (Optional Stretch)
Goal: Make the pipeline reproducible on any machine that can run Docker.
Tasks:
Write a Dockerfile that installs Python, R, SASPy (optional), and your project.
Validate that docker run reproduces outputs.
Deliverables:
Docker image buildable locally
Commit: Week9_Docker_Ready
PHASE 4: Demonstration (Weeks 10–12)
Objective: Build final TLFs, generate Define-XML, and publish publicly.
Week 10 – Define-XML & Dataset-JSON Integration
Goal: Demonstrate standards automation.
Tasks:
Generate a basic define.xml from your SDTM/ADaM metadata.
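from lxml import etree

# Placeholder structure only: a conformant Define-XML file uses an ODM root element.
root = etree.Element("Define")
etree.SubElement(root, "StudyName").text = "CDISC Pilot Study"
etree.ElementTree(root).write("data/define.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8")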
Cross-check metadata against datasets.
Validate Dataset-JSON using dsjson.
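The metadata cross-check can start as a plain comparison of variable names. In the sketch below, the metadata is assumed to be a mapping from dataset name to its list of documented variables, and the datasets are represented only by their column names; the function name, both structures, and the toy inputs are hypothetical.

```python
def cross_check(metadata, dataset_columns):
    """List mismatches between documented variables and actual dataset columns."""
    findings = []
    for name, documented in metadata.items():
        if name not in dataset_columns:
            findings.append(f"{name}: described in metadata but dataset not found")
            continue
        actual = dataset_columns[name]
        for var in documented:
            if var not in actual:
                findings.append(f"{name}.{var}: in metadata, missing from dataset")
        for var in actual:
            if var not in documented:
                findings.append(f"{name}.{var}: in dataset, missing from metadata")
    for name in dataset_columns:
        if name not in metadata:
            findings.append(f"{name}: dataset has no metadata entry")
    return findings


# Toy inputs, invented for illustration only.
toy_metadata = {"DM": ["USUBJID", "AGE", "SEX"], "AE": ["USUBJID", "AETERM"]}
toy_columns = {"DM": ["USUBJID", "AGE", "RACE"]}
findings = cross_check(toy_metadata, toy_columns)
```

Writing the findings to the validation log in /docs/ gives a record of which metadata entries still need attention.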
Deliverables:
define.xml
dataset-metadata.json
Validation log in /docs/
Commit: Week10_Standards_Integration
Weeks 11–12 – Final Integration & Publication
Goal: Package and publish the entire pipeline.
Tasks:
Run full pipeline end-to-end (SDTM → ADaM → TLF).
Ensure reproducibility across SAS, R, Python.
Create final documentation in /docs/final_summary.md.
Write LinkedIn article: “Building a Metadata-Driven Clinical Programming Pipeline”
Publish GitHub repo publicly.
Deliverables:
Public repo with README.md, images, and sample outputs.
LinkedIn publication under Dr. CliniData.
Optional 2-min video demo.
Commit: Final_Integration_Complete
🎯 Final Outcomes & Deliverables
Type | Deliverable | Demonstrates
Code | SAS, R, Python scripts | Multilingual clinical programming fluency
Data | SDTM & ADaM datasets | Standards transformation logic
Automation | Python CLI fhir2cdisc | Engineering design & metadata control
Validation | Unit tests, QC logs | Reproducibility & QA rigor
Outputs | TLFs (RTF, HTML, PNG) | Analytical equivalence
Documentation | README, define.xml, lineage diagram | Standards compliance
DevOps | CI/CD workflow | Engineering discipline
Visibility | GitHub repo + LinkedIn article | Professional credibility
🧩 After-Project Extensions (Post Week 12)
Once complete, extend your system with:
Real FHIR → SDTM mapping (Patient, Observation).
Full Define-XML writer/reader automation.
Advanced analytics (KM, logistic regression).
R Shiny or Dash dashboard for interactive visualization.
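As a taste of the first extension, the sketch below maps a FHIR Patient resource (already parsed from JSON into a dict) to a handful of DM variables. The function name, the USUBJID construction, and the choice to send every gender other than male or female to "U" are simplifying assumptions; a real mapping would need study-specific rules and many more variables.

```python
# FHIR administrative gender → SDTM SEX; anything unmapped falls back to "U".
SEX_MAP = {"male": "M", "female": "F"}


def patient_to_dm(patient, studyid):
    """Map a FHIR Patient resource (as a dict) to a partial DM record."""
    subjid = patient["id"]
    return {
        "STUDYID": studyid,
        "DOMAIN": "DM",
        "USUBJID": f"{studyid}-{subjid}",  # assumed construction rule
        "SUBJID": subjid,
        "BRTHDTC": patient.get("birthDate", ""),  # FHIR dates are already ISO 8601
        "SEX": SEX_MAP.get(patient.get("gender"), "U"),
    }


# Toy resource, invented for illustration only.
toy_patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "gender": "female",
    "birthDate": "1950-01-01",
}
dm_record = patient_to_dm(toy_patient, studyid="PILOT01")
```

Observation resources would follow the same pattern, with the target domain (for example LB or VS) chosen from the observation's code.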
🧠 End Result
By Week 12 you will have:
- Hands-on fluency restored across SAS, R, and Python.
- Validated SDTM → ADaM → TLF pipeline, metadata-driven.
- Automated reproducibility system (CLI + CI/CD).
- A public GitHub project + technical article proving cross-stack competence.
This is your proof-of-competence artifact — a complete clinical data science pipeline integrating programming, standards, and automation in one demonstrable system.