GitHub - trinathpanda/FHIR2CDISC-Pilot

Title: FHIR2CDISC-Pilot: A Metadata-Driven Clinical Programming Pipeline

Duration: 12 Weeks (≈ 3 Months)

Goal: Rebuild executional fluency in SAS, R, and Python and prove the ability to automate and validate an end-to-end clinical programming pipeline that transforms SDTM → ADaM → TLF, all driven by metadata.

Outcome: A public GitHub repository and a whitepaper demonstrating you can design, code, and publish a standards-driven automation system.

⚙️ Environment & Prerequisites

Software Setup

| Stack | Tool | Purpose |
| --- | --- | --- |
| SAS | SAS OnDemand for Academics | Clinical programming, SDTM → ADaM derivations, ODS outputs |
| R | R + RStudio | Analytical and reporting equivalence; visualization |
| Python | Python 3.11+, JupyterLab, VS Code | Automation engine, CLI, metadata scripts |
| DevOps | Git + GitHub | Version control and publication |
| Optional | Docker | Reproducibility & deployment container |

Python Libraries

pip install pandas numpy tabulate plotly jinja2 pytest lxml pyyaml dsjson scipy

R Packages

install.packages(c("dplyr","gtsummary","gt","ggplot2","xml2","testthat","readr"))

Folder Structure

On your local machine (and in the GitHub repo):

FHIR2CDISC-Pilot/
├── data/
│   ├── raw/
│   ├── sdtm/
│   └── adam/
├── scripts/
│   ├── sas/
│   ├── r/
│   └── python/
├── outputs/
├── tests/
└── docs/
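The layout can also be created from a script so that every clone starts from the same skeleton. The following is a minimal sketch using only the standard library; the function name `scaffold` and the list `PROJECT_DIRS` are hypothetical names chosen for this example.

```python
from pathlib import Path

# Folder names copied from the structure described in this section.
PROJECT_DIRS = [
    "data/raw", "data/sdtm", "data/adam",
    "scripts/sas", "scripts/r", "scripts/python",
    "outputs", "tests", "docs",
]


def scaffold(root):
    """Create every project folder under `root`; existing folders are left untouched."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold("FHIR2CDISC-Pilot")` from the parent directory creates the nine folders. Git does not track empty directories, so a placeholder file such as `.gitkeep` may be needed before committing.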

Data Setup

Download the CDISC Pilot SDTM/ADaM datasets (XPT files) DM.xpt, AE.xpt, LB.xpt, EX.xpt, ADSL.xpt, and ADAE.xpt from https://github.com/cdisc-org/sdtm-adam-pilot-project and place them in data/raw/.
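Before starting Week 1 it may help to confirm that all six files actually landed in data/raw/. The sketch below is one way to do that; `missing_files` is a hypothetical helper, and the case-insensitive match is an assumption made because file-name case can differ between download sources.

```python
from pathlib import Path

# File names listed in the Data Setup step.
EXPECTED = ["DM.xpt", "AE.xpt", "LB.xpt", "EX.xpt", "ADSL.xpt", "ADAE.xpt"]


def missing_files(raw_dir, expected=EXPECTED):
    """Return the expected file names that are not present in raw_dir.

    Names are compared case-insensitively (an assumption of this sketch).
    """
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        return list(expected)
    present = {p.name.lower() for p in raw_dir.iterdir() if p.is_file()}
    return [name for name in expected if name.lower() not in present]
```

An empty list from `missing_files("data/raw")` means the data setup is complete.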

📅 12-Week Detailed Roadmap

PHASE 1: Reactivation (Weeks 1–3)

Objective: Rebuild syntax memory, analytical reflexes, and formatted output fluency.

Week 1 – Diagnostic & Environment Setup

Goal: Verify all tools, packages, and folders are operational.

Tasks:

Import DM.xpt in SAS, R, Python.

Run descriptive statistics for AGE by treatment group (TRT01P; this variable is defined in ADSL, so with DM alone group by ARM).

Verify outputs are equivalent across all stacks.

Document your ease/difficulty in Reflection.md.

Example Outputs:

TRT01P | N | Mean Age | SD | Min | Max
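A possible Python version of the Week 1 summary is sketched below with pandas. The function name `summarize_age` and the demo rows are hypothetical; real data would be read from the XPT file instead. pandas computes the sample standard deviation (n - 1 denominator) by default, which is also the default in SAS PROC MEANS and R's `sd()`, so the three stacks should agree on that column.

```python
import pandas as pd


def summarize_age(df, by="TRT01P"):
    """Return N, mean, SD, min and max of AGE for each level of `by`."""
    summary = (
        df.groupby(by)["AGE"]
          .agg(N="count", Mean="mean", SD="std", Min="min", Max="max")
          .reset_index()
    )
    return summary.round({"Mean": 1, "SD": 2})


# Hypothetical rows for illustration only. Real data could be loaded with
# pd.read_sas("data/raw/ADSL.xpt", format="xport", encoding="utf-8").
demo = pd.DataFrame({
    "TRT01P": ["Placebo", "Placebo", "Placebo", "Active", "Active", "Active"],
    "AGE": [60, 70, 80, 55, 65, 75],
})
print(summarize_age(demo))
```

Groups are returned in sorted order, so the row order may differ from SAS or R output unless those are sorted the same way.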

Deliverables:

3 scripts (SAS, R, Python)

“Week1_Diagnostics” commit in GitHub.

Weeks 2–3 – TLF Reactivation

Goal: Produce formatted clinical outputs (Table, Listing, Figure) across stacks.

Outputs:

Table 1 – Demographics Summary

AGE, SEX, RACE by TRT01P

Listing 1 – Adverse Events

USUBJID, AETERM, AESEV, AESER, AESTDTC, AEENDTC

Figure 1 – AE Severity by Treatment (Bar Chart)

SAS:

Use PROC MEANS, PROC REPORT, ODS RTF.

R:

dplyr + gtsummary + gt for publication-ready tables.

Python:

pandas + plotly for figure output.

tabulate for text-style tables.

Analytical Refresh:

t-test / ANOVA for AGE

Chi-square for SEX vs TRT01P
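To make the chi-square check concrete, the sketch below computes the Pearson statistic by hand from two categorical columns: each cell contributes (observed - expected)² / expected, where expected = row total × column total / n. The function name and the sample counts are hypothetical. In practice `scipy.stats.chi2_contingency` with `correction=False` reports the same statistic together with a p-value.

```python
from collections import Counter


def chi_square_statistic(rows, cols):
    """Pearson chi-square statistic and degrees of freedom for two categorical lists."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    n = len(rows)
    observed = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: Placebo has 30 F / 20 M, Active has 20 F / 30 M.
sex = ["F"] * 30 + ["M"] * 20 + ["F"] * 20 + ["M"] * 30
trt = ["Placebo"] * 50 + ["Active"] * 50
stat, dof = chi_square_statistic(sex, trt)
print(round(stat, 2), dof)  # prints: 4.0 1
```

Every expected count in this example is 25, so each of the four cells contributes 1 and the statistic is 4 on 1 degree of freedom.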

Deliverables:

Matching outputs across stacks

Folder: outputs/week2_3/

Commit: Week3_TLF_Reactivation_Complete
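Checking that outputs match across stacks is easier with a small comparison helper than by eye. The sketch below assumes each stack's summary has been exported and loaded into a nested dictionary keyed by treatment group; the function name, tolerances, and numbers are hypothetical. A tolerance is used because floating-point results from SAS, R, and Python rarely agree to the last digit.

```python
import math


def compare_summaries(reference, candidate, rel_tol=1e-9, abs_tol=1e-8):
    """Compare two {group: {statistic: value}} summaries and list every mismatch."""
    problems = []
    for key in sorted(set(reference) | set(candidate)):
        if key not in reference or key not in candidate:
            problems.append(f"{key}: present in only one summary")
            continue
        for stat, ref_value in reference[key].items():
            cand_value = candidate[key].get(stat)
            if cand_value is None:
                problems.append(f"{key}.{stat}: missing in candidate")
            elif not math.isclose(ref_value, cand_value,
                                  rel_tol=rel_tol, abs_tol=abs_tol):
                problems.append(f"{key}.{stat}: {ref_value} != {cand_value}")
    return problems


# Hypothetical numbers standing in for summaries exported by two stacks.
sas_summary = {"Placebo": {"N": 3, "Mean": 70.0, "SD": 10.0}}
r_summary = {"Placebo": {"N": 3, "Mean": 70.0000000001, "SD": 10.0}}
print(compare_summaries(sas_summary, r_summary))  # prints: []
```

An empty list means the two summaries agree within tolerance; any entries can be pasted into the validation log.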

PHASE 2: Integration (Weeks 4–6)

Objective: Derive ADaM datasets from SDTM and validate transformations across stacks.

Weeks 4–5 – SDTM to ADaM Derivation

Goal: Demonstrate dataset lineage and variable consistency.

Tasks:

Convert pilot SDTM datasets (DM, AE) into structured CSV via Python.

Create ADSL (subject-level) and ADAE (event-level) in SAS and R.

Develop config.json defining derivation metadata (e.g., source variable mappings).

Write a Python script to read this config and auto-generate ADaM skeletons.
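One way the config-driven skeleton step could look is sketched below. The config layout (`source`, `variables`, `from`) and the function `build_skeleton` are assumptions made for this example, not a format the project prescribes. Variables with a source mapping are copied; variables without one are created empty so the derivation can be filled in later.

```python
import json

import pandas as pd

# Hypothetical config.json content; the key names are assumptions of this sketch.
CONFIG_TEXT = """
{
  "ADSL": {
    "source": "DM",
    "variables": {
      "STUDYID": {"from": "STUDYID"},
      "USUBJID": {"from": "USUBJID"},
      "AGE":     {"from": "AGE"},
      "SEX":     {"from": "SEX"},
      "TRT01P":  {"from": "ARM"},
      "SAFFL":   {"from": null}
    }
  }
}
"""


def build_skeleton(config, dataset, sources):
    """Build an ADaM skeleton: copy mapped source columns, leave unmapped ones empty."""
    spec = config[dataset]
    source = sources[spec["source"]]
    out = pd.DataFrame(index=source.index)
    for target, rule in spec["variables"].items():
        origin = rule.get("from")
        if origin is None:
            out[target] = pd.NA  # derivation still to be written
        elif origin not in source.columns:
            raise KeyError(f"{dataset}.{target}: source column {origin!r} not found")
        else:
            out[target] = source[origin]
    return out


# Hypothetical two-subject DM extract for illustration.
dm = pd.DataFrame({
    "STUDYID": ["PILOT01", "PILOT01"],
    "USUBJID": ["01-001", "01-002"],
    "AGE": [63, 71],
    "SEX": ["F", "M"],
    "ARM": ["Placebo", "Active"],
})
adsl = build_skeleton(json.loads(CONFIG_TEXT), "ADSL", {"DM": dm})
print(list(adsl.columns))
```

Because the variable order comes from the config, the same file can later drive the SAS and R derivations and the Define-XML metadata.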

Deliverables:

SDTM and ADaM datasets under data/sdtm/ and data/adam/

config.json metadata file

Validation report in docs/validation_log.md

Commit: Week5_SDTM_ADaM_Integration

Week 6 – Analytical Validation

Goal: Implement automated validation between ADaM and SDTM outputs.

Tasks:

Compute AE incidence rates (% subjects with AE).

Compare results across stacks (SAS vs R vs Python).

Build unit tests:

Python: pytest

R: testthat

SAS: simple %assert macro
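The incidence calculation and its pytest check might look like the sketch below. The function name `ae_incidence` and the test data are hypothetical. The denominator comes from ADSL rather than ADAE, because subjects without any adverse event do not appear in ADAE at all.

```python
import pandas as pd


def ae_incidence(adsl, adae, by="TRT01P"):
    """Percent of subjects in each group with at least one AE record."""
    denominators = adsl.groupby(by)["USUBJID"].nunique()
    numerators = (
        adae[["USUBJID"]].drop_duplicates()
        .merge(adsl[["USUBJID", by]], on="USUBJID", how="inner")
        .groupby(by)["USUBJID"].nunique()
        .reindex(denominators.index, fill_value=0)  # groups with no AEs get 0
    )
    return (100 * numerators / denominators).round(1)


def test_ae_incidence():
    # Hypothetical subjects: one of two Placebo subjects and both Active subjects have AEs.
    adsl = pd.DataFrame({
        "USUBJID": ["001", "002", "003", "004"],
        "TRT01P": ["Placebo", "Placebo", "Active", "Active"],
    })
    adae = pd.DataFrame({
        "USUBJID": ["001", "001", "003", "004"],
        "AETERM": ["HEADACHE", "NAUSEA", "RASH", "RASH"],
    })
    result = ae_incidence(adsl, adae)
    assert result["Placebo"] == 50.0
    assert result["Active"] == 100.0
```

Saved under tests/ with a `test_` prefix in the file name, the function is collected automatically when `pytest` runs.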

Deliverables:

AE summaries in /outputs/

Validation logs

Passing unit tests

Commit: Week6_Validation_Complete

PHASE 3: Automation (Weeks 7–9)

Objective: Transform scripts into a reusable automation system.

Week 7 – Metadata-Driven Architecture

Goal: Make code fully configuration-driven.

Tasks:

Create a Python driver script that reads metadata (config.yaml or config.json).

Build templates using jinja2 for generating dataset code dynamically.

Enable command-line execution:

python -m fhir2cdisc --input data/raw --output data/sdtm

Implement error handling and log messages.
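A minimal sketch of the driver follows, using `argparse` and `logging` from the standard library. The `--input` and `--output` options come from the command shown above; the `--config` option, its default, and the exit code 2 for failures are assumptions of this example.

```python
import argparse
import json
import logging
from pathlib import Path

log = logging.getLogger("fhir2cdisc")


def build_parser():
    parser = argparse.ArgumentParser(prog="fhir2cdisc")
    parser.add_argument("--input", required=True, type=Path,
                        help="folder with source datasets")
    parser.add_argument("--output", required=True, type=Path,
                        help="folder for generated datasets")
    # Hypothetical option: the section only states that a config file is read.
    parser.add_argument("--config", type=Path, default=Path("config.json"),
                        help="metadata file")
    return parser


def main(argv=None):
    """Parse arguments, load the config and return a process exit code."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    if not args.input.is_dir():
        log.error("input folder not found: %s", args.input)
        return 2
    try:
        config = json.loads(args.config.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        log.error("cannot read config %s: %s", args.config, exc)
        return 2
    args.output.mkdir(parents=True, exist_ok=True)
    log.info("loaded %d dataset definition(s) from %s", len(config), args.config)
    # Dataset generation would be dispatched here, one call per config entry.
    return 0
```

For `python -m fhir2cdisc` to work, this code would sit in `fhir2cdisc/__main__.py` and end with `raise SystemExit(main())`. Returning an exit code instead of calling `sys.exit` inside `main` keeps the function easy to call from pytest.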

Deliverables:

CLI fhir2cdisc functional

Config-driven output generation
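The jinja2 templating task from this week could be approached as in the sketch below, which renders a SAS DATA step from a list of variable mappings. The template text, the `adam` and `sdtm` library names, and the function `render_sas` are all hypothetical; only `jinja2.Template` and its `render` method are taken from the library itself.

```python
from jinja2 import Template

# Hypothetical template: one assignment per (target, origin) pair.
SAS_TEMPLATE = Template(
    "data adam.{{ dataset }};\n"
    "  set sdtm.{{ source }};\n"
    "{% for target, origin in mappings %}"
    "  {{ target }} = {{ origin }};\n"
    "{% endfor %}"
    "  keep {{ keep }};\n"
    "run;\n"
)


def render_sas(dataset, source, mappings, passthrough=()):
    """Render SAS code that derives `dataset` from `source`.

    mappings    -- list of (target, origin) pairs turned into assignments
    passthrough -- source variables kept unchanged
    """
    keep = " ".join(list(passthrough) + [target for target, _ in mappings])
    return SAS_TEMPLATE.render(dataset=dataset, source=source,
                               mappings=mappings, keep=keep)


code = render_sas("adsl", "dm", [("TRT01P", "ARM")],
                  passthrough=["STUDYID", "USUBJID", "AGE", "SEX"])
print(code)
```

The same mapping list can feed a second template that emits R or Python code, which is what keeps the three stacks aligned with a single source of metadata.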

Week 8 – Continuous Integration

Goal: Add reproducibility checks.

Tasks:

Create .github/workflows/ci.yml:

Runs Python unit tests (pytest)

Runs R checks (testthat)

Validate outputs automatically on commit.

Capture test logs as build artifacts.
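Automatic output validation on commit can be as simple as comparing file checksums against a stored manifest, which a CI step then runs after the pipeline. The sketch below shows that idea; it is one possible approach rather than the project's prescribed method, and all function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file, read in chunks so large outputs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file path (relative to folder) to its SHA-256 digest."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): file_digest(p)
        for p in sorted(folder.rglob("*")) if p.is_file()
    }


def changed_outputs(folder, manifest_path):
    """Return files whose digest differs from, or is absent in, the stored manifest."""
    expected = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    actual = build_manifest(folder)
    return sorted(
        name for name in set(expected) | set(actual)
        if expected.get(name) != actual.get(name)
    )
```

The manifest file should live outside the folder it describes. Byte-level comparison is strict: outputs that embed a run timestamp, as RTF or HTML reports often do, will show up as changed even when the numbers are identical, so it is best suited to CSV or other deterministic outputs.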

Deliverables:

Passing CI pipeline on GitHub Actions

Commit: Week8_CI_Workflow_Built

Week 9 – Containerization (Optional Stretch)

Goal: Make the pipeline reproducible on any machine that can run Docker.

Tasks:

Write a Dockerfile that installs Python, R, SASPy (optional), and your project.

Validate that docker run reproduces outputs.

Deliverables:

Docker image buildable locally

Commit: Week9_Docker_Ready

PHASE 4: Demonstration (Weeks 10–12)

Objective: Build final TLFs, generate Define-XML, and publish publicly.

Week 10 – Define-XML & Dataset-JSON Integration

Goal: Demonstrate standards automation.

Tasks:

Generate a basic define.xml from your SDTM/ADaM metadata.

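from lxml import etree

# Minimal placeholder only: a conformant Define-XML file uses an ODM root element with CDISC namespaces.
root = etree.Element("Define")
etree.SubElement(root, "StudyName").text = "CDISC Pilot Study"
etree.ElementTree(root).write("data/define.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8")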

Cross-check metadata against datasets.

Validate Dataset-JSON using dsjson.
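The cross-check between metadata and datasets can be sketched as a comparison of declared variable names against actual column names. In the example below, `crosscheck` is a hypothetical helper and both dictionaries hold made-up values; with pandas the column lists would come from `list(df.columns)`.

```python
def crosscheck(metadata, datasets):
    """Compare declared variables with actual columns; return a list of findings."""
    findings = []
    for name, declared in metadata.items():
        if name not in datasets:
            findings.append(f"{name}: dataset missing")
            continue
        actual = list(datasets[name])
        for var in declared:
            if var not in actual:
                findings.append(f"{name}.{var}: declared in metadata but absent from data")
        for var in actual:
            if var not in declared:
                findings.append(f"{name}.{var}: present in data but not declared")
    return findings


# Hypothetical metadata and column lists for illustration.
metadata = {"DM": ["STUDYID", "USUBJID", "AGE", "SEX"], "AE": ["USUBJID", "AETERM"]}
datasets = {"DM": ["STUDYID", "USUBJID", "AGE", "ARM"]}
for finding in crosscheck(metadata, datasets):
    print(finding)
```

The findings can be appended to the validation log in /docs/, and an empty list can serve as the pass condition in a unit test.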

Deliverables:

define.xml

dataset-metadata.json

Validation log in /docs/

Commit: Week10_Standards_Integration

Weeks 11–12 – Final Integration & Publication

Goal: Package and publish the entire pipeline.

Tasks:

Run full pipeline end-to-end (SDTM → ADaM → TLF).

Ensure reproducibility across SAS, R, Python.

Create final documentation in /docs/final_summary.md.

Write LinkedIn article: “Building a Metadata-Driven Clinical Programming Pipeline”

Publish GitHub repo publicly.

Deliverables:

Public repo with README.md, images, and sample outputs.

LinkedIn publication under Dr. CliniData.

Optional 2-min video demo.

Commit: Final_Integration_Complete

🎯 Final Outcomes & Deliverables

| Type | Deliverable | Demonstrates |
| --- | --- | --- |
| Code | SAS, R, Python scripts | Multilingual clinical programming fluency |
| Data | SDTM & ADaM datasets | Standards transformation logic |
| Automation | Python CLI fhir2cdisc | Engineering design & metadata control |
| Validation | Unit tests, QC logs | Reproducibility & QA rigor |
| Outputs | TLFs (RTF, HTML, PNG) | Analytical equivalence |
| Documentation | README, define.xml, lineage diagram | Standards compliance |
| DevOps | CI/CD workflow | Engineering discipline |
| Visibility | GitHub repo + LinkedIn article | Professional credibility |

🧩 After-Project Extensions (Post Week 12)

Once complete, extend your system with:

Real FHIR → SDTM mapping (Patient, Observation).

Full Define-XML writer/reader automation.

Advanced analytics (Kaplan-Meier survival analysis, logistic regression).

R Shiny or Dash dashboard for interactive visualization.

🧠 End Result

By Week 12 you will have:

Hands-on fluency restored across SAS, R, and Python.
Validated SDTM → ADaM → TLF pipeline, metadata-driven.
Automated reproducibility system (CLI + CI/CD).
A public GitHub project + technical article proving cross-stack competence.

This is your proof-of-competence artifact — a complete clinical data science pipeline integrating programming, standards, and automation in one demonstrable system.
