This repository contains my personal contribution to a bioinformatics machine learning project focused on clinical outcome prediction. The task involved using synthetic patient data, inspired by MIMIC-III and MIMIC-IV, to assess the risk of in-hospital mortality and predict the length of hospital stay.
While the broader project explored various methods, this repository focuses on the sequential modeling approach I developed to capture the longitudinal nature of patient history.
- ETL Pipeline Development | Data Standardisation & Normalisation | Time-Series & Sequential Modelling | Clinical Feature Engineering
- Predictive Performance Diagnostics | Handling Class Imbalance | Model Threshold Optimisation
- TensorFlow | scikit-learn for Machine Learning | Data Visualisation
The dataset consists of 5,000 synthetic patient records spread across three main tables: demographic information, hospital admissions, and ICD diagnosis codes. My specific workflow involved:
- Binary Classification: Predicting the `hospital_expire_flag` to identify in-hospital mortality risk.
- Regression: Predicting the length of stay (LOS) for each patient.
- Demographic Analysis: Investigating prediction performance across different ethnic groups.
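A minimal sketch of how a per-group comparison could be set up, assuming a results table with one row per admission. The column names (`ethnicity_group`, `y_true`, `y_pred`), the helper `recall_by_group`, and the choice of recall as the metric are all illustrative assumptions, not taken from the repository code.

```python
import pandas as pd

def recall_by_group(df, group_col="ethnicity_group", label_col="y_true", pred_col="y_pred"):
    """Recall per group: the share of true in-hospital deaths that were flagged."""
    # Keep only admissions where the patient actually died (positive class).
    positives = df[df[label_col] == 1]
    # Among positives, the mean of the 0/1 predictions equals recall.
    return positives.groupby(group_col)[pred_col].mean()

# Toy predictions, invented purely for illustration.
toy = pd.DataFrame({
    "ethnicity_group": ["Black", "Black", "White", "White", "White"],
    "y_true": [1, 1, 1, 1, 0],
    "y_pred": [1, 0, 1, 1, 1],
})
per_group = recall_by_group(toy)
print(per_group)
```

Large gaps between groups in a table like this would be a prompt to check group sizes first, since small groups give noisy estimates.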
flowchart TD
subgraph Raw_Data [Raw Clinical Tables]
P[Patients Table]
A[Admissions Table]
D[Diagnoses Table]
M[Merged Table]
P ---> M
A ---> M
D ---> M
end
Raw_Data --> Integration[Data Integration: Join on Subject ID & Admission ID]
subgraph Transformation [Feature Engineering]
Integration --> Harmonise[ICD Harmonisation: ICD-9 to ICD-10 Mapping]
Integration --> Temporal[Temporal Features: Calculate Length of Stay]
Integration --> Ethnicity[Ethnicity Categorisation: Grouping into 4 Categories]
end
Harmonise --> Sequential[Sequential Encoding: Ordering by Admission Timestamps]
Temporal --> Sequential
Ethnicity --> Sequential
Sequential --> Output[Final Structured Tensor for RNN/LSTM]
The preprocessing pipeline was designed to transform raw clinical data into a structured format suitable for recurrent neural networks. Key steps included:
- Data Integration: Merging the patient, admission, and diagnosis tables into a single unified dataset using patient and admission identifiers.
- ICD Harmonisation: Standardising a mix of ICD-9 and ICD-10 diagnosis codes into a consistent ICD-10 format.
- Temporal Features: Calculating the length of stay by extracting the difference between admission and discharge timestamps.
- Ethnicity Categorisation: Grouping self-reported race data into four primary categories: Black, White, Asian, and Other to ensure a robust performance analysis.
- Sequential Encoding: Organising ICD codes by their sequence number within each admission to create a unique patient journey.
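The steps above could be sketched in pandas roughly as follows. This is not the code from `preprocessing.py`: the column names follow MIMIC-IV conventions (`subject_id`, `hadm_id`, `admittime`, `dischtime`, `race`, `icd_code`, `icd_version`, `seq_num`) and are assumptions, the helper names are hypothetical, and the two-entry ICD lookup stands in for a full ICD-9 to ICD-10 crosswalk table.

```python
import pandas as pd

# Toy stand-ins for the three tables (assumed column names, invented values).
patients = pd.DataFrame({
    "subject_id": [1, 2],
    "race": ["BLACK/AFRICAN AMERICAN", "UNKNOWN"],
})
admissions = pd.DataFrame({
    "subject_id": [1, 2],
    "hadm_id": [10, 20],
    "admittime": pd.to_datetime(["2150-01-01 08:00", "2150-03-05 12:00"]),
    "dischtime": pd.to_datetime(["2150-01-03 08:00", "2150-03-05 18:00"]),
    "hospital_expire_flag": [0, 1],
})
diagnoses = pd.DataFrame({
    "hadm_id": [10, 10, 20],
    "seq_num": [2, 1, 1],
    "icd_code": ["4280", "I10", "25000"],
    "icd_version": [9, 10, 9],
})

# Illustrative lookup only; a real pipeline would load a complete crosswalk.
ICD9_TO_ICD10 = {"4280": "I509", "25000": "E119"}

def group_race(value) -> str:
    """Collapse a self-reported race string into Black, White, Asian, or Other."""
    upper = str(value).upper()
    for label in ("BLACK", "WHITE", "ASIAN"):
        if upper.startswith(label):
            return label.title()
    return "Other"

def build_sequences(patients, admissions, diagnoses):
    # Data integration: one row per admission with patient demographics attached.
    merged = admissions.merge(patients, on="subject_id", how="left")
    # Temporal feature: length of stay in hours.
    stay = merged["dischtime"] - merged["admittime"]
    merged["los_hours"] = stay.dt.total_seconds() / 3600
    # Ethnicity categorisation into four groups.
    merged["ethnicity_group"] = merged["race"].map(group_race)

    # ICD harmonisation: map ICD-9 codes to ICD-10, mark unmapped codes as UNK.
    dx = diagnoses.copy()
    dx["icd10"] = [
        ICD9_TO_ICD10.get(code, "UNK") if version == 9 else code
        for code, version in zip(dx["icd_code"], dx["icd_version"])
    ]
    # Sequential encoding: order codes by seq_num within each admission.
    sequences = (
        dx.sort_values(["hadm_id", "seq_num"])
          .groupby("hadm_id")["icd10"]
          .agg(list)
          .rename("icd_sequence")
          .reset_index()
    )
    return merged.merge(sequences, on="hadm_id", how="left")

features = build_sequences(patients, admissions, diagnoses)
print(features[["hadm_id", "los_hours", "ethnicity_group", "icd_sequence"]])
```

Matching on the start of the race string rather than on any substring avoids accidental hits, for example a label that merely contains the letters "ASIAN".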
I implemented a Long Short-Term Memory (LSTM) network to handle the longitudinal data. This architecture was chosen for its ability to capture dependencies within a sequence of clinical events, such as a patient's diagnosis history.
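For orientation, a small Keras sketch of an LSTM over padded ICD code sequences is shown below. It is an assumption-laden illustration rather than the architecture in `sequential_model.py`: the two output heads sharing one LSTM, the layer sizes, the vocabulary size, and the function name `build_lstm` are all hypothetical.

```python
import numpy as np
import tensorflow as tf

def build_lstm(vocab_size: int, max_len: int, embed_dim: int = 32, units: int = 64) -> tf.keras.Model:
    """LSTM with two heads: mortality probability and length of stay in hours."""
    codes = tf.keras.Input(shape=(max_len,), dtype="int32", name="icd_codes")
    # Index 0 is reserved for padding; mask_zero lets the LSTM ignore padded steps.
    x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(codes)
    # The final hidden state summarises the whole diagnosis sequence.
    x = tf.keras.layers.LSTM(units)(x)
    mortality = tf.keras.layers.Dense(1, activation="sigmoid", name="mortality")(x)
    los = tf.keras.layers.Dense(1, name="los_hours")(x)
    model = tf.keras.Model(inputs=codes, outputs=[mortality, los])
    model.compile(
        optimizer="adam",
        loss={"mortality": "binary_crossentropy", "los_hours": "mse"},
    )
    return model

# Two toy admissions, integer-encoded and right-padded with zeros to length 5.
batch = np.array([[3, 7, 0, 0, 0], [12, 4, 9, 1, 0]], dtype="int32")
model = build_lstm(vocab_size=50, max_len=5)
mortality_prob, los_pred = model(batch)
print(mortality_prob.shape, los_pred.shape)
```

Training two separate single-output models would be an equally reasonable design; the shared-encoder version is used here only to keep the sketch short.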
The model's performance was evaluated using the following metrics:
- AUC-ROC
- Precision and Recall
- F1-score
- R^2 Score
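These metrics are all available in scikit-learn. The sketch below shows one way they could be computed together; the function name `evaluate`, the 0.5 decision threshold, and the toy arrays are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.metrics import (
    f1_score,
    mean_absolute_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

def evaluate(y_true, y_prob, los_true, los_pred, threshold=0.5):
    """Classification metrics for mortality, regression metrics for length of stay."""
    # Turn predicted probabilities into hard 0/1 labels at the chosen threshold.
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "auc_roc": roc_auc_score(y_true, y_prob),  # uses probabilities, not labels
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "r2": r2_score(los_true, los_pred),
        "mae_hours": mean_absolute_error(los_true, los_pred),
    }

# Toy values, invented for illustration only.
scores = evaluate(
    y_true=[0, 0, 1, 1],
    y_prob=[0.1, 0.6, 0.4, 0.9],
    los_true=[24, 48, 72],
    los_pred=[30, 48, 66],
)
print(scores)
```

AUC-ROC is threshold-free, while precision, recall, and F1 all change when the threshold moves, which is why threshold choice matters for an imbalanced outcome such as mortality.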
The sequential model achieved an overall AUC-ROC of 0.92, indicating strong discrimination between patients who died in hospital and those who survived. Recall was high at 0.93, but precision was low at 0.17, meaning most patients flagged as high risk did not die in hospital. The model predicted the length of stay with a mean error of 127.9 hours.
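To put the reported precision and recall in context, the short calculation below derives two quantities from them. These are arithmetic consequences of the two reported figures, not additional results from the project, and the variable names are illustrative.

```python
# Reported values from the evaluation above.
precision = 0.17
recall = 0.93

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Each correctly flagged patient comes with (1 - precision) / precision false alarms.
false_alarms_per_true_positive = (1 - precision) / precision

print(f"Implied F1-score: {f1:.2f}")
print(f"False alarms per true positive: {false_alarms_per_true_positive:.1f}")
```

The harmonic mean is pulled towards the smaller of the two values, so a high recall cannot compensate for low precision in the F1-score.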
I wanted to see if the model could "read" a patient's medical history like a story to predict their recovery. The high AUC-ROC and recall show it is very good at spotting patients who are at high risk, though the low precision means it also flags many patients who go on to survive. It also provides a rough estimate of how many hours a patient might need to stay in hospital.
- Data: Contains my synthetic data
- visualisations: Contains my visualisations and evaluation metrics of the LSTM model
- preprocessing.py: Includes all logic for data merging, ICD-10 unification, ethnicity grouping, and feature scaling.
- sequential_model.py: Contains the LSTM architecture, training loops, and evaluation code.