Mortality Risk and Length of Stay Prediction: Sequential Modeling

This repository contains my personal contribution to a bioinformatics machine learning project focused on clinical outcome prediction. The task involved using synthetic patient data, inspired by MIMIC-III and MIMIC-IV, to assess the risk of in-hospital mortality and predict the length of hospital stay.

While the broader project explored various methods, this repository focuses on the sequential modeling approach I developed to capture the longitudinal nature of patient history.

Skills Obtained

ETL Pipeline Development | Data Standardisation & Normalisation | Time-Series & Sequential Modelling | Clinical Feature Engineering

Predictive Performance Diagnostics | Handling Class Imbalance | Model Threshold Optimisation

TensorFlow | scikit-learn | Data Visualisation

Project Overview

The dataset consists of 5,000 synthetic patient records spread across three main tables: demographic information, hospital admissions, and ICD diagnosis codes. My specific workflow involved:

Binary Classification: Predicting the hospital_expire_flag to identify in-hospital mortality risk.
Regression: Predicting the length of stay (LOS) for each patient.
Demographic Analysis: Investigating prediction performance across different ethnic groups.

Data Preprocessing and Feature Engineering

flowchart TD
    subgraph Raw_Data [Raw Clinical Tables]
        P[Patients Table]
        A[Admissions Table]
        D[Diagnoses Table]
        M[Merged Table]
        P ---> M
        A ---> M
        D ---> M
    end

    Raw_Data --> Integration[Data Integration: Join on Subject ID & Admission ID]
    
    subgraph Transformation [Feature Engineering]
        Integration --> Harmonise[ICD Harmonisation: ICD-9 to ICD-10 Mapping]
        Integration --> Temporal[Temporal Features: Calculate Length of Stay]
        Integration --> Ethnicity[Ethnicity Categorisation: Grouping into 4 Categories]
    end
    
    Harmonise --> Sequential[Sequential Encoding: Ordering by Admission Timestamps]
    Temporal --> Sequential
    Ethnicity --> Sequential
    
    Sequential --> Output[Final Structured Tensor for RNN/LSTM]

The preprocessing pipeline was designed to transform raw clinical data into a structured format suitable for recurrent neural networks. Key steps included:

Data Integration: Merging the patient, admission, and diagnosis tables into a single unified dataset using patient and admission identifiers.
ICD Harmonisation: Standardising a mix of ICD-9 and ICD-10 diagnosis codes into a consistent ICD-10 format.
Temporal Features: Calculating the length of stay by extracting the difference between admission and discharge timestamps.
Ethnicity Categorisation: Grouping self-reported race data into four primary categories (Black, White, Asian, and Other) so that prediction performance can be compared robustly across groups.
Sequential Encoding: Organising ICD codes by their sequence number within each admission to create a unique patient journey.
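
The steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration rather than the code in preprocessing.py: the function names, the tuple layout `(hadm_id, seq_num, icd_code, icd_version)`, the timestamp format, the prefix rules for ethnicity, and the two-entry ICD-9 to ICD-10 crosswalk are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Toy crosswalk entries for illustration only (assumption); a real pipeline
# would load a complete ICD-9 to ICD-10 mapping table.
ICD9_TO_ICD10 = {"4019": "I10", "25000": "E119"}
# Hypothetical prefix rules; the actual grouping logic is not shown in this README.
ETHNICITY_PREFIXES = {"BLACK": "Black", "WHITE": "White", "ASIAN": "Asian"}


def harmonise_code(code: str, version: int) -> str:
    """Return an ICD-10 code; unmapped ICD-9 codes are kept but tagged."""
    if version == 10:
        return code
    return ICD9_TO_ICD10.get(code, "UNMAPPED_" + code)


def length_of_stay_hours(admittime: str, dischtime: str) -> float:
    """Difference between discharge and admission timestamps, in hours."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dischtime, fmt) - datetime.strptime(admittime, fmt)
    return delta.total_seconds() / 3600.0


def group_ethnicity(raw: str) -> str:
    """Collapse a free-text ethnicity label into one of four categories."""
    upper = raw.strip().upper()
    for prefix, group in ETHNICITY_PREFIXES.items():
        if upper.startswith(prefix):
            return group
    return "Other"


def build_sequences(diagnoses, max_len):
    """Order codes by seq_num within each admission, map them to integer ids,
    and right-pad with 0 to a fixed length.

    diagnoses: iterable of (hadm_id, seq_num, icd10_code).
    """
    by_admission = defaultdict(list)
    for hadm_id, seq_num, code in diagnoses:
        by_admission[hadm_id].append((seq_num, code))

    vocab = {}  # code -> integer id; 0 is reserved for padding
    sequences = {}
    for hadm_id, items in by_admission.items():
        ordered = [code for _, code in sorted(items)]
        ids = [vocab.setdefault(code, len(vocab) + 1) for code in ordered][:max_len]
        sequences[hadm_id] = ids + [0] * (max_len - len(ids))
    return sequences, vocab


diagnoses = [  # (hadm_id, seq_num, icd_code, icd_version)
    (100, 2, "25000", 9),
    (100, 1, "I10", 10),
    (101, 1, "4019", 9),
]
harmonised = [(h, s, harmonise_code(c, v)) for h, s, c, v in diagnoses]
sequences, vocab = build_sequences(harmonised, max_len=4)
los = length_of_stay_hours("2130-01-01 08:00:00", "2130-01-03 20:00:00")
```

Reserving id 0 for padding lets a downstream embedding layer mask the padded positions, so short and long diagnosis lists can share one fixed-length tensor.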

Model Development

I implemented a Long Short-Term Memory (LSTM) network to handle the longitudinal data. This architecture was chosen for its ability to capture dependencies within a sequence of clinical events, such as a patient's diagnosis history.
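
For readers unfamiliar with this kind of architecture, here is a minimal Keras sketch of an LSTM over padded ICD-code ids. It is not the model in sequential_model.py: the two-headed layout (one sigmoid output for mortality, one linear output for length of stay), the layer sizes, the vocabulary size, and the function name `build_sequence_model` are all assumptions, and the repository may instead train separate models for each task.

```python
import numpy as np
import tensorflow as tf


def build_sequence_model(vocab_size: int, max_len: int,
                         embed_dim: int = 32, lstm_units: int = 64) -> tf.keras.Model:
    """Two-headed LSTM over padded ICD-code ids (0 is reserved for padding)."""
    codes = tf.keras.Input(shape=(max_len,), dtype="int32", name="icd_codes")
    # vocab_size + 1 embedding rows because id 0 is the padding token.
    x = tf.keras.layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True)(codes)
    # The LSTM reads the codes in order and returns its final hidden state.
    x = tf.keras.layers.LSTM(lstm_units)(x)
    mortality = tf.keras.layers.Dense(1, activation="sigmoid", name="mortality")(x)
    los_hours = tf.keras.layers.Dense(1, name="los_hours")(x)
    model = tf.keras.Model(inputs=codes, outputs=[mortality, los_hours])
    # One loss per output, in the same order as the outputs list.
    model.compile(optimizer="adam", loss=["binary_crossentropy", "mse"])
    return model


model = build_sequence_model(vocab_size=500, max_len=20)
batch = np.random.randint(1, 501, size=(4, 20))  # 4 fake admissions, 20 codes each
mortality_prob, los_pred = model(batch)
```

Setting `mask_zero=True` on the embedding tells the LSTM to skip padded positions, so admissions with few diagnoses are not diluted by trailing zeros.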

The model's performance was evaluated using the following metrics:

AUC-ROC
Precision and Recall
F1-score
R^2 Score
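
As a reference for how these metrics are defined, the following dependency-free sketch computes each of them on toy data. In practice a library such as scikit-learn would normally be used; the function names and example values here are illustrative assumptions, not output from the model.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = died in hospital)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half). Quadratic in the number of samples, fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r2_score(y_true, y_pred):
    """Coefficient of determination for the length-of-stay regression."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
precision, recall, f1 = classification_metrics([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
r2 = r2_score([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
```

AUC-ROC is computed from the raw scores and does not depend on a decision threshold, whereas precision, recall, and F1 all change when the threshold moves.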

Results and Key Findings

The sequential model achieved an overall AUC-ROC of 0.92, indicating strong discrimination between patients who died in hospital and those who survived. Recall was high at 0.93, but precision was low at 0.17. The model predicted the length of stay with a mean error of 127.9 hours.
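
To put the reported figures in context, the short calculation below derives what they imply. It uses only the rounded values quoted above, so the derived numbers are approximate; the variable names are illustrative.

```python
precision, recall = 0.17, 0.93   # reported values
mean_error_hours = 127.9         # reported length-of-stay error

# Harmonic mean of precision and recall.
implied_f1 = 2 * precision * recall / (precision + recall)
# False positives raised for every correctly flagged patient.
false_alarms_per_true_alert = (1 - precision) / precision
# The same error expressed in days.
mean_error_days = mean_error_hours / 24

print(round(implied_f1, 3))                   # 0.287
print(round(false_alarms_per_true_alert, 1))  # 4.9
print(round(mean_error_days, 1))              # 5.3
```

In other words, the model misses few deaths, but roughly five of every six patients it flags survive, and its length-of-stay estimates are off by a little over five days on average.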

What these results mean

I wanted to see if the model could "read" a patient's medical history like a story to predict their recovery. The high AUC-ROC and recall show it is very good at spotting patients who are at high risk, but the low precision means that most of the patients it flags go on to survive, so its alerts are best read as a precaution. It also provides a rough estimate of how many hours a patient might need to stay in hospital.

Repository Contents

Data: Contains my synthetic data
visualisations: Contains my visualisations and evaluation metrics for the LSTM model.
preprocessing.py: Includes all logic for data merging, ICD-10 unification, ethnicity grouping, and feature scaling.
sequential_model.py: Contains the LSTM architecture, training loops, and evaluation code.
