
CasmirO-Source/Sequential-Clinical-Prediction

Mortality Risk and Length of Stay Prediction: Sequential Modeling

This repository contains my personal contribution to a bioinformatics machine learning project focused on clinical outcome prediction. The task involved using synthetic patient data, inspired by MIMIC-III and MIMIC-IV, to assess the risk of in-hospital mortality and predict the length of hospital stay.

While the broader project explored various methods, this repository focuses on the sequential modeling approach I developed to capture the longitudinal nature of patient history.

Skills Obtained

  • ETL Pipeline Development | Data Standardisation & Normalisation | Time-Series & Sequential Modelling | Clinical Feature Engineering
  • Predictive Performance Diagnostics | Handling Class Imbalance | Model Threshold Optimisation
  • TensorFlow | scikit-learn for Machine Learning | Data Visualisation
    

Project Overview

The dataset consists of 5,000 synthetic patient records spread across three main tables: demographic information, hospital admissions, and ICD diagnosis codes. My specific workflow involved:

  • Binary Classification: Predicting the hospital_expire_flag to identify in-hospital mortality risk.
  • Regression: Predicting the length of stay (LOS) for each patient.
  • Demographic Analysis: Investigating prediction performance across different ethnic groups.

Data Preprocessing and Feature Engineering

flowchart TD
    subgraph Raw_Data [Raw Clinical Tables]
        P[Patients Table]
        A[Admissions Table]
        D[Diagnoses Table]
        M[Merged Table]
        P ---> M
        A ---> M
        D ---> M
    end

    Raw_Data --> Integration[Data Integration: Join on Subject ID & Admission ID]
    
    subgraph Transformation [Feature Engineering]
        Integration --> Harmonise[ICD Harmonisation: ICD-9 to ICD-10 Mapping]
        Integration --> Temporal[Temporal Features: Calculate Length of Stay]
        Integration --> Ethnicity[Ethnicity Categorisation: Grouping into 4 Categories]
    end
    
    Harmonise --> Sequential[Sequential Encoding: Ordering by Admission Timestamps]
    Temporal --> Sequential
    Ethnicity --> Sequential
    
    Sequential --> Output[Final Structured Tensor for RNN/LSTM]

The preprocessing pipeline was designed to transform raw clinical data into a structured format suitable for recurrent neural networks. Key steps included:

  • Data Integration: Merging the patient, admission, and diagnosis tables into a single unified dataset using patient and admission identifiers.
  • ICD Harmonisation: Standardising a mix of ICD-9 and ICD-10 diagnosis codes into a consistent ICD-10 format.
  • Temporal Features: Calculating the length of stay by extracting the difference between admission and discharge timestamps.
  • Ethnicity Categorisation: Grouping self-reported race data into four primary categories: Black, White, Asian, and Other to ensure a robust performance analysis.
  • Sequential Encoding: Organising ICD codes by their sequence number within each admission to create a unique patient journey.
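
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration on toy data, not the contents of preprocessing.py: the column names (`subject_id`, `hadm_id`, `admittime`, `dischtime`, `ethnicity`, `seq_num`, `icd_code`) and the helpers `group_ethnicity`, `build_admission_table` and `encode_sequences` are assumptions made for the example. The ICD-9 to ICD-10 mapping is left out because it depends on a lookup table not described here.

```python
import pandas as pd

# Toy stand-ins for the three tables. All column names are assumptions.
patients = pd.DataFrame({
    "subject_id": [1, 2],
    "ethnicity": ["BLACK/AFRICAN AMERICAN", "HISPANIC OR LATINO"],
})
admissions = pd.DataFrame({
    "subject_id": [1, 2],
    "hadm_id": [10, 20],
    "admittime": pd.to_datetime(["2020-01-01 08:00", "2020-02-01 12:00"]),
    "dischtime": pd.to_datetime(["2020-01-03 08:00", "2020-02-01 18:00"]),
    "hospital_expire_flag": [0, 1],
})
diagnoses = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [10, 10, 20],
    "seq_num": [2, 1, 1],
    "icd_code": ["I10", "E119", "J189"],
})


def group_ethnicity(raw: str) -> str:
    """Collapse a free-text label into Black, White, Asian or Other."""
    text = raw.upper()
    for label in ("BLACK", "WHITE", "ASIAN"):
        if label in text:
            return label.title()
    return "Other"


def build_admission_table(patients, admissions, diagnoses):
    """Join the tables and derive one row per admission."""
    merged = admissions.merge(patients, on="subject_id", how="left")
    # Length of stay in hours from the admission and discharge timestamps.
    stay = merged["dischtime"] - merged["admittime"]
    merged["los_hours"] = stay.dt.total_seconds() / 3600.0
    merged["ethnicity_group"] = merged["ethnicity"].map(group_ethnicity)
    # Order codes by sequence number, then collect one list per admission.
    sequences = (
        diagnoses.sort_values(["hadm_id", "seq_num"])
        .groupby("hadm_id")["icd_code"]
        .agg(list)
        .rename("icd_sequence")
        .reset_index()
    )
    return merged.merge(sequences, on="hadm_id", how="left")


def encode_sequences(seqs, max_len):
    """Map codes to integer ids (0 is reserved for padding), then pad or truncate."""
    vocab = {c: i + 1 for i, c in enumerate(sorted({c for s in seqs for c in s}))}
    padded = [
        [vocab[c] for c in s][:max_len] + [0] * max(0, max_len - len(s))
        for s in seqs
    ]
    return vocab, padded


table = build_admission_table(patients, admissions, diagnoses)
vocab, padded = encode_sequences(table["icd_sequence"].tolist(), max_len=3)
```

The padded integer lists can then be stacked into the array that a recurrent model consumes, with 0 acting as the padding value.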

Model Development

I implemented a Long Short-Term Memory (LSTM) network to handle the longitudinal data. This architecture was chosen for its ability to capture dependencies within a sequence of clinical events, such as a patient's diagnosis history.
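
As a rough sketch of what such a network can look like in TensorFlow/Keras, the example below builds a small LSTM with two output heads, one for mortality and one for length of stay. It is not the architecture in sequential_model.py: the function name `build_model`, the layer sizes, the loss choices and the vocabulary size are all assumptions, and it uses a plain single-direction LSTM without any attention layer.

```python
import numpy as np
import tensorflow as tf


def build_model(vocab_size: int, max_len: int,
                embed_dim: int = 32, units: int = 64) -> tf.keras.Model:
    """Two-headed LSTM over padded ICD code ids (hyperparameters are assumed)."""
    codes = tf.keras.Input(shape=(max_len,), dtype="int32", name="icd_codes")
    # Index 0 is reserved for padding, so the embedding masks it out.
    x = tf.keras.layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True)(codes)
    x = tf.keras.layers.LSTM(units)(x)
    # Classification head: probability of in-hospital death.
    mortality = tf.keras.layers.Dense(1, activation="sigmoid", name="mortality")(x)
    # Regression head: length of stay in hours.
    los = tf.keras.layers.Dense(1, name="los_hours")(x)
    model = tf.keras.Model(inputs=codes, outputs=[mortality, los])
    model.compile(
        optimizer="adam",
        loss={"mortality": "binary_crossentropy", "los_hours": "mse"},
    )
    return model


model = build_model(vocab_size=100, max_len=20)
# Random code ids stand in for real encoded admissions.
batch = np.random.randint(1, 101, size=(4, 20), dtype=np.int32)
mortality_prob, los_pred = model.predict(batch, verbose=0)
```

Sharing one recurrent encoder between the two heads is one common way to set up a multi-task model; separate models per task would be an equally valid design.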

The model's performance was evaluated using the following metrics:

  • AUC-ROC
  • Precision and Recall
  • F1-score
  • R^2 Score
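
Precision, recall and F1 all depend on the probability threshold used to turn a predicted risk into a yes/no flag. The sketch below shows one simple way to compare thresholds on toy data. The helper names `precision_recall_f1` and `best_f1_threshold`, the candidate thresholds and the choice of F1 as the selection criterion are assumptions for illustration, not the procedure used in this repository.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Compute precision, recall and F1 for one decision threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1-score."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])


# Toy labels (1 = died in hospital) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.6, 0.2, 0.7, 0.65]

best = best_f1_threshold(y_true, y_prob, candidates=[0.25, 0.5, 0.75])
precision, recall, f1 = precision_recall_f1(y_true, y_prob, best)
```

With a rare outcome such as in-hospital mortality, a lower threshold usually raises recall at the cost of precision, so the criterion should reflect which kind of error matters more.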

Results and Key Findings

The sequential model achieved an overall AUC-ROC of 0.92, indicating strong discrimination between patients who died in hospital and those who survived. Recall was high at 0.93, but precision was much lower at 0.17, so most of the patients flagged as high risk were false positives. The model predicted the length of stay with a mean error of 127.9 hours (roughly 5.3 days).

What these results mean

I wanted to see if the model could "read" a patient's medical history like a story to predict their outcome. The high AUC-ROC and recall show it is very good at spotting patients who are at high risk, though it also flags many patients who go on to survive, erring on the side of caution. It also provides a rough estimate of how many hours a patient might need to stay in hospital.
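
To make the trade-off concrete, the short calculation below takes the rounded precision and recall reported above and derives two quantities from them: the F1-score they imply, and the number of false alarms per correctly flagged patient. The variable names are illustrative, and because the inputs are rounded the outputs are approximate.

```python
# Rounded figures reported in this README.
precision = 0.17
recall = 0.93

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# For every correctly flagged patient, this many flagged patients survived.
false_alarms_per_true_alert = (1 - precision) / precision

print(f"Implied F1-score: {f1:.2f}")
print(f"False alarms per true alert: {false_alarms_per_true_alert:.1f}")
```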

Repository Contents

  • Data: Contains my synthetic data
  • visualisations: Contains my visualisations and evaluation metrics of the LSTM model
  • preprocessing.py: Includes all logic for data merging, ICD-10 unification, ethnicity grouping, and feature scaling.
  • sequential_model.py: Contains the LSTM architecture, training loops, and evaluation code.

About

I created a multi-task bidirectional LSTM with attention, designed for longitudinal clinical prediction. The model was trained on synthetic MIMIC data and processes a patient's history to predict their risk of in-hospital death and the length of their stay.
