This repository contains the code and necessary datasets for a novel bioinformatics framework that employs causality to identify genes potentially driving Long-COVID. Our approach integrates multi-omics data to find causal genes that influence the risk and prevention of Long-COVID, as well as the stability of its associated biological networks.
The goal of this project is to develop an advanced causal inference framework for multi-omics data in order to:
- Identify genes that may increase or decrease the risk of developing Long-COVID.
- Determine the role of these genes in maintaining the stability of biological networks affected by Long-COVID.
- Provide a ranked list of candidate genes that could be targeted for therapeutic intervention or further research.
Out framework integrates data from multiple omics layers (e.g., genomics, transcriptomics, proteomics) and calculate a score to identify key network drivers genes for Long-COVID.
Specifically, we:
Integrated Framework for Long-COVID Causal Gene Identification
This framework combines advanced techniques (Mendelian Randomization: Mt_Robin (Multi-tissue transcriptome-wide Mendelian Randomization method ROBust to INvalid instrumental variables) Mr.MtRobin Repository Reference and Control Theory: Controllability Analysis (CA) Reference) to identify key genes associated with Long-COVID by integrating data from GWAS, eQTL, PPI, and RNA-seq datasets. The focus is on identifying protein-coding genes that play crucial roles in disease mechanisms, applying rigorous selection criteria to ensure the relevance and robustness of the results.
-
Candidate SNP Identification: The framework begins by identifying candidate Single Nucleotide Polymorphisms (SNPs) that serve as instrumental variables (IVs) for each gene. The selection process is applied to all genes, with the final analysis focusing exclusively on protein-coding genes.
-
Threshold Criteria:
- LD Threshold: SNPs with a Linkage Disequilibrium (LD) correlation greater than 0.5 with another SNP in the pool are excluded.
- P-value Threshold 1: Only SNPs with a p-value less than 0.001 are considered significant.
- Number of Tissues Threshold: SNPs showing significant effects in at least one tissue are retained.
- P-value Threshold 2: Genes with a p-value and False Discovery Rate (FDR) less than or equal to 0.05 are retained.
-
Long-COVID Network: The framework evaluates the roles of these genes within the human protein-coding gene network, focusing on their degree (K), in-degree (Kin), and out-degree (Kout). The genes are classified into specific categories based on their indispensability and criticality in network control:
- Indispensable and Critical Genes: The analysis highlights genes that are crucial for network control, subdividing them into:
- Type I Critical Genes: Genes whose removal increases the number of driver nodes (N_D), indicating their essential role in maintaining control within the network.
- Type II Critical Genes: Genes that must always be controlled, belonging to all driver node sets, and characterized by zero in-degree (Kin=0).
- Indispensable and Critical Genes: The analysis highlights genes that are crucial for network control, subdividing them into:
-
Final Score Calculation:
For each gene, a weighted score is computed based on its contributions to risk and prevention, adjusted by a user-defined parameter (alpha). The final score is calculated by combining the normalized Mendelian Randomization Score (MR_Score_norm) and Control Theory Score (CT_Score_norm).We developed the following equation to identify the risk/preventive causal and network-critical genes for Long-COVID:
-
Summarize and rank genes:
Genes are ranked based on their Final Scores, with additional annotations indicating their effect (Risk/Preventive) and their criticality in biological networks, specifically highlighting whether they are classified as indispensable Type-I or Type-II critical genes in the context of Long-COVID. -
Generate a detailed report:
The top candidate genes are outputted to a CSV file, with a summary of key statistics provided. The report includes:- Total number of genes analyzed.
- Number of selected genes based on the Final Score.
- Number of Risk genes, Preventive genes, and Critical genes (Type-I and Type-II) for both the entire dataset and the selected subset.
The integrated framework filters and ranks genes based on their contributions to Long-COVID pathogenesis, with a special focus on those that are both indispensable and critical within the protein-coding gene network. This comprehensive approach ensures that the final set of identified genes is not only statistically significant but also biologically essential for network control, providing valuable insights into potential therapeutic targets for Long-COVID.
Indispensable Genes:
These are protein-coding genes crucial for maintaining the controllability of the Long-COVID network. If removed, they increase the number of driver nodes ((ND)) required to control the network. Indispensable genes are further classified into two categories, Type-I and Type-II, which reflect their different roles and impact across various network states.
Type-I Genes:
These genes are defined by their direct effect on the controllability of the network, as measured by their influence on the number of driver nodes ((ND)):
- Critical: The removal of a critical Type-I gene increases (ND), meaning the network requires more driver nodes for control when this gene is absent.
- Redundant: The removal of a redundant Type-I gene decreases (ND), meaning fewer driver nodes are needed to control the network when this gene is absent.
- Ordinary: The removal of an ordinary Type-I gene does not change (ND), meaning it has no effect on the number of driver nodes required for network control.
Type-II Genes:
These genes are classified based on their participation in driver node sets, which are sets of nodes that must be controlled to maintain network controllability:
- Critical: A critical Type-II gene must be present in all driver node sets, meaning it always plays a role in network control.
- Redundant: A redundant Type-II gene is absent from all driver node sets, meaning it is never required for network control.
- Ordinary: An ordinary Type-II gene is present in some but not all driver node sets, meaning it plays a role in network control in certain contexts but not universally.
Driver Nodes ((ND)):
Driver nodes are key nodes in the network that must be controlled to ensure the network's overall functionality. An increase in the number of driver nodes suggests a greater level of complexity or difficulty in controlling the network, while a decrease indicates a reduction in the effort required for control.
Code/: Contains the main code files:Main_Code.md: A markdown file describing the overall code structure and functionalities.generate_report_function.R: The primary R script containing the function to calculate Final Scores, rank genes, and generate a summary report.
Figures/: Directory containing visual assets such as workflow diagrams, causal score equations, and app screenshots used in the documentation.SM/: Supplementary Materials folder containing 9 subfolders with additional data, scripts, and resources related to the project.
- Windows 11 Enterprise and WSL-Ubuntu-22.04
- GADI Supercomputer (used for the First Part of the Project - MR):
- NCPUs: 48
- Memory: 500GB
- Job Specifications:
- Job 1: 48 CPUs, 391.58GB memory used
- Job 2: 36 CPUs, 77.45GB memory used
- Job 3: 36 CPUs, 9.67GB memory used
- Software Stack:
- R version 4.2.2 (via Docker)
- Libraries:
libcurl4-openssl-dev,libssl-dev,libxml2-dev,libicu-dev,liblapack-dev,libblas-dev,gfortran,nloptr,lme4,data.table,mvtnorm.
- R version 4.2.0 (R Project)
- biomaRt (version 2.58.0) (biomaRt)
- CancerSubtypes (CancerSubtypes)
- dplyr (version 1.1.4) (dplyr)
- tidyverse (version 2.0.0) (tidyverse)
- data.table (version 1.14.10) (data.table)
- lme4 (version 1.1-35.1) (lme4)
- ControllabilityAnalysis (version 1.0) (CA)
- httr (version 1.4.7) (httr)
- jsonlite (version 1.8.8) (jsonlite)
- VennDiagram (version 1.7.3) (VennDiagram)
- ggplot2 (version 3.4.4) (ggplot2)
- cowplot (version 1.1.2) (cowplot)
- visNetwork (version 2.1.2) (visNetwork)
- clusterProfiler (version 4.10.0) (clusterProfiler)
- ReactomePA (version 1.46.0) (ReactomePA)
- org.Hs.eg.db (version 3.18.0) (org.Hs.eg.db)
- Cytoscape (version 3.10.1) (Cytoscape)
The complete code implementation and all corresponding outputs are available in these Google Colab notebooks for your reference.
Option 1: Dynamic Causal Genes Visualization for Long-COVID Network
You can dynamically explore shifts in gene rankings using our interactive tool available at Dynamic Causal Genes Visualization in a Long-COVID Network. This tool allows users to adjust the parameter alpha, enabling a detailed examination of how genes transition from being network-critical (alpha = 0) to focusing more on disease risk or prevention (alpha = 1).
By adjusting (\alpha), you can customize insights into gene roles based on their specific research objectives, whether those objectives involve network dynamics, genetic risk factors, or a combination of both.
Option 2: Clone the Repository to Modify Results
-
Clone this repository to your local machine:
git clone https://github.com/SindyPin/Causal-Multiomics-Method
-
The input data is provided in the
input_data/directory. You do not need to supply any additional data. -
Load the R script and run the
generate_reportfunction with the provided dataset and desired parameters:General Results:
- Number of Risk genes: 406
- Number of Preventive genes: 402
- Number of TypeI Critical genes: 1334
- Number of TypeII Critical genes: 387
- Total number of genes: 6963
- Check
alpha=1.0:
source("generate_report_function.R")
generate_report(file_path = "input_data/mt_robin_CT.csv", alpha = 1, num_genes_to_print = 16, output_file = "results/gene_report.csv")Output:
- Total Number of Protein-coding genes in the dataset: 6339
- Total Number of Selected Protein-coding genes: 16
- Number of Selected Risk genes: 9
- Number of Selected Preventive genes: 7
- Number of Selected TypeI Critical genes: 0
- Number of Selected TypeII Critical genes: 0
| Rank | gene_name | MR_Score_norm | CT_Score_norm | Final_Score | Gene_Effect | Critical_Gene |
|---|---|---|---|---|---|---|
| 1 | MORN4 | 1.0000000 | 0 | 1.0000000 | Risk | Not_Critical |
| 2 | CDC26 | 0.8778914 | 0 | 0.8778914 | Preventive | Not_Critical |
| 3 | EIF5A | 0.6141077 | 0 | 0.6141077 | Risk | Not_Critical |
| 4 | VWDE | 0.4532133 | 0 | 0.4532133 | Risk | Not_Critical |
| 5 | GMPPB | 0.3882238 | 0 | 0.3882238 | Risk | Not_Critical |
| 6 | NDUFA6 | 0.2860991 | 0 | 0.2860991 | Risk | Not_Critical |
| 7 | BNIP1 | 0.2628553 | 0 | 0.2628553 | Risk | Not_Critical |
| 8 | CERS4 | 0.2280826 | 0 | 0.2280826 | Preventive | Not_Critical |
| 9 | ADAT1 | 0.2254065 | 0 | 0.2254065 | Preventive | Not_Critical |
| 10 | C19orf18 | 0.2252227 | 0 | 0.2252227 | Risk | Not_Critical |
| 11 | BOLA2 | 0.1787183 | 0 | 0.1787183 | Preventive | Not_Critical |
| 12 | CDA | 0.1774153 | 0 | 0.1774153 | Preventive | Not_Critical |
| 13 | MORN3 | 0.1429947 | 0 | 0.1429947 | Risk | Not_Critical |
| 14 | ITPRID1 | 0.1419264 | 0 | 0.1419264 | Preventive | Not_Critical |
| 15 | BTN3A1 | 0.1407982 | 0 | 0.1407982 | Preventive | Not_Critical |
| 16 | ATOSA | 0.0642437 | 0 | 0.0642437 | Risk | Not_Critical |
- Check
alpha=0.75:
source("generate_report_function.R")
generate_report(file_path = "input_data/mt_robin_CT.csv", alpha = 0.75, num_genes_to_print = 10, output_file = "results/gene_report.csv")Output:
- Total Number of Protein-coding genes in the dataset: 6339
- Total Number of Selected Protein-coding genes: 16
- Number of Selected Risk genes: 7
- Number of Selected Preventive genes: 3
- Number of Selected TypeI Critical genes: 6
- Number of Selected TypeII Critical genes: 0
| Rank | gene_name | MR_Score_norm | CT_Score_norm | Final_Score | Gene_Effect | Critical_Gene |
|---|---|---|---|---|---|---|
| 1 | MORN4 | 1.0000000 | 0.0000000 | 0.7500000 | Risk | Not_Critical |
| 2 | CDC26 | 0.8778914 | 0.0000000 | 0.6584185 | Preventive | Not_Critical |
| 3 | EIF5A | 0.6141077 | 0.0000000 | 0.4605808 | Risk | Not_Critical |
| 4 | VWDE | 0.4532133 | 0.0000000 | 0.3399100 | Risk | Not_Critical |
| 5 | GMPPB | 0.3882238 | 0.0000000 | 0.2911678 | Risk | Not_Critical |
| 6 | TP53 | 0.0000000 | 1.0000000 | 0.2500000 | No_Significant | TypeI |
| 7 | CREBBP | 0.0000000 | 0.9130435 | 0.2282609 | No_Significant | TypeI |
| 8 | EP300 | 0.0000000 | 0.9030100 | 0.2257525 | No_Significant | TypeI |
| 9 | NDUFA6 | 0.2860991 | 0.0000000 | 0.2145743 | Risk | Not_Critical |
| 10 | YWHAG | 0.0000000 | 0.8428094 | 0.2107023 | No_Significant | TypeI |
| 11 | BNIP1 | 0.2628553 | 0.0000000 | 0.1971415 | Risk | Not_Critical |
| 12 | SMAD3 | 0.0000000 | 0.7525084 | 0.1881271 | No_Significant | TypeI |
| 13 | GRB2 | 0.0000000 | 0.7023411 | 0.1755853 | No_Significant | TypeI |
| 14 | CERS4 | 0.2280826 | 0.0000000 | 0.1710619 | Preventive | Not_Critical |
| 15 | ADAT1 | 0.2254065 | 0.0000000 | 0.1690549 | Preventive | Not_Critical |
| 16 | C19orf18 | 0.2252227 | 0.0000000 | 0.1689170 | Risk | Not_Critical |
- Check
alpha=0.5:
source("generate_report_function.R")
generate_report(file_path = "input_data/mt_robin_CT.csv", alpha = 0.5, num_genes_to_print = 10, output_file = "results/gene_report.csv")Output:
- Total Number of Protein-coding genes in the dataset: 6339
- Total Number of Selected Protein-coding genes: 16
- Number of Selected Risk genes: 2
- Number of Selected Preventive genes: 1
- Number of Selected TypeI Critical genes: 13
- Number of Selected TypeII Critical genes: 0
| Rank | gene_name | MR_Score_norm | CT_Score_norm | Final_Score | Gene_Effect | Critical_Gene |
|---|---|---|---|---|---|---|
| 1 | TP53 | 0.0000000 | 1.0000000 | 0.5000000 | No_Significant | TypeI |
| 2 | MORN4 | 1.0000000 | 0.0000000 | 0.5000000 | Risk | Not_Critical |
| 3 | CREBBP | 0.0000000 | 0.9130435 | 0.4565217 | No_Significant | TypeI |
| 4 | EP300 | 0.0000000 | 0.9030100 | 0.4515050 | No_Significant | TypeI |
| 5 | CDC26 | 0.8778914 | 0.0000000 | 0.4389457 | Preventive | Not_Critical |
| 6 | YWHAG | 0.0000000 | 0.8428094 | 0.4214047 | No_Significant | TypeI |
| 7 | SMAD3 | 0.0000000 | 0.7525084 | 0.3762542 | No_Significant | TypeI |
| 8 | GRB2 | 0.0000000 | 0.7023411 | 0.3511706 | No_Significant | TypeI |
| 9 | SRC | 0.0000000 | 0.6521739 | 0.3260870 | No_Significant | TypeI |
| 10 | EIF5A | 0.6141077 | 0.0000000 | 0.3070538 | Risk | Not_Critical |
| 11 | AR | 0.0000000 | 0.5986622 | 0.2993311 | No_Significant | TypeI |
| 12 | ESR1 | 0.0000000 | 0.5819398 | 0.2909699 | No_Significant | TypeI |
| 13 | RB1 | 0.0000000 | 0.5652174 | 0.2826087 | No_Significant | TypeI |
| 14 | CSNK2A1 | 0.0000000 | 0.5518395 | 0.2759197 | No_Significant | TypeI |
| 15 | SMAD2 | 0.0000000 | 0.5384615 | 0.2692308 | No_Significant | TypeI |
| 16 | CDKN1A | 0.0000000 | 0.5284281 | 0.2642140 | No_Significant | TypeI |
- Check
alpha=0.25:
source("generate_report_function.R")
generate_report(file_path = "input_data/mt_robin_CT.csv", alpha = 0.25, num_genes_to_print = 10, output_file = "results/gene_report.csv")Output:
- Total Number of Protein-coding genes in the dataset: 6339
- Total Number of Selected Protein-coding genes: 16
- Number of Selected Risk genes: 0
- Number of Selected Preventive genes: 0
- Number of Selected TypeI Critical genes: 16
- Number of Selected TypeII Critical genes: 0
| Rank | gene_name | MR_Score_norm | CT_Score_norm | Final_Score | Gene_Effect | Critical_Gene |
|---|---|---|---|---|---|---|
| 1 | TP53 | 0 | 1.0000000 | 0.7500000 | No_Significant | TypeI |
| 2 | CREBBP | 0 | 0.9130435 | 0.6847826 | No_Significant | TypeI |
| 3 | EP300 | 0 | 0.9030100 | 0.6772575 | No_Significant | TypeI |
| 4 | YWHAG | 0 | 0.8428094 | 0.6321070 | No_Significant | TypeI |
| 5 | SMAD3 | 0 | 0.7525084 | 0.5643813 | No_Significant | TypeI |
| 6 | GRB2 | 0 | 0.7023411 | 0.5267559 | No_Significant | TypeI |
| 7 | SRC | 0 | 0.6521739 | 0.4891304 | No_Significant | TypeI |
| 8 | AR | 0 | 0.5986622 | 0.4489967 | No_Significant | TypeI |
| 9 | ESR1 | 0 | 0.5819398 | 0.4364548 | No_Significant | TypeI |
| 10 | RB1 | 0 | 0.5652174 | 0.4239130 | No_Significant | TypeI |
| 11 | CSNK2A1 | 0 | 0.5518395 | 0.4138796 | No_Significant | TypeI |
| 12 | SMAD2 | 0 | 0.5384615 | 0.4038462 | No_Significant | TypeI |
| 13 | CDKN1A | 0 | 0.5284281 | 0.3963211 | No_Significant | TypeI |
| 14 | MAPK1 | 0 | 0.5250836 | 0.3938127 | No_Significant | TypeI |
| 15 | FYN | 0 | 0.5117057 | 0.3837793 | No_Significant | TypeI |
| 16 | HDAC1 | 0 | 0.5050167 | 0.3787625 | No_Significant | TypeI |
- Check
alpha=0.00:
source("generate_report_function.R")
generate_report(file_path = "input_data/mt_robin_CT.csv", alpha = 0.00, num_genes_to_print = 10, output_file = "results/gene_report.csv")Output:
- Total Number of Protein-coding genes in the dataset: 6339
- Total Number of Selected Protein-coding genes: 16
- Number of Selected Risk genes: 0
- Number of Selected Preventive genes: 0
- Number of Selected TypeI Critical genes: 16
- Number of Selected TypeII Critical genes: 0
| Rank | gene_name | MR_Score_norm | CT_Score_norm | Final_Score | Gene_Effect | Critical_Gene |
|---|---|---|---|---|---|---|
| 1 | TP53 | 0 | 1.0000000 | 1.0000000 | No_Significant | TypeI |
| 2 | CREBBP | 0 | 0.9130435 | 0.9130435 | No_Significant | TypeI |
| 3 | EP300 | 0 | 0.9030100 | 0.9030100 | No_Significant | TypeI |
| 4 | YWHAG | 0 | 0.8428094 | 0.8428094 | No_Significant | TypeI |
| 5 | SMAD3 | 0 | 0.7525084 | 0.7525084 | No_Significant | TypeI |
| 6 | GRB2 | 0 | 0.7023411 | 0.7023411 | No_Significant | TypeI |
| 7 | SRC | 0 | 0.6521739 | 0.6521739 | No_Significant | TypeI |
| 8 | AR | 0 | 0.5986622 | 0.5986622 | No_Significant | TypeI |
| 9 | ESR1 | 0 | 0.5819398 | 0.5819398 | No_Significant | TypeI |
| 10 | RB1 | 0 | 0.5652174 | 0.5652174 | No_Significant | TypeI |
| 11 | CSNK2A1 | 0 | 0.5518395 | 0.5518395 | No_Significant | TypeI |
| 12 | SMAD2 | 0 | 0.5384615 | 0.5384615 | No_Significant | TypeI |
| 13 | CDKN1A | 0 | 0.5284281 | 0.5284281 | No_Significant | TypeI |
| 14 | MAPK1 | 0 | 0.5250836 | 0.5250836 | No_Significant | TypeI |
| 15 | FYN | 0 | 0.5117057 | 0.5117057 | No_Significant | TypeI |
| 16 | HDAC1 | 0 | 0.5050167 | 0.5050167 | No_Significant | TypeI |
- The function will output a ranked list of genes and a summary of the results to the specified output file.
- R version 4.0 or later
- R libraries:
dplyr,knitr,pander
Install the required libraries in R:
install.packages(c("dplyr", "knitr", "pander"))Applying the same framework, users can modify the input data to analyze other diseases and apply other MR and CT methods to explore different conditions or datasets.
We welcome contributions from the community! If you would like to contribute to this project, please fork the repository and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or issues, please open an issue in this repository or contact the project lead at sindypin2005@gmail.com.


