A new evolutive clustering algorithm (eSIMBA) for Active Module Identification in p-value Attributed Temporal Biological Networks
This repository contains the code needed to run the eSIMBA algorithm and reproduce the results presented in [CITATION TO BE ADDED]. It also contains an optimized version of the SIMBA algorithm presented in Singlan, N., Abou Choucha, F. & Pasquier, C. A new Similarity Based Adapted Louvain Algorithm (SIMBA) for active module identification in p-value attributed biological networks. Sci Rep 15, 11360 (2025). https://doi.org/10.1038/s41598-025-95749-6.
- Installation
- Reproduce paper results
- Usage on your own data
- Repository structure
- Data and format
- Requirements
- Cite
- Contact
This algorithm requires Python 3.11. You also need to install the packages described in the Requirements section
and listed in the environment.yml file.
To run the algorithm, clone this repository, install the required packages, and run the code using the commands described in the Reproduce paper results and Usage on your own data sections.
Several launch scripts used in the paper are available in the scripts/ folder. They are POSIX shell scripts.
To launch all the provided scripts using a POSIX-compliant terminal (e.g., Linux, macOS, Git Bash on Windows) and
reproduce all the results of the paper using screen sessions, you can use the following command:

    ./scripts/launch_all.sh

The main entry point is main.py.
To run eSIMBA (dynamic) on your own data, you can use the following command:

    python main.py -i .\data\YourData.zip

To run SIMBA (static) on your own data, you can use the following command:

    python main.py -i .\data\YourStaticGraph.npz --static

The exact command-line arguments are defined in utils/utils.py. Below is a precise list with descriptions and default
values.
- Input parameters
  - `-i, --input_path` (str, required): Path to the input data. For dynamic runs (eSIMBA) this must be a zip file containing one `.npz` file per graph/time-step. For static runs (`--static`) a single `.npz` file containing the graph is accepted. In the dynamic setting, files should follow the naming template `*_Time_{time_step}.npz`, where `{time_step}` is an integer.
  - `-gt, --ground_truth` (flag): Indicates that ground-truth community labels are present in the data files.
  - `-n, --name` (flag): Indicates that node names are present in the data files.
- Output parameters
  - `-r, --results_path` (str, default: `./Results`): Path to the output results directory.
  - `-o, --output_prefix` (str, default: `output`): Prefix for output files. Note: the prefix must not contain path separators or dots.
  - `-d, --draw` (flag): Save plots of community evolution over time. This option is only valid when running eSIMBA (dynamic).
- Execution parameters
  - `-s, --static` (flag): Run SIMBA (static) instead of eSIMBA (dynamic).
  - `-p, --parallelism` (flag): Enable parallelism. Parallelism is only available for eSIMBA (dynamic) runs.
  - `--debug` (flag): Enable debug mode. When enabled, verbose output is automatically enabled and parallelism is disabled.
  - `--debug_limit` (int, default: 5): Limit on the number of graphs to process in debug mode. Can only be set in debug mode.
  - `--shuffle` (flag): Shuffle graphs before processing in debug mode. Only available when debug mode is enabled.
- Algorithm parameters
  - `-t, --threshold` (float, default: 0.05): Threshold used during the filtering phase (applies to node p-values). Must be between 0 and 1.
  - `-mc, --min_community_size` (int, default: 5): Minimum size for a community to be considered valid; smaller communities are discarded.
  - `-mwc, --min_window_community_size` (int, default: 3): Minimum size for a community inside a time window (eSIMBA); smaller communities are discarded.
  - `-a, --approach` (str, default: `adaptive`): Clustering approach. Accepted values: `adaptive`, `best`, `worst`. The `adaptive` approach switches between `best` and `worst` depending on the number of edges to process in the current iteration.
- Verbosity
  - `-v, --verbose` (flag): Enable verbose output. Note: enabling `--debug` forces verbose on.
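As a rough illustration of what the minimum community size means, the sketch below drops every community with fewer members than the minimum. It is a minimal sketch and not the repository's implementation: the function name `discard_small_communities` and the representation of a community as a set of node identifiers are assumptions made for this example.

```python
def discard_small_communities(communities, min_community_size=5):
    """Keep only communities with at least `min_community_size` nodes.

    `communities` is assumed to be an iterable of sets of node identifiers
    (a hypothetical representation chosen for this sketch).
    """
    return [c for c in communities if len(c) >= min_community_size]


candidates = [
    {"n1", "n2", "n3", "n4", "n5", "n6"},  # 6 nodes: kept
    {"n7", "n8"},                          # 2 nodes: discarded
    {"n9", "n10", "n11", "n12", "n13"},    # 5 nodes: kept (size equals the minimum)
]
kept = discard_small_communities(candidates)
print(len(kept))  # 2
```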
- When running eSIMBA (dynamic), `--input_path` must be a `.zip` file containing `.npz` files; when running SIMBA (static), a single `.npz` file is accepted.
- If `--static` is used, `--draw` and `--parallelism` are not allowed.
- `--debug` automatically sets `--verbose` and disables `--parallelism`.
- `--shuffle` and `--debug_limit` may only be used when `--debug` is enabled.
- `--threshold` must be in [0, 1].
- `--approach` must be one of `adaptive`, `best`, `worst`.
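The constraints above can be sketched with `argparse`. This is only an illustration of how the documented rules interact, not the code in utils/utils.py: the helper names `build_parser` and `check_constraints` are hypothetical, only a subset of the options is covered, and using `None` as the parser default for `--debug_limit` (to detect whether the user set it) is an assumption of this sketch.

```python
import argparse


def build_parser():
    """Illustrative parser covering a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-i", "--input_path", required=True)
    parser.add_argument("-s", "--static", action="store_true")
    parser.add_argument("-d", "--draw", action="store_true")
    parser.add_argument("-p", "--parallelism", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--debug", action="store_true")
    # None makes it possible to tell whether the option was given;
    # the documented default of 5 would be applied afterwards.
    parser.add_argument("--debug_limit", type=int, default=None)
    parser.add_argument("--shuffle", action="store_true")
    parser.add_argument("-t", "--threshold", type=float, default=0.05)
    parser.add_argument("-a", "--approach", default="adaptive",
                        choices=("adaptive", "best", "worst"))
    return parser


def check_constraints(args):
    """Return the violated constraints and apply the --debug side effects."""
    errors = []
    if args.static and (args.draw or args.parallelism):
        errors.append("--draw and --parallelism are not allowed with --static")
    if not args.static and not args.input_path.lower().endswith(".zip"):
        errors.append("dynamic runs expect a .zip archive as --input_path")
    if not args.debug and (args.shuffle or args.debug_limit is not None):
        errors.append("--shuffle and --debug_limit require --debug")
    if not 0.0 <= args.threshold <= 1.0:
        errors.append("--threshold must be in [0, 1]")
    if args.debug:
        # Debug mode forces verbose output and turns parallelism off.
        args.verbose = True
        args.parallelism = False
    return errors


args = build_parser().parse_args(["-i", "data/YourData.zip", "--debug", "-p"])
errors = check_constraints(args)
print(errors, args.verbose, args.parallelism)

bad = build_parser().parse_args(["-i", "data/YourStaticGraph.npz", "--static", "--draw"])
bad_errors = check_constraints(bad)
print(bad_errors)
```

In the first call, `--debug` overrides `-p`, so the run ends up verbose and sequential with no reported violation; the second call reports one violation because `--draw` is combined with `--static`.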
- `main.py`: main script to run SIMBA and eSIMBA
- `clustering/`: similarity-based clustering code (SIMBA)
- `graph/`: graph, node and cluster classes
- `utils/`: helper utilities (reading data, execution, metrics, saving, plotting)
- `scripts/`: provided launch scripts used in the paper
- `data/`: example datasets (zip / npz)
The input data should be provided in .npz format. For eSIMBA (dynamic), the input should be a .zip file containing
one .npz file per graph/time-step. Each .npz file should follow the naming template *_Time_{time_step}.npz where
{time_step} is an integer representing the time step of the graph. For SIMBA (static), a single .npz file containing
the graph is accepted. Multiple graphs can be provided by using multiple .npz files in a .zip archive.
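To check that an archive follows the naming template before launching a run, something along these lines may help. It is a minimal sketch using only the standard library; the helper name `npz_members_by_time_step` and the file names in the example are hypothetical, and the sketch only orders members by time step without distinguishing between several graphs stored in the same archive.

```python
import io
import re
import zipfile

# Matches the documented template *_Time_{time_step}.npz and captures the integer.
TIME_STEP_PATTERN = re.compile(r"_Time_(\d+)\.npz$")


def npz_members_by_time_step(archive):
    """Return (time_step, member_name) pairs sorted by time step."""
    members = []
    for name in archive.namelist():
        match = TIME_STEP_PATTERN.search(name)
        if match:
            members.append((int(match.group(1)), name))
    return sorted(members)


# Build a small in-memory archive with empty placeholder files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("MyGraph_Time_10.npz", "MyGraph_Time_2.npz",
                 "MyGraph_Time_1.npz", "notes.txt"):
        archive.writestr(name, b"")

with zipfile.ZipFile(buffer) as archive:
    ordered = npz_members_by_time_step(archive)

print(ordered)
# [(1, 'MyGraph_Time_1.npz'), (2, 'MyGraph_Time_2.npz'), (10, 'MyGraph_Time_10.npz')]
```

Parsing the time step as an integer keeps step 10 after step 2, which a plain alphabetical sort of the file names would not.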
Each graph is stored as a .npz file. Expected keys are:
- `adjacency_data`, `adjacency_indices`, `adjacency_indptr`, `adjacency_shape`: sparse adjacency matrix components
- `feature_data`, `feature_indices`, `feature_indptr`, `feature_shape`: sparse feature matrix components (e.g., p-values)
- `labels` (optional): ground-truth community labels
- `label_indices` (optional): indices for ground-truth labels
- `evolution` (optional): evolution ground-truth (for dynamic graphs)
- `evolution_indices` (optional): indices for evolution ground-truth (for dynamic graphs)
- `name` (optional): node names
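The key names resemble the `data`, `indices` and `indptr` attributes of a SciPy compressed sparse matrix, so a file with the required keys could plausibly be written and read as sketched below. This is a sketch under assumptions, not the repository's reader: it assumes the CSR layout and a single feature column of p-values, and the helper names `graph_to_npz_arrays` and `npz_arrays_to_graph` are hypothetical. The optional keys are left out.

```python
import io

import numpy as np
from scipy import sparse


def graph_to_npz_arrays(adjacency, features):
    """Split two matrices into the arrays expected under the documented keys."""
    arrays = {}
    for prefix, matrix in (("adjacency", adjacency), ("feature", features)):
        csr = sparse.csr_matrix(matrix)  # assumption: CSR layout
        arrays[f"{prefix}_data"] = csr.data
        arrays[f"{prefix}_indices"] = csr.indices
        arrays[f"{prefix}_indptr"] = csr.indptr
        arrays[f"{prefix}_shape"] = np.array(csr.shape)
    return arrays


def npz_arrays_to_graph(arrays):
    """Rebuild the adjacency and feature matrices from the stored arrays."""
    rebuilt = []
    for prefix in ("adjacency", "feature"):
        rebuilt.append(sparse.csr_matrix(
            (arrays[f"{prefix}_data"],
             arrays[f"{prefix}_indices"],
             arrays[f"{prefix}_indptr"]),
            shape=tuple(int(n) for n in arrays[f"{prefix}_shape"]),
        ))
    return tuple(rebuilt)


# Toy graph: a path on 4 nodes, with one p-value per node.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
p_values = np.array([[0.01], [0.20], [0.03], [0.50]])

# Write to an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
np.savez(buffer, **graph_to_npz_arrays(adjacency, p_values))
buffer.seek(0)

with np.load(buffer) as stored:
    restored_adjacency, restored_features = npz_arrays_to_graph(stored)

print(restored_adjacency.shape, restored_features.shape)  # (4, 4) (4, 1)
```

Note that a sparse matrix does not store zeros explicitly, so a p-value of exactly 0 would need special handling in this layout.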
Most of the provided datasets are .zip archives containing one or more .npz files (one .npz per graph / per time
step for dynamic datasets). Below is the list of archives currently available in the data/ folder.
- [dynamic_albert-barabasi_batra-SAE-VM.zip](data/dynamic_albert-barabasi_batra-SAE-VM.zip): 1000 graphs; 1000 nodes; 3 time steps; 10 communities; 10 nodes per community.
- [dynamic_evolutive_values_albert-barabasi_batra-SAE-VM.zip](data/dynamic_evolutive_values_albert-barabasi_batra-SAE-VM.zip): 1000 graphs; 1000 nodes; 3 time steps; 10 communities; 10 nodes per community.
- [subset_dynamic_rewire-PPI-STRING-Human-robinson.zip](data/subset_dynamic_rewire-PPI-STRING-Human-robinson.zip): 25 graphs; 16201 nodes; 3 time steps; 50 communities; 10 nodes per community. WARNING: this dataset is a subset of the full `dynamic_rewire-PPI-STRING-Human-robinson.zip` dataset used in the paper, containing only 25 graphs instead of 100. It is provided for testing, as the full dataset is too large to be added.
- [dynamic_rewire_with_modules-PPI-IntAct-Human-robinson.zip](data/dynamic_rewire_with_modules-PPI-IntAct-Human-robinson.zip): 100 graphs; 5784 nodes; 3 time steps; 50 communities; 10 nodes per community.
- [subset_dynamic_evolutive_values_rewire-PPI-STRING-Human-robinson.zip](data/subset_dynamic_evolutive_values_rewire-PPI-STRING-Human-robinson.zip): 25 graphs; 16201 nodes; 3 time steps; 50 communities; 10 nodes per community. WARNING: this dataset is a subset of the full `dynamic_evolutive_values_rewire-PPI-STRING-Human-robinson.zip` dataset used in the paper, containing only 25 graphs instead of 100. It is provided for testing, as the full dataset is too large to be added.
- [dynamic_evolutive_values_rewire_with_modules-PPI-IntAct-Human-robinson.zip](data/dynamic_evolutive_values_rewire_with_modules-PPI-IntAct-Human-robinson.zip): 100 graphs; 5784 nodes; 3 time steps; 50 communities; 10 nodes per community.
- Data from Viggars, M. R., Sutherland, H., Lanmüller, H., Schmoll, M., Bijak, M., & Jarvis, J. C. (2023). Adaptation of
  the transcriptional response to resistance exercise over 4 weeks of daily training. FASEB journal : official
  publication of the Federation of American Societies for Experimental Biology, 37(1),
  e22686. https://doi.org/10.1096/fj.202201418R :
  - [Rat_Daily_Training.zip](data/Rat_Daily_Training.zip): 4 time steps.
- Data from Dumas, S. J., Meta, E., Borri, M., Goveia, J., Rohlenova, K., Conchinha, N. V., Falkenberg, K., Teuwen, L.
  A., de Rooij, L., Kalucka, J., Chen, R., Khan, S., Taverna, F., Lu, W., Parys, M., De Legher, C., Vinckier, S.,
  Karakach, T. K., Schoonjans, L., Lin, L., … Carmeliet, P. (2020). Single-Cell RNA Sequencing Reveals Renal Endothelium
  Heterogeneity and Metabolic Adaptation to Water Deprivation. Journal of the American Society of Nephrology : JASN,
  31(1), 118–138. https://doi.org/10.1681/ASN.2019080832 :
  - [cRECs.zip](data/cRECs.zip): 3 time steps.
  - [gRECs.zip](data/gRECs.zip): 4 time steps.
  - [mRECs.zip](data/mRECs.zip): 4 time steps.
- Data from Pasquier, C., & Robichon, A. (2021). Temporal and sequential order of nonoverlapping gene networks unraveled
  in mated female Drosophila. Life science alliance, 5(2), e202101119. https://doi.org/10.26508/lsa.202101119 :
  - [Drosophila_Amine_DESeq.zip](data/Drosophila_Amine_DESeq.zip): 3 time steps.
  - [Drosophila_Amine_edgeR.zip](data/Drosophila_Amine_edgeR.zip): 3 time steps.
The code is written in Python 3.11.
The following packages are required (also listed in environment.yml):
- numpy
- scipy
- scikit-learn
- matplotlib
- networkx
- psutil
- pylint~=3.0
- openpyxl
- networkit
- plotly
- kaleido-core
- scikit-network
- pyunionfind
TO ADD

If you have any questions, please contact me at singlan.nina@gmail.com