This repository provides a pip-installable software package to run Tucan for methylation-based tumor classification in a clinical setting.
- Python >= 3.9
- Dependencies and version constraints are defined in pyproject.toml: https://github.com/UMCUGenetics/tucan/blob/main/pyproject.toml
- The software has been tested on Linux (Ubuntu 20.04) and macOS.
- No non-standard hardware is required (CPU sufficient; GPU optional).
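The Python requirement can be checked before installing. The following is a minimal sketch; `python_is_supported` is a hypothetical helper written for illustration and is not part of Tucan.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the requirements


def python_is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = sys.version.split()[0]
    if python_is_supported():
        print("Python version OK:", current)
    else:
        print("Python >= 3.9 is required, found:", current)
```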
Installation time: ~5–10 minutes on a standard desktop computer.
- Clone the repository and enter it
    git clone https://github.com/UMCUGenetics/tucan.git
    cd tucan
- Install the project in 'editable' mode
    pip install -e .
- Download the pretrained model from Hugging Face
    pip install -U huggingface_hub
    python -c "from src.tucan.download_model import get_model; print(get_model())"
- Change directory
    cd models
- Zip the model directory
    zip -r model.zip model
The model path returned by this command should be provided to the -m argument.
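On systems without the `zip` command, the archiving step can be done from Python instead. The sketch below assumes the downloaded model sits in a `model` folder inside `models`, as the commands above imply. `zip_model_dir` is a hypothetical helper, not a Tucan function.

```python
import shutil
from pathlib import Path


def zip_model_dir(models_dir, model_name="model"):
    """Create <models_dir>/<model_name>.zip containing the <model_name>/ folder.

    Intended to mirror `cd models` followed by `zip -r model.zip model`.
    """
    models_dir = Path(models_dir)
    if not (models_dir / model_name).is_dir():
        raise FileNotFoundError(f"no '{model_name}' directory in {models_dir}")
    archive = shutil.make_archive(
        base_name=str(models_dir / model_name),  # archive path without ".zip"
        format="zip",
        root_dir=str(models_dir),  # paths inside the zip are relative to this
        base_dir=model_name,       # only this folder is archived
    )
    return Path(archive)
```

Calling `zip_model_dir("models")` from the repository root would write `models/model.zip`, if the folder layout matches the assumption above.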
Usage: tucan [-h] [-i INPUT_FILE] [-m MODEL] [-c NUM_CPGS] [-o OUTPUT_FILE] [-s NUM_SAMPLINGS]
             [-f FILE_TYPE]

Options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        path to input file
  -m MODEL, --model MODEL
                        specify path to model zip you want to use.
  -c NUM_CPGS, --num_CpGs NUM_CPGS
                        specify the number of sampled CpG sites (default is to use all available sites).
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        path to output file
  -s NUM_SAMPLINGS, --num_samplings NUM_SAMPLINGS
                        specify the number of random samples of size num_CpGs. Default is 1 random
                        sampling.
  -f FILE_TYPE, --file_type FILE_TYPE
                        input file type 'bed' or 'csv'
Recommendation: For optimal performance, set -c between 10,000 and 20,000 CpG sites. Using too many CpG sites may cause the model to become overconfident and increase the likelihood of misclassification.
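To make the roles of -c and -s concrete, the sketch below draws several random subsets of CpG sites of a fixed size. It is only an illustration of the idea described in the options. It is not Tucan's internal code, and sampling without replacement is an assumption. The function name and the probe IDs are hypothetical.

```python
import random


def sample_cpg_subsets(cpg_ids, num_cpgs=None, num_samplings=1, seed=None):
    """Draw `num_samplings` random subsets of `num_cpgs` sites, without replacement.

    If num_cpgs is None or not smaller than the number of available sites,
    every subset simply contains all sites.
    """
    rng = random.Random(seed)
    sites = list(cpg_ids)
    if num_cpgs is None or num_cpgs >= len(sites):
        return [list(sites) for _ in range(num_samplings)]
    return [rng.sample(sites, num_cpgs) for _ in range(num_samplings)]


if __name__ == "__main__":
    available = [f"cg{i:08d}" for i in range(50_000)]  # hypothetical probe IDs
    subsets = sample_cpg_subsets(available, num_cpgs=10_000, num_samplings=3, seed=0)
    print([len(s) for s in subsets])
```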
Tucan accepts input in either BED or CSV format, containing the methylation calls.
An example BED file snippet is available here: data/bedExample.bed.
Methylation calls can be extracted from a Nanopore BAM file using modkit.
Detailed instructions for this process are available in the Sturgeon repository.
Note: The Sturgeon implementation determines methylation status at CpG sites based on the Illumina methylation array coordinates, including a ±25 bp window around each CpG site.
This means methylation calls are aggregated not only at the exact array position but also within this surrounding margin.
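The windowing idea can be sketched as follows. This is not Sturgeon's code. How the calls inside the window are combined is not specified here, so the mean used below is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def aggregate_calls_in_window(probe_pos, call_positions, call_values, margin=25):
    """Average the methylation calls lying within `margin` bp of a probe position.

    `call_positions` must be sorted ascending and refer to the same chromosome
    as the probe; `call_values` holds the matching calls (e.g. 0 or 1).
    Returns None when no call falls inside the window.
    """
    lo = bisect_left(call_positions, probe_pos - margin)
    hi = bisect_right(call_positions, probe_pos + margin)  # window is inclusive
    window = call_values[lo:hi]
    if not window:
        return None
    return sum(window) / len(window)


if __name__ == "__main__":
    positions = [90, 100, 120, 130]  # hypothetical call coordinates
    values = [1, 0, 1, 1]            # hypothetical methylation calls
    print(aggregate_calls_in_window(105, positions, values))
```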
A small example dataset is provided in the repository under demo_tucan/Data, along with the corresponding expected output in demo_tucan/Output.
tucan -i demo_tucan/Data/testSample1.merged_probes_methyl_calls.bed -m <path_to_model> -c 10000 -o demo_tucan/Output/OutcomeTestSample1.csv -s 1 -f csv
Expected runtime: ~1–5 minutes per sample on a standard CPU.
- Extract methylation calls from Nanopore BAM files (e.g. using modkit)
- Convert to BED or CSV format
- Run Tucan using the command above
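When several samples need to be processed, the last step above can be scripted. The sketch below assembles the command from the documented options only. The helper names and file paths are hypothetical, and the default of 10,000 CpG sites simply follows the recommendation given earlier.

```python
import shlex
import subprocess


def build_tucan_command(input_file, model, output_file,
                        num_cpgs=10000, num_samplings=1, file_type="bed"):
    """Assemble the tucan command line from the documented options."""
    return [
        "tucan",
        "-i", str(input_file),
        "-m", str(model),
        "-c", str(num_cpgs),
        "-o", str(output_file),
        "-s", str(num_samplings),
        "-f", file_type,
    ]


def run_tucan(**kwargs):
    """Run tucan; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tucan_command(**kwargs), check=True)


if __name__ == "__main__":
    cmd = build_tucan_command(
        input_file="sample1.bed",         # hypothetical input path
        model="models/model.zip",         # hypothetical model path
        output_file="sample1_tucan.csv",  # hypothetical output path
    )
    print(shlex.join(cmd))  # show the command without executing it
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.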
The example dataset and output provided in demo_tucan can be used to verify correct installation and execution of the Tucan software. The full methodology is described in the manuscript Methods section.
See the full table: docs/tumor_abbreviation.md