UMCUGenetics/tucan

Tucan

This repository provides a pip-installable software package to run Tucan for methylation-based tumor classification in a clinical setting.

System requirements

Installation

Installation time: ~5–10 minutes on a standard desktop computer.

  • Clone the repository: git clone https://github.com/UMCUGenetics/tucan.git
  • Change into the repository: cd tucan
  • Install the project in 'editable' mode: pip install -e .
  • Download the pretrained model from Hugging Face:
    pip install -U huggingface_hub
    python -c "from src.tucan.download_model import get_model; print(get_model())"
  • Change directory: cd models
  • Zip the model: zip -r model.zip model

The model path returned by the download command (get_model) should be provided to the -m argument. As described under Usage, -m expects the path to the model zip.
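The zip step can also be done from Python, which may be convenient on systems without a zip binary. This is a minimal sketch, assuming the downloaded model sits in a directory named model inside models, as the zip command above implies. zip_model_dir is a hypothetical helper and not part of the Tucan package.

```python
import shutil
from pathlib import Path


def zip_model_dir(models_dir: str, model_name: str = "model") -> Path:
    """Create <models_dir>/<model_name>.zip containing the <model_name>/ folder.

    Roughly equivalent to `cd models && zip -r model.zip model`.
    Hypothetical helper; directory names are assumptions taken from the steps above.
    """
    models_path = Path(models_dir)
    if not (models_path / model_name).is_dir():
        raise FileNotFoundError(f"no '{model_name}' directory in {models_path}")
    archive = shutil.make_archive(
        base_name=str(models_path / model_name),  # ".zip" is appended automatically
        format="zip",
        root_dir=str(models_path),  # archive paths are relative to this directory
        base_dir=model_name,        # only the model folder is archived
    )
    return Path(archive)
```

Calling zip_model_dir("models") from the repository root returns the path of the created archive, which can then be passed to -m.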

Usage

Usage: tucan [-h] [-i INPUT_FILE] [-m MODEL] [-c NUM_CPGS] [-o OUTPUT_FILE] [-s NUM_SAMPLINGS]
                   [-f FILE_TYPE]

Options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        path to input file
  -m MODEL, --model MODEL
                        specify path to model zip you want to use.
  -c NUM_CPGS, --num_CpGs NUM_CPGS
                        specify the number of samples CpG sites (default is to use all available sites).
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        path to output file
  -s NUM_SAMPLINGS, --num_samplings NUM_SAMPLINGS
                        Specify the number of random samples of size num_CpGs. Default is 1 random
                        sampling.
  -f FILE_TYPE, --file_type FILE_TYPE
                        input file type 'bed' or 'csv'

Recommendation: For optimal performance, set -c between 10,000 and 20,000 CpG sites. Using too many CpG sites may cause the model to become overconfident and increase the likelihood of misclassification.
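To make the roles of -c and -s concrete, the sketch below draws several random subsets of CpG sites of a fixed size. It illustrates the idea only: how Tucan samples internally is not described here, so sampling without replacement, the seed parameter, and the function name draw_cpg_samples are all assumptions.

```python
import random
from typing import List, Sequence


def draw_cpg_samples(
    sites: Sequence[str],
    num_cpgs: int,
    num_samplings: int = 1,
    seed: int = 0,
) -> List[List[str]]:
    """Draw `num_samplings` random subsets of `num_cpgs` sites each.

    Illustrative only; not Tucan's implementation. Sampling is without
    replacement within a subset (an assumption), and the subset size is
    capped at the number of available sites.
    """
    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    size = min(num_cpgs, len(sites))
    pool = list(sites)
    return [rng.sample(pool, size) for _ in range(num_samplings)]


# Made-up site identifiers, used purely for demonstration.
example_sites = [f"site_{i}" for i in range(50_000)]
subsets = draw_cpg_samples(example_sites, num_cpgs=10_000, num_samplings=3)
```

With these inputs, subsets holds three lists of 10,000 distinct sites each, which corresponds to running with -c 10000 -s 3.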

Preparing data in the right format: BED or CSV

Tucan accepts methylation calls as input in either BED or CSV format.
An example BED file snippet is available here: data/bedExample.bed.

Methylation calls can be extracted from a Nanopore BAM file using modkit.
Detailed instructions for this process are available in the Sturgeon repository.

Note: The Sturgeon implementation determines methylation status at CpG sites based on the Illumina methylation array coordinates, including a ±25 bp window around each CpG site.
This means methylation calls are aggregated not only at the exact array position but also within this surrounding margin.
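The windowed aggregation described in the note can be sketched as follows. This is not the Sturgeon code: the tuple layout of the calls, the inclusive window boundary, and the choice to report a methylated fraction are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

WINDOW_BP = 25  # margin around each array CpG position, per the note above


def aggregate_window(
    probe_pos: int,
    calls: Iterable[Tuple[int, int, int]],
    window: int = WINDOW_BP,
) -> Optional[float]:
    """Pool methylation calls within +/- `window` bp of an array position.

    `calls` holds (position, n_methylated, n_total) tuples for one chromosome
    (hypothetical layout). Returns the pooled methylated fraction, or None
    if no call falls inside the window.
    """
    methylated = 0
    total = 0
    for pos, n_meth, n_total in calls:
        if abs(pos - probe_pos) <= window:  # inclusive boundary is an assumption
            methylated += n_meth
            total += n_total
    return methylated / total if total else None


# Calls at 1000 and 1020 are inside the window of probe 1000; 1030 is outside.
example_calls = [(1000, 3, 4), (1020, 1, 4), (1030, 4, 4)]
fraction = aggregate_window(1000, example_calls)
```

Here the two calls inside the window pool to 4 methylated out of 8 total, so fraction is 0.5, while the call at position 1030 is ignored.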

Demo

A small example dataset is provided in the repository under demo_tucan/Data, along with the corresponding expected output in demo_tucan/Output.

Run the demo:

tucan -i demo_tucan/Data/testSample1.merged_probes_methyl_calls.bed -m <path_to_model> -c 10000 -o demo_tucan/Output/OutcomeTestSample1.csv -s 1 -f csv

Expected runtime: ~1–5 minutes per sample on a standard CPU.

Running Tucan on your data

  1. Extract methylation calls from Nanopore BAM files (e.g. using modkit)
  2. Convert to BED or CSV format
  3. Run Tucan using the command above
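For more than one sample, step 3 can be scripted. The sketch below only assembles command lines from the flags documented under Usage; build_tucan_command and build_batch are hypothetical helpers, and the output naming scheme (input file stem plus .csv) is an assumption.

```python
from pathlib import Path
from typing import List, Sequence


def build_tucan_command(
    input_file: str,
    model_zip: str,
    output_file: str,
    file_type: str,
    num_cpgs: int = 10000,
    num_samplings: int = 1,
) -> List[str]:
    """Assemble one `tucan` invocation using the flags listed under Usage."""
    return [
        "tucan",
        "-i", str(input_file),
        "-m", str(model_zip),
        "-c", str(num_cpgs),
        "-o", str(output_file),
        "-s", str(num_samplings),
        "-f", file_type,
    ]


def build_batch(
    inputs: Sequence[str],
    model_zip: str,
    output_dir: str,
    file_type: str,
) -> List[List[str]]:
    """One command per input file; outputs are named <input stem>.csv (assumption)."""
    out_dir = Path(output_dir)
    return [
        build_tucan_command(inp, model_zip, str(out_dir / (Path(inp).stem + ".csv")), file_type)
        for inp in inputs
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) once Tucan is installed, which avoids shell quoting problems with unusual file names.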

Reproducibility

The example dataset and output provided in demo_tucan can be used to verify correct installation and execution of the Tucan software. The full methodology is described in the manuscript Methods section.
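One way to check a demo run against the expected output is to compare the two CSV files cell by cell. The sketch below assumes nothing about the column layout of Tucan's output. Because CpG sites are sampled randomly, it is an assumption that a fresh run reproduces the expected file exactly, so a mismatch might reflect sampling rather than a faulty installation. read_rows and same_csv are hypothetical helpers.

```python
import csv
from typing import List


def read_rows(path: str) -> List[List[str]]:
    """Read a CSV file into a list of rows, each a list of string cells."""
    with open(path, newline="") as fh:
        return [row for row in csv.reader(fh)]


def same_csv(path_a: str, path_b: str) -> bool:
    """True if both files contain identical rows in the same order."""
    return read_rows(path_a) == read_rows(path_b)
```

Call same_csv with the path of your own output file and the path of the corresponding file in demo_tucan/Output.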

Table: Abbreviation - Tumor Type

See the full table: docs/tumor_abbreviation.md
