Skip to content

prireto/NAGeno

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

147 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NAGeno - Nanopore Amplicon GENOtyping

DOI License

A comprehensive pipeline for SNV and indel genotyping on Nanopore Amplicon Sequencing data.

NAGeno starts with basecalled Nanopore Amplicon sequences and returns two overview genotyping tables (SNV and indel), a SNV genotype overview plot and more elaborate underlying files. It works for multiplexed samples, as long as each barcode has only been used once.

Publication

Table of contents

Introduction

Accurate genotyping made simple.

Identifying SNVs and indels is essential in molecular biology and clinical diagnostics. Sangersequencing—still the gold-standard for its high accuracy—requires manual inspection to avoid artifacts and catch low-frequency variants. However, it often struggles in GC-rich or highly repetitive regions. NGS provides even higher accuracy with automated analysis, but is typically excessive for small to medium-scale projects and routine lab workflows.

NAGeno - Nanopore Amplicon Genotyping combines high accuracy even in GC and reptitive regions with Sanger-like simplicity while ensuring scalability making genotyping both robust and effortless.

Workflow

NAGeno performs SNV and indel genotyping on fastq files of nanopore amplicon sequencing. Amplicons can cover regions of approx. 50 bp - 5 kb. In a fully automated workflow, we generate detailed tables for both SNVs and indels, along with an overview plot of SNVs per sample, by following these steps:

Created in BioRender.

Installation

Clone this repository

git clone https://github.com/prireto/NAGeno.git

Two .yml files are included into the repository at envs/scripts. For the full functionality (i.e. analysis and plotting), both of them need to be created via

conda env create -f NAGeno/envs/nageno.yml
conda env create -f NAGeno/envs/nageno_plot.yml
conda activate nageno

Note

Alternative to conda, mamba or micromamba can also be used for the creation of the environments which will be much faster. Since nageno ultimately uses the environments for some parts of the analysis, make sure that the command conda activate nageno works. Alternatively, make sure that you are setting --manager to your respective dependency manager (e.g. micromamba) while using the pipeline.

Further, the somatic variant caller, ClairS-TO, and its models need to be installed manually, as explained here.

Warning

clairs-to searches for the models at echo ${CONDA_PREFIX}/bin. This unfortunately can not be changed easily and thus you need to make sure that clairs-to_models, clairs-to_databases, and clairs-to_cna_data exist in the bin-folder of the nageno environment. You can prevent this extra step by, as described above, activating the nageno environment first and then proceed with the manual clairs-to installation.

Condensed relevant information about the manual installation of ClairS-TO (click to expand)
# in case of a timeout error (Download error (28) Timeout was reached) try modifying timeout settings (works exactly like this only for mamba and conda)
#conda config --set remote_connect_timeout_secs 30
#conda config --set remote_read_timeout_secs 30 

git clone https://github.com/HKU-BAL/ClairS-TO.git
cd ClairS-TO

# make sure in clairs-to environment
# download pre-trained models and other resources
echo ${CONDA_PREFIX}
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_models
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_databases
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_cna_data
wget http://www.bio8.cs.hku.hk/clairs-to/models/clairs-to_models.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/databases/clairs-to_databases.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/cna_data/reference_files.tar.gz
tar -zxvf clairs-to_models.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_models/
tar -zxvf clairs-to_databases.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_databases/
tar -zxvf reference_files.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_cna_data/
cd ../NAGeno

./run_clairs_to --help

Remember to deactivate the nageno env before using NAGeno.

conda deactivate

Usage

Generally, nageno can be used with two subcommands, analysis and plot.

    Usage: nageno [SUBCOMMAND] [OPTIONS]

Subcommands:

  analysis             Runs genotype analysis. Use --help for mandatory and optional inputs.
  plot                 Runs post-analysis summary and plotting functions.

Use 'nageno [SUBCOMMAND] --help' for more information on a subcommand.

Typical execution order:
  1. Run the analysis subcommand with your parameters:
     nageno analysis [YOUR OPTIONS]

  2. After completion, run the plot subcommand with the same settings:
     nageno plot [YOUR OPTIONS]

Analysis

Usage: nageno analysis --dir DIR --anno ANNO --ref REF --bed BED --txfile TXFILE [OPTIONS]

Mandatory arguments:
  --dir                DIR                  Directory containing fastq files
  --anno               ANNO                 Sample sheet file
  --ref                REF                  Reference genome file - .fa file needed, .fai files needs to be present too
  --bed                BED                  BED file for reference
  --txfile             TXFILE               File for visualization - transcript annotation needs to match the SPEFF_REF, default is RefSeq (NM_...), can also be ENSEMBL (ENST...)

Optional arguments:
  --manager            MANAGER              Package manager used to activate environments (default: conda)
  --threads            THREADS              Number of cores to use (default: 1)
  --min-q              MIN_Q                Minimum base quality (default: 30)
  --max-u              MAX_U                Percentage of bases allowed below MIN_Q (default: 10)
  --mapq               MAPQ                 Minimum mapping quality (default: 50)
  --analysis-dir       DIR                  Directory for output (default: ./analysis)
  --ext                EXT                  Sample name extension (default: SQK-RBK114-24_barcode)
  --clairs-to-path     CLAIR_PATH           Absolute path to 'run_clairs_to' - depends on where ClairS-TO was installed. (default: run_clairs_to)
  --clairs-to-model    CLAIR_MODEL          Clairs-to model (default: ont_r10_dorado_sup_5khz)
  --snpeff-ref         SNPEFF_REF           SNPeff reference genome - should always be the same as the one used for alignment (default: GRCh38.p14)

Note

Currently available clairs-to models are 'ont_r10_dorado_sup_4khz', 'ont_r10_dorado_hac_4khz', 'ont_r10_dorado_sup_5khz', 'ont_r10_dorado_sup_5khz_ss', 'ont_r10_dorado_sup_5khz_ssrs', 'ont_r10_guppy_sup_4khz', 'ont_r10_guppy_hac_5khz', 'ilmn' and 'hifi_revio'. They can be checked here.

Plot

Usage: nageno plot --dir DIR --anno ANNO --txfile TXFILE [OPTIONS]

!!! Attention !!!

Make sure you are using the same options as for the analysis.
The generated output files will otherwise not be recognized properly.

Mandatory arguments:
  --dir                DIR                  Directory containing fastq files
  --anno               ANNO                 Sample sheet file
  --txfile             TXFILE               File for visualization
  --bed                BED                  BED file for reference

Optional arguments:
  --manager            MANAGER              Package manager used to activate environments (default: conda)
  --min-q              MIN_Q                Minimum base quality (default: 30)
  --max-u              MAX_U                Percentage of bases allowed below MIN_Q (default: 10)
  --mapq               MAPQ                 Minimum mapping quality (default: 50)
  --analysis-dir       DIR                  Directory for output (default: ./analysis)
  --clairs-to-model    CLAIR_MODEL          Clairs-to model (default: ont_r10_dorado_sup_5khz)

Tutorial

Using the exemplary test data in tutorial, the correct setup can be confirmed and exemplary output can be generated:

nageno analysis \
  --dir tutorial/test_data/fastq \
  --anno tutorial/Src/barcode_assignment.tsv \
  --ref /path/to/ref/genome/hg38.fa \
  --bed tutorial/Src/geno_panel_v4.1.bed \
  --txfile tutorial/Src/tx.tsv \
  --analysis-dir tutorial/analysis \
  --threads 20 \
  --clairs-to-path /path/to/run_clairs_to
Potential errors:
  • [ERROR] file .../envs/nageno/bin/clairs-to_models/ont_r10_dorado_sup_5khz/pileup_affirmative.pkl not found: Make sure that clairs-to_models, clairs-to_databases, and clairs-to_cna_data exist in the bin-folder of the nageno environment. => The best way to ensure that is by installing ClairS-TO while the nageno env is activated.

  • [ERROR] while connecting to https://snpeff.blob.corewindows.net/databases/v5_2snpEff_v5_2[refGenomeVersion].zip: SnpEff usually downloads the required databases automatically. However, there have been occasional issues due to re-structuring in the past. In that case, try a manual download within the tool environment at .../conda/envs/tool/share/snpeff-5.2-1/ via:

    conda activate nageno #snpeff runs in the nageno env
    java -Xmx4g -jar snpEff.jar download -v [refGenomeVersion]
    conda deactivate

    or use another database. All databases can be viewed with:

    conda activate nageno #snpeff runs in the nageno env
    java -Xmx4g -jar snpEff.jar databases
    conda deactivate

    The annotation database should always match the database previously used for annotation and variant calling. You can read more on that issue here.

  • ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ... failed: During analysis, either the nageno or the nageno_plot environment is used to run commands. A common error can occur at this stage due to a known Conda issue where conda commands are not available in subshells within Bash scripts. Fortunately, the workarounds proposed in that issue resolved the problem in all our test cases.

  • EnvironmentLocationNotFound: Not a conda environment: /home/user/App/conda/envs/nageno/envs/nageno: No conda/mamba/micromamba environment should be activated when you start the script. Nageno activated the environments it needs, pre-activation causes confusion.

The nageno plot subfunction results in the creation of various different visualisations for the nageno analysis output. This is supposed to be used as a quick and comprehensive overview about the genotypes of your samples.

nageno plot \
  --dir tutorial/test_data/fastq \
  --anno tutorial/Src/barcode_assignment.tsv \
  --ref /path/to/ref/genome/hg38.fa \
  --bed tutorial/Src/geno_panel_v4.1.bed \
  --txfile tutorial/Src/tx.tsv \
  --analysis-dir tutorial/analysis \
  --threads 20 \
  --clairs-to-path /path/to/run_clairs_to

Tip

nageno plot needs less arguments than nageno analysis. Since additional arguments are ignored, the quickest way to use the plotting functionality on your results is by replacing the analysis with the plot subcommand and re-run.

Output

Output files and their respective visualisation for the provided test data are displayed below. Per sample vcf files, vcf collection files for all samples, filtered fastq and filtered bam files as well as more detailed bam depth data are saved in the analysis directory along with log files and html files generated by fastplong and SnpEff.

Table 1 – SNV genotyping results (all SNVs):

SNV_genotyping_results.tsv

SAMPLE GENE CHROM HGVS.p HGVS.c AF POS REF ALT GQ DP FILTER GT AD QUAL Annotation Annotation_Impact Feature_ID mutGeneID.p mutGeneID.c
S3 SF3B1 chr2 NA c.2078-89G>A 0.9952 197402219 C T 75 2721 NonSomatic 0/1 8;2708 75.5371 intron_variant MODIFIER NM_012433.4 NA c.2078-89G>A_SF3B1
S3 SF3B1 chr2 p.Arg625Gly c.1873C>G 0.3057 197402760 G C 64 2787 PASS 0/1 1933;852 64.2183 missense_variant MODERATE NM_012433.4 p.Arg625Gly_SF3B1 c.1873C>G_SF3B1
S3 GNAQ chr9 NA c.735+34T>C 0.7689 77794429 A G 88 3527 NonSomatic 0/1 814;2712 88.5596 intron_variant MODIFIER NM_002072.5 NA c.735+34T>C_GNAQ
S3 GNAQ chr9 p.Arg210Lys c.629G>A 0.4074 77794569 C T 60 4534 LowQual;StrandBias 0/1 2658;1847 0 missense_variant MODERATE NM_002072.5 p.Arg210Lys_GNAQ c.629G>A_GNAQ
S3 GNAQ chr9 p.Gln209Leu c.626A>T 0.4198 77794572 T A 52 4743 LowQual;StrandBias 0/1 2435;1991 0 missense_variant MODERATE NM_002072.5 p.Gln209Leu_GNAQ c.626A>T_GNAQ
S3 GNAQ chr9 NA c.606-304C>T 0.8471 77794896 G A 51 6036 NonSomatic 0/1 920;5113 51.4948 intron_variant MODIFIER NM_002072.5 NA c.606-304C>T_GNAQ
S3 SRSF2 chr17 p.Asp48Asp c.144C>T 1 76737017 G A 102 562 NonSomatic 1/1 0;562 102 synonymous_variant LOW NM_001195427.2 p.Asp48Asp_SRSF2 c.144C>T_SRSF2
S3 GNA11 chr19 p.Gln209Leu c.626A>T 0.1061 3118944 A T 21 198 PASS 0/1 176;21 21.1036 missense_variant MODERATE NM_002067.5 p.Gln209Leu_GNA11 c.626A>T_GNA11
S3 GNA11 chr19 NA c.736-20T>G 0.9537 3119186 T G 88 216 NonSomatic 0/1 8;206 88.305 intron_variant MODIFIER NM_002067.5 NA c.736-20T>G_GNA11
S3 GNA11 chr19 p.Thr257Thr c.771C>T 0.0575 3119241 C T 17 313 NonSomatic 0/1 295;18 17.4944 synonymous_variant LOW NM_002067.5 p.Thr257Thr_GNA11 c.771C>T_GNA11
S3 GNA11 chr19 NA c.889+8G>C 0.1087 3119367 G C 21 322 NonSomatic 0/1 287;35 21.6758 splice_region_variant&intron_variant LOW NM_002067.5 NA c.889+8G>C_GNA11
S3 GNA11 chr19 NA c.889+48T>G 0.8321 3119407 T G 47 280 NonSomatic 0/1 17;233 47.9067 intron_variant MODIFIER NM_002067.5 NA c.889+48T>G_GNA11
S8 SF3B1 chr2 NA c.2078-89G>A 0.9995 197402219 C T 75 1885 NonSomatic 0/1 1;1884 75.0965 intron_variant MODIFIER NM_012433.4 NA c.2078-89G>A_SF3B1
S8 GNAQ chr9 NA c.735+34T>C 0.9983 77794429 A G 89 1185 NonSomatic 0/1 2;1183 89.5603 intron_variant MODIFIER NM_002072.5 NA c.735+34T>C_GNAQ
S8 GNAQ chr9 p.Gln209Leu c.626A>T 0.2031 77794572 T A 53 1541 PASS 0/1 1219;313 53.6384 missense_variant MODERATE NM_002072.5 p.Gln209Leu_GNAQ c.626A>T_GNAQ
S8 GNAQ chr9 NA c.606-304C>T 0.9942 77794896 G A 51 2233 NonSomatic 0/1 13;2220 51.4754 intron_variant MODIFIER NM_002072.5 NA c.606-304C>T_GNAQ
S8 SRSF2 chr17 p.Asp48Asp c.144C>T 0.9956 76737017 G A 102 684 NonSomatic 0/1 3;681 102 synonymous_variant LOW NM_001195427.2 p.Asp48Asp_SRSF2 c.144C>T_SRSF2
S8 GNA11 chr19 NA c.736-20T>G 0.9868 3119186 T G 88 152 NonSomatic 0/1 2;150 88.6584 intron_variant MODIFIER NM_002067.5 NA c.736-20T>G_GNA11
S8 GNA11 chr19 NA c.889+48T>G 0.88 3119407 T G 53 175 NonSomatic 0/1 2;154 53.0168 intron_variant MODIFIER NM_002067.5 NA c.889+48T>G_GNA11

Allele frequency plot (all SNVs):

SNVs of tutorial test data

Table 2 – SNV genotyping results (protein-coding SNVs):

Prot_coding_SNV_genotyping_results.tsv

SAMPLE GENE CHROM HGVS.p HGVS.c AF POS REF ALT GQ DP FILTER GT AD QUAL Annotation Annotation_Impact Feature_ID mutGeneID.p mutGeneID.c
S3 SF3B1 chr2 p.Arg625Gly c.1873C>G 0.3057 197402760 G C 64 2787 PASS 0/1 1933;852 64.2183 missense_variant MODERATE NM_012433.4 p.Arg625Gly_SF3B1 c.1873C>G_SF3B1
S3 GNAQ chr9 p.Arg210Lys c.629G>A 0.4074 77794569 C T 60 4534 LowQual;StrandBias 0/1 2658;1847 0 missense_variant MODERATE NM_002072.5 p.Arg210Lys_GNAQ c.629G>A_GNAQ
S3 GNAQ chr9 p.Gln209Leu c.626A>T 0.4198 77794572 T A 52 4743 LowQual;StrandBias 0/1 2435;1991 0 missense_variant MODERATE NM_002072.5 p.Gln209Leu_GNAQ c.626A>T_GNAQ
S3 SRSF2 chr17 p.Asp48Asp c.144C>T 1 76737017 G A 102 562 NonSomatic 1/1 0;562 102 synonymous_variant LOW NM_001195427.2 p.Asp48Asp_SRSF2 c.144C>T_SRSF2
S3 GNA11 chr19 p.Gln209Leu c.626A>T 0.1061 3118944 A T 21 198 PASS 0/1 176;21 21.1036 missense_variant MODERATE NM_002067.5 p.Gln209Leu_GNA11 c.626A>T_GNA11
S3 GNA11 chr19 p.Thr257Thr c.771C>T 0.0575 3119241 C T 17 313 NonSomatic 0/1 295;18 17.4944 synonymous_variant LOW NM_002067.5 p.Thr257Thr_GNA11 c.771C>T_GNA11
S8 GNAQ chr9 p.Gln209Leu c.626A>T 0.2031 77794572 T A 53 1541 PASS 0/1 1219;313 53.6384 missense_variant MODERATE NM_002072.5 p.Gln209Leu_GNAQ c.626A>T_GNAQ
S8 SRSF2 chr17 p.Asp48Asp c.144C>T 0.9956 76737017 G A 102 684 NonSomatic 0/1 3;681 102 synonymous_variant LOW NM_001195427.2 p.Asp48Asp_SRSF2 c.144C>T_SRSF2

Allele frequency plot (protein-coding SNVs):

Protein-coding SNVs of tutorial test data

Table 3 – Indel genotyping results:

Indel_genotyping_results.tsv

SAMPLE CHROM AF POS REF ALT GQ DP FILTER GT AD QUAL
S8 chr17 0.1902 76736863 GGTGTGAGTCCGGGGGGCGGCCGTA G 24 594 PASS 0/1 479;113 24.9978
Table 4 – Summary of depth statistics:

Summary_depth_stats.tsv

sample contig gene median mean
S3 chr17 SRSF2 393 420.05
S3 chr19 GNA11 238 237.53
S3 chr2 SF3B1 2537 2518.72
S3 chr9 GNAQ 5992 5406.32
S8 chr17 SRSF2 458 474.01
S8 chr19 GNA11 160 152.73
S8 chr2 SF3B1 1427 1538.44
S8 chr9 GNAQ 2154 1874.96

Per sample, per gene depth plots exemplarily displayed for two genes:

Citation

NAGeno has been described and benchmarked here: [Publication](BioRXive link / doi)

Please cite NAGeno if you use it in your analysis. [BibTex key.]

Contribution

We welcome all forms of input, new ideas, user feedback, or performance improvements. If you come across any bugs or unexpected behavior, we encourage you to open an issue and include relevant error messages or context to help us troubleshoot efficiently.

License

This project is licensed under the Apache License 2.0.

Credit

Original logo concept: @aweich

About

SNV and indel genotyping by Nanopore Amplicon Sequencing and subsequenct variant calling

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors