Snakemake workflow used to deploy genome sequences and build their basic indexes.
It is intended for teaching purposes, as an example of FAIR principles applied with
Snakemake.
The usage of this workflow is described in the Snakemake workflow catalog; it is also available locally on a single page.
The expected results of this pipeline are described here.
The tools used in this pipeline are described here textually.
┌────────────────────────────────────────┐
│Download Ensembl Sequence (wget + gzip) │
└───────────────────┬────────────────────┘
                    │
                    │
┌───────────────────▼───────────────────────┐
│Remove non-canonical chromosomes (pyfaidx) │
└───────────────────┬─────────────────────┬─┘
                    │                     │
                    │                     │
┌───────────────────▼─────────┐ ┌─────────▼──────────────────────────┐
│Index DNA Sequence (samtools)│ │Create sequence dictionary (Picard) │
└─────────────────────────────┘ └────────────────────────────────────┘
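As an illustration of the "Remove non-canonical chromosomes" step above, here is a minimal pure-Python sketch. It is not the workflow's own implementation (which relies on pyfaidx). The definition of "canonical" (autosomes 1-22, X, Y and MT, named without a `chr` prefix) and the names `CANONICAL` and `keep_canonical` are assumptions made for this example.

```python
import re

# Assumption: canonical sequences are autosomes 1-22, X, Y and MT, named
# without a "chr" prefix, as in Ensembl human FASTA headers.
CANONICAL = re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT)$")


def keep_canonical(fasta_lines):
    """Yield the lines of the records whose sequence name is canonical."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            fields = line[1:].split()
            keep = bool(fields) and CANONICAL.match(fields[0]) is not None
        if keep:
            yield line


example = [
    ">1 dna:chromosome",
    "ACGT",
    ">KI270728.1 dna:scaffold",
    "TTTT",
    ">MT dna:chromosome",
    "GGCC",
]
filtered = list(keep_canonical(example))
print("\n".join(filtered))
```

The scaffold record is dropped while the two chromosome records are kept. For another species, the pattern would need to be adapted.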
Get genome annotation (GTF)
| Step | Commands |
| --- | --- |
| Download GTF annotation | ensembl-annotation |
| Fix format errors | Agat |
| Remove non-canonical chromosomes, based on above DNA Fasta | Agat |
| Remove <NA> Transcript support levels | Agat |
| Convert GTF to GenePred format | gtf2genepred |
┌─────────────────────────────────────────┐
│Download Ensembl Annotation (wget + gzip)│
└─────────────┬───────────────────────────┘
              │
              │
┌─────────────▼──────────┐
│Fix format errors (Agat)│
└─────────────┬──────────┘
              │
              │
┌─────────────▼─────────────────────────┐           ┌────────────────────────────────────────┐
│Remove non-canonical chromosomes (Agat)◄───────────┤Fasta sequence index (see Get DNA Fasta)│
└─────────────┬─────────────────────────┘           └────────────────────────────────────────┘
              │
              │
┌─────────────▼──────────────────────────────┐
│Remove <NA> transcript support levels (Agat)│
└─────────────┬──────────────────────────────┘
              │
              │
┌─────────────▼────────────────┐
│Convert GTF to GenePred (UCSC)│
└──────────────────────────────┘
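The transcript support level filter in the chain above can be pictured with a short Python sketch. The workflow itself uses Agat for this step; the code below only illustrates the idea. It assumes Ensembl-style attributes such as `transcript_support_level "NA"`, and it keeps lines that carry no such attribute (gene lines, for instance). The names `TSL`, `drop_na_tsl` and `gtf` are hypothetical helpers for this example.

```python
import re

# Assumption: Ensembl-style GTF attribute, e.g. transcript_support_level "NA";
TSL = re.compile(r'transcript_support_level "([^"]*)"')


def drop_na_tsl(gtf_lines):
    """Yield GTF lines, skipping features whose support level is NA."""
    for line in gtf_lines:
        # Header/comment lines are always kept.
        match = None if line.startswith("#") else TSL.search(line)
        if match and match.group(1).split()[:1] == ["NA"]:
            continue
        yield line


def gtf(feature, attributes):
    """Hypothetical helper building one tab-separated GTF line."""
    return "\t".join(["1", "ensembl", feature, "100", "200", ".", "+", ".", attributes])


example = [
    "#!genome-build GRCh38",
    gtf("gene", 'gene_id "G1";'),
    gtf("transcript", 'gene_id "G1"; transcript_id "T1"; transcript_support_level "1";'),
    gtf("transcript", 'gene_id "G1"; transcript_id "T2"; transcript_support_level "NA";'),
    gtf("exon", 'gene_id "G1"; transcript_id "T2"; transcript_support_level "NA";'),
]
kept = list(drop_na_tsl(example))
print(len(kept), "lines kept out of", len(example))
```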
| Step | Commands |
| --- | --- |
| Extract transcript sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |
┌───────────────────────────────┐       ┌─────────────────────────────┐
│GTF (see get genome annotation)│       │DNA Fasta (see get DNA Fasta)│
└────────────────────┬──────────┘       └────────┬────────────────────┘
                     │                           │
                     │                           │
              ┌──────▼───────────────────────────▼─────┐
              │Extract transcript sequences (gffread)  │
              └──────┬───────────────────────────┬─────┘
                     │                           │
                     │                           │
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘
| Step | Commands |
| --- | --- |
| Extract coding transcripts from above GTF | Agat |
| Extract coding sequences from above DNA Fasta and GTF | gffread |
| Index DNA sequence | samtools |
| Create sequence dictionary | picard |
┌───────────────────────────────┐       ┌─────────────────────────────┐
│GTF (see get genome annotation)│       │DNA Fasta (see get DNA Fasta)│
└────────────────────┬──────────┘       └────────┬────────────────────┘
                     │                           │
                     │                           │
              ┌──────▼───────────────────────────▼─────┐
              │Extract cDNA sequences (gffread)        │
              └──────┬───────────────────────────┬─────┘
                     │                           │
                     │                           │
┌────────────────────▼────┐             ┌────────▼───────────────────────────┐
│Index sequence (samtools)│             │Create sequence dictionary (Picard) │
└─────────────────────────┘             └────────────────────────────────────┘
┌──────────────────────────────────────────┐
│Download dbSNP variants (wget + bcftools) │
└──────────┬───────────────────────────────┘
           │
           │
┌──────────▼───────────────────────────────────────────┐
│Remove non-canonical chromosomes (bcftools + bedtools)│
└──────────┬───────────────────────────────────────────┘
           │
           │
┌──────────▼────────────┐
│Index variants (tabix) │
└───────────────────────┘
Get transcript_id, gene_id, and gene_name correspondence
| Step | Commands |
| --- | --- |
| Extract gene_id <-> gene_name correspondence | pyroe |
| Extract transcript_id <-> gene_id <-> gene_name | Agat + XSV |
┌────────────────────────────────┐
│Genome annotation (see get GTF) ├───────────────┐
└──────┬─────────────────────────┘               │
       │                                         │
       │                                         │
┌──────▼──────────────────────────────┐ ┌────────▼─────────────────────────────────────────────┐
│Extract gene_id <-> gene_name (pyroe)│ │Extract gene_id <-> gene_name <-> transcript_id (Agat)│
└──────┬──────────────────────────────┘ └────────┬─────────────────────────────────────────────┘
       │                                         │
       │                                         │
┌──────▼─────┐                          ┌────────▼────┐
│Format (XSV)│                          │Format (XSV) │
└────────────┘                          └─────────────┘
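To make the content of these identifier tables concrete, the following Python sketch extracts transcript_id, gene_id and gene_name triplets from GTF lines. It is only an illustration and does not reproduce what pyroe, Agat or XSV do internally. It assumes GTF attributes written as `key "value";` and reads only `transcript` features; `ATTRIBUTE` and `transcript_table` are names invented for this example.

```python
import re

# Assumption: GTF attributes are written as: key "value";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def transcript_table(gtf_lines):
    """Return (transcript_id, gene_id, gene_name) for each transcript feature."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # A GTF record has 9 columns; the third one is the feature type.
        if len(columns) < 9 or columns[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(columns[8]))
        rows.append(
            (
                attrs.get("transcript_id", ""),
                attrs.get("gene_id", ""),
                attrs.get("gene_name", ""),
            )
        )
    return rows


example = [
    "#!genome-build GRCh38",
    "\t".join(["1", "havana", "gene", "1", "900", ".", "+", ".",
               'gene_id "ENSG01"; gene_name "GENE1";']),
    "\t".join(["1", "havana", "transcript", "1", "900", ".", "+", ".",
               'gene_id "ENSG01"; transcript_id "ENST01"; gene_name "GENE1";']),
]
rows = transcript_table(example)
for row in rows:
    print("\t".join(row))
```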
┌────────────────────────────────┐
│Download known blacklists (wget)│
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────────────────┐
│Merge overlapping intervals (bedtools) │
└───────────────────────────────────────┘
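The interval-merging step can be sketched in a few lines of Python. This is an illustration of the technique, not the bedtools implementation. It assumes BED-like half-open intervals and, like the default behaviour of `bedtools merge`, also merges book-ended intervals. The function name `merge_intervals` is hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    # Sorting groups intervals by chromosome, then by start position.
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlap (or touching ends): extend the previous interval.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


blacklist = [
    ("1", 100, 200),
    ("1", 150, 300),
    ("1", 300, 400),
    ("2", 10, 20),
    ("1", 500, 600),
]
merged_blacklist = merge_intervals(blacklist)
for chrom, start, end in merged_blacklist:
    print(chrom, start, end, sep="\t")
```

Note that chromosome names are sorted as strings here, which is enough for merging but does not give a natural chromosome order.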
┌────────────────────────────────┐
│Genome annotation (see get GTF) │
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────┐
│GTFtoGenePred (UCSC-tools) │
└───────────────────────────┘
┌────────────────────────────────┐
│Genome sequence (see get Fasta) │
└────────────┬───────────────────┘
             │
             │
┌────────────▼──────────────┐
│FaToTwoBit (UCSC-tools)    │
└───────────────────────────┘
| Step | Commands |
| --- | --- |
| STAR index | STAR |
┌────────────────────────────────┐
│Genome sequence (see get DNA)   │
└────────────┬───────────────────┘
             │
             │
     ┌───────▼────┐
     │ STAR index │
     └────────────┘
Salmon decoy-aware gentrome index
| Step | Commands |
| --- | --- |
| Generate decoy | Bash |
| Salmon index | Salmon |
┌─────────────────────────────┐        ┌─────────────────────────────────────┐
│Genome sequence (see get DNA)│        │Transcriptome sequence (see get cDNA)│
└──────────────────────────┬──┘        └─────┬───────────────────────────────┘
                           │                 │
                           │                 │
                           │                 │
                      ┌────▼─────────────────▼────┐
                      │Generate decoy and gentrome│
                      └─────────────┬─────────────┘
                                    │
┌─────────────────┐                 │     ┌───────────────┐
│Gentrome sequence◄─────────────────┴─────►Decoy sequences│
└────────────┬────┘                       └────┬──────────┘
             │                                 │
             │                                 │
             │        ┌──────────────┐         │
             └────────► Salmon index ◄─────────┘
                      └──────────────┘
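The "Generate decoy and gentrome" step can be illustrated with a small Python sketch; the workflow itself does this in Bash. It follows the usual Salmon convention: the decoy list holds the names of the genome sequences, and the gentrome is the transcriptome followed by the genome. The function name `build_gentrome` and the example records are assumptions made for this illustration.

```python
def build_gentrome(transcriptome_lines, genome_lines):
    """Return (decoy names, gentrome lines) from two FASTA line lists."""
    genome = list(genome_lines)
    # Decoys are the genome sequence names: first word after '>'.
    decoys = [
        line[1:].split()[0]
        for line in genome
        if line.startswith(">") and line[1:].split()
    ]
    # Transcript records come first, genome (decoy) records last.
    gentrome = list(transcriptome_lines) + genome
    return decoys, gentrome


transcriptome = [">ENST01 cdna", "ACGTAC"]
genome = [">1 dna:chromosome", "ACGTACGTAA", ">MT dna:chromosome", "GGCCTT"]

decoys, gentrome = build_gentrome(transcriptome, genome)
print("\n".join(decoys))
print(len(gentrome), "gentrome lines")
```

In practice the two outputs would be written to files and handed to the Salmon indexing step; that part is left out of the sketch.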