De Bruijn graphs for de novo genome assembly and alignment

This algorithm was designed with pandemic preparedness in mind by identifying the origin of zoonotic diseases for quick identification of potential treatments.

Given a fastq file of short RNA-seq reads and a fasta file of reference genomes, this algorithm first assembles the short reads into contigs then maps the assembled contigs to the reference genomes. The assembled contigs are provided in a .fasta file and their alignment with viromes are described in a .aln file.

Genome assembly is done by using De Bruijn graphs, where each sequence is broken into k-mers and these k-mers are mapped as a graph. The k-mers are then assembled based on the longest possible paths through the graph until each k-mer is used. The assembled contigs are then mapped to reference genomes by sliding the contig across the reference genome until a section of the virome is match to the contig based on the specified minimum similarity.

Set Up

All set up and running assumes you are in the directory this repository was cloned into. The .env file is set to run as is, but paths can be adjusted as desired.

1. Download SRA Toolkit to download data

Once downloaded and set up, the following commands can be used to download an example .fastq file from bat fecal samples.
Note: it may be necessary to put the full sratoolkit/bin path to run the prefetch and fasterq-dump commands

mkdir data
prefetch SRR12464727 -O data
fasterq-dump data/SRR12464727 --outdir data/SRR12464727

2. Download conda

3. Create environment

conda env create -f environment.yml
conda activate Hoffman_Day3

Run Scripts

All scripts are set up as jupyter notebooks (included in environment) to visualize all outputs. Each scripts outputs and analysis were made from a subset of 2,000 reads. However, the scripts are set up to run on a subset of 1,000 reads to allow for a quick run through starting from script 02. All notebooks functions, inputs, parameters, and outputs are described below as well as in each notebook.

1. scripts/01_dataExploration_QC.ipynb

The goal of this notebook is to conduct initial data exploration, quality control, and visualization of a fastq file of short RNA-seq reads. The following is done:

Plot lengths of reads
Plot average quality score for reads across alleles
Remove reads with average quality scores less than 30
Identify overrepresented reads

This script takes about 40 minutes to run all the way through

Inputs:

SRR.fastq.gz gzip fastq file of short reads (downloaded in step 1, too large to include in repo)

Outputs:

SRR.json.gz: dictionary of sequences to be used for genome assembly (data/SRR12464727/SRR12464727.subset.1000.json.gz) lowQualityReads.txt.gz list of reads with average quality of >30 to be removed from analysis (data/SRR12464727/SRR12464727_lowQualityReads.txt.gz)

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727)

2. scripts/02_genomeAssembly.ipynb

The goal of this notebook is to assemble short reads from a fastq file into a virome. To achieve this the following is done:

Splits reads into kmers
Assembles kmers from sequences in 2 into genome
Visualizes length and number of reads of assembled sequences

With default settings, this script takes about 30 minutes to run.

Inputs:

SRR.json.gz: dictionary of QC'd sequences from a fastq file (output of step 1, 01_dataExploration_QC.ipynb)

Outputs:

.fasta: fasta file of assembled viromes (outputs/SRR12464727.assembledGenomes.1000.fasta)

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed, after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727)
kmerSize: fraction of kmer length to be used to calculate kmer size (default 0.5)

3. scripts/03_sequenceMatching.ipynb

The goal of this notebook is to map assembled viromes to known sequences from .fasta file. This notebook does the following:

Maps assembled genomes to known viromes
Visualizes the number of matches and their percent similarity

With default settings, this script takes about 20 minutes to run.

Inputs:

.fasta: fasta file of assembled viromes (output of step 2, 02_genomeAssembly.ipynb)

Outputs:

SRR.aln: file describing how assembled genomes align with matched virome with the following columns:

(assumes reference is forward aligned)

vID: ID of virome that contig was mapped to
cID: ID of contig that was mapped to virome
cstart: starting position of where contig matches to vstart
cend: ending position of where contig matches to vend
vestart: starting position of where virome matches to cstart
vend: ending position of where virome matches to cstart
ratio: ratio of similarity between contig and matched section of virome
seq: sequence of matched section of virome

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed, after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727) minSimilarityRatio: minimum similarity ratio required for match (default 0.4) shift: number of bps to shift along the reference virome by (default 10)

Testing

Unit testing was done for each function used in the previous section. The script containing all the functions can be found in scripts/functions.py. All tests and descriptions can be found in scripts/unitTesting.py All 6 tests were passed. To run tests:

Testing takes about 5 minutes to run.

cd scripts
python -m unittest unitTesting.py

Parameter Evalutaion

1. K-mer size in assembly

A smaller k-mer size is less stringent in assembling contigs. This results in fewer but longer contigs to be assembled compared to a larger k-mer size. The distribution of sequence lengths when using k-mer size of 1/2 (75-bps) and 2/3 (100-bps) the length of the reads are shown below.

2. Shift amount in mapping

A smaller shift would result in more coverage of the reference genome, but significantly increases the time it takes to run. The effect of using different shifts on run time and the number of matched sequences using a k-mer size of 75 is shown below.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
data		data
images		images
outputs		outputs
report		report
scripts		scripts
testing_outputs		testing_outputs
.env		.env
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

De Bruijn graphs for de novo genome assembly and alignment

Set Up

1. Download SRA Toolkit to download data

2. Download conda

3. Create environment

Run Scripts

1. scripts/01_dataExploration_QC.ipynb

Inputs:

Outputs:

Parameters:

2. scripts/02_genomeAssembly.ipynb

Inputs:

Outputs:

Parameters:

3. scripts/03_sequenceMatching.ipynb

Inputs:

Outputs:

Parameters:

Testing

Parameter Evalutaion

1. K-mer size in assembly

2. Shift amount in mapping

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

De Bruijn graphs for de novo genome assembly and alignment

Set Up

1. Download SRA Toolkit to download data

2. Download conda

3. Create environment

Run Scripts

1. scripts/01_dataExploration_QC.ipynb

Inputs:

Outputs:

Parameters:

2. scripts/02_genomeAssembly.ipynb

Inputs:

Outputs:

Parameters:

3. scripts/03_sequenceMatching.ipynb

Inputs:

Outputs:

Parameters:

Testing

Parameter Evalutaion

1. K-mer size in assembly

2. Shift amount in mapping

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages