Skip to content

jillhoffman/HoffmanJill_CPBSPrelim_Day3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

De Bruijn graphs for de novo genome assembly and alignment

This algorithm was designed with pandemic preparedness in mind by identifying the origin of zoonotic diseases for quick identification of potential treatments.

Given a fastq file of short RNA-seq reads and a fasta file of reference genomes, this algorithm first assembles the short reads into contigs then maps the assembled contigs to the reference genomes. The assembled contigs are provided in a .fasta file and their alignment with viromes are described in a .aln file.

Genome assembly is done by using De Bruijn graphs, where each sequence is broken into k-mers and these k-mers are mapped as a graph. The k-mers are then assembled based on the longest possible paths through the graph until each k-mer is used. The assembled contigs are then mapped to reference genomes by sliding the contig across the reference genome until a section of the virome is match to the contig based on the specified minimum similarity.

Set Up

All set up and running assumes you are in the directory this repository was cloned into. The .env file is set to run as is, but paths can be adjusted as desired.

1. Download SRA Toolkit to download data

Once downloaded and set up, the following commands can be used to download an example .fastq file from bat fecal samples.
Note: it may be necessary to put the full sratoolkit/bin path to run the prefetch and fasterq-dump commands

mkdir data
prefetch SRR12464727 -O data
fasterq-dump data/SRR12464727 --outdir data/SRR12464727

2. Download conda

3. Create environment

conda env create -f environment.yml
conda activate Hoffman_Day3

Run Scripts

All scripts are set up as jupyter notebooks (included in environment) to visualize all outputs. Each scripts outputs and analysis were made from a subset of 2,000 reads. However, the scripts are set up to run on a subset of 1,000 reads to allow for a quick run through starting from script 02. All notebooks functions, inputs, parameters, and outputs are described below as well as in each notebook.

1. scripts/01_dataExploration_QC.ipynb

The goal of this notebook is to conduct initial data exploration, quality control, and visualization of a fastq file of short RNA-seq reads. The following is done:

  1. Plot lengths of reads
  2. Plot average quality score for reads across alleles
  3. Remove reads with average quality scores less than 30
  4. Identify overrepresented reads

This script takes about 40 minutes to run all the way through

Inputs:

SRR.fastq.gz gzip fastq file of short reads (downloaded in step 1, too large to include in repo)

Outputs:

SRR.json.gz: dictionary of sequences to be used for genome assembly (data/SRR12464727/SRR12464727.subset.1000.json.gz) lowQualityReads.txt.gz list of reads with average quality of >30 to be removed from analysis (data/SRR12464727/SRR12464727_lowQualityReads.txt.gz)

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727)

2. scripts/02_genomeAssembly.ipynb

The goal of this notebook is to assemble short reads from a fastq file into a virome. To achieve this the following is done:

  1. Splits reads into kmers
  2. Assembles kmers from sequences in 2 into genome
  3. Visualizes length and number of reads of assembled sequences

With default settings, this script takes about 30 minutes to run.

Inputs:

SRR.json.gz: dictionary of QC'd sequences from a fastq file (output of step 1, 01_dataExploration_QC.ipynb)

Outputs:

.fasta: fasta file of assembled viromes (outputs/SRR12464727.assembledGenomes.1000.fasta)

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed, after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727)
kmerSize: fraction of kmer length to be used to calculate kmer size (default 0.5)

3. scripts/03_sequenceMatching.ipynb

The goal of this notebook is to map assembled viromes to known sequences from .fasta file. This notebook does the following:

  1. Maps assembled genomes to known viromes
  2. Visualizes the number of matches and their percent similarity

With default settings, this script takes about 20 minutes to run.

Inputs:

.fasta: fasta file of assembled viromes (output of step 2, 02_genomeAssembly.ipynb)

Outputs:

SRR.aln: file describing how assembled genomes align with matched virome with the following columns:

(assumes reference is forward aligned)

  • vID: ID of virome that contig was mapped to
  • cID: ID of contig that was mapped to virome
  • cstart: starting position of where contig matches to vstart
  • cend: ending position of where contig matches to vend
  • vestart: starting position of where virome matches to cstart
  • vend: ending position of where virome matches to cstart
  • ratio: ratio of similarity between contig and matched section of virome
  • seq: sequence of matched section of virome

Parameters:

subsetNum: Number of sequences from fastq file to be analyzed, after removal of low quality reads (default 1000) SRR: ID for fastq file (default SRR12464727) minSimilarityRatio: minimum similarity ratio required for match (default 0.4) shift: number of bps to shift along the reference virome by (default 10)

Testing

Unit testing was done for each function used in the previous section. The script containing all the functions can be found in scripts/functions.py. All tests and descriptions can be found in scripts/unitTesting.py All 6 tests were passed. To run tests:

Testing takes about 5 minutes to run.

cd scripts
python -m unittest unitTesting.py

Parameter Evalutaion

1. K-mer size in assembly

A smaller k-mer size is less stringent in assembling contigs. This results in fewer but longer contigs to be assembled compared to a larger k-mer size. The distribution of sequence lengths when using k-mer size of 1/2 (75-bps) and 2/3 (100-bps) the length of the reads are shown below.

alt text alt text

2. Shift amount in mapping

A smaller shift would result in more coverage of the reference genome, but significantly increases the time it takes to run. The effect of using different shifts on run time and the number of matched sequences using a k-mer size of 75 is shown below.

alt text alt text

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors