1. Background

purge_dups purges/deletes haplotigs and overlaps in an assembly based on read depth.

See https://github.com/dfguan/purge_dups

2. Dependencies

This step should be run after a raw assembly is generated with HiFi reads OR after the Arrow-polishing step of a genome generated with CLR reads
purge_dups 1.2.5
slurm workload manager https://slurm.schedmd.com/documentation.html

3. Setup

3.1 create a text file in the working directory called "pbfofn" listing the complete path(s) of the PacBio fastq file, one per line

cd workingdir

ln -s /path/to/hifi_1_consensus.fastq
ln -s /path/to/hifi_2_consensus.fastq

ls -1 *fastq > pbfofn

3.2 run pd_config.py to generate a config file for the purge_dups run, make sure you have correct path to reference fasta file

usage: pd_config.py [-h] [-s SRF] [-l LOCD] [-n FN] [--version] ref pbfofn

generate a configuration file in json format

positional arguments:
  ref                   reference file in fasta/fasta.gz format
  pbfofn                list of pacbio file in fastq/fasta/fastq.gz/fasta.gz format (one absolute file path per line)

optional arguments:
  -h, --help            show this help message and exit
  -s SRF, --srfofn SRF  list of short reads files in fastq/fastq.gz format (one record per line, the
                        record is a tab splitted line of abosulte file path
                        plus trimmed bases, refer to
                        https://github.com/dfguan/KMC) [NONE]
  -l LOCD, --localdir LOCD
                        local directory to keep the reference and lists of the
                        pacbio, short reads files [.]
  -n FN, --name FN      output config file name [config.json]
  --version             show program's version number and exit

This step should create a config.json file as specified with the --name parameter

3.3 Edit config.json

Open the config.json file and edit lines to skip BUSCO and KCM steps.

In other words change busco and kcm blocks FROM "skip": 0 TO "skip": 1

4. Run the script

The script that performs this step is called purge_dups.slurm.sh

Edit it according to the information of your genome.

To run the script

sbatch purge_dups.slurm.sh

5. Outputs

After the pipeline is finished, there will be four new directories in the working directory (set in the configuration file).

coverage: coverage cutoffs, coverage histogram and base-level coverage files
split_aln: segmented assembly file and a self-alignment paf file.
purge_dups: duplicate sequence list.
seqs: purged primary contigs ending with .purge.fa and haplotigs ending with .red.fa, also K-mer comparison plot and busco results are also in this directory.

6. QC the assembly

We provide a script BUSCO_metrics_slurm.sh that runs assemblathon.pl to calculate assembly contiguity.

It also runs BUSCO (database: insecta_odb10) to calculate genome completeness.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1. Background

2. Dependencies

3. Setup

3.1 create a text file in the working directory called "pbfofn" listing the complete path(s) of the PacBio fastq file, one per line

3.2 run pd_config.py to generate a config file for the purge_dups run, make sure you have correct path to reference fasta file

3.3 Edit config.json

4. Run the script

5. Outputs

6. QC the assembly

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

1. Background

2. Dependencies

3. Setup

3.1 create a text file in the working directory called "pbfofn" listing the complete path(s) of the PacBio fastq file, one per line

3.2 run pd_config.py to generate a config file for the purge_dups run, make sure you have correct path to reference fasta file

3.3 Edit config.json

4. Run the script

5. Outputs

6. QC the assembly