thirtysix/reduce_seq_redundancy

reduce_redundancy

Purpose

The goal of this script is to reduce the redundancy in a set of sequences.

Input

  • A FASTA file of sequences. Each FASTA entry description must be shorter than 128 characters.
  • Distance threshold.
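
As an illustration of the input constraints above, the following is a minimal sketch of a FASTA reader that rejects descriptions of 128 characters or more. The function name `read_fasta` and the parameter `max_description_length` are hypothetical and are not taken from the script itself.

```python
def read_fasta(lines, max_description_length=128):
    """Parse FASTA lines into a {description: sequence} dictionary.

    Hypothetical helper: raises ValueError when a description is not
    shorter than max_description_length characters.
    """
    parts = {}
    description = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            description = line[1:]
            if len(description) >= max_description_length:
                raise ValueError(
                    "description too long (%d characters): %s..."
                    % (len(description), description[:30])
                )
            parts[description] = []
        elif description is not None:
            # Sequence data may be wrapped over several lines.
            parts[description].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}
```

It could be called on an open file handle, for example `read_fasta(open(input_seqs_filename))`, where `input_seqs_filename` is the variable mentioned under Usage.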

Process

The script performs the following steps:

  1. Generate a distance matrix from the sequences in the input file using Clustal Omega.
  2. Parse distance matrix to dictionary.
  3. For each FASTA entry, identify as a cluster those other sequences with a distance less than the given threshold.
  4. For each sequence in each cluster, generate a total combined distance score to the other members in the cluster.
  5. Keep the entry with the lowest distance to the other members of the cluster as a representative sequence. Ignore the other members of the cluster.
  6. Keep all other entries which have not been included in any cluster.
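
The steps above can be sketched in Python as follows. This is one possible reading of the process, not the script's actual code. It makes several assumptions: the `clustalo` executable is on the PATH; the distance matrix file starts with a line holding the number of sequences, followed by one row per sequence (a whitespace-free name, then its distances); and entries already placed in a cluster are not considered again. All function names are hypothetical.

```python
def build_clustalo_command(fasta_path, matrix_path, alignment_path):
    """Return an argument list asking Clustal Omega for a distance matrix."""
    return [
        "clustalo",
        "-i", fasta_path,
        "-o", alignment_path,
        "--distmat-out=" + matrix_path,
        "--full",   # compute the full pairwise distance matrix
        "--force",  # overwrite existing output files
    ]


def parse_distance_matrix(text):
    """Parse the matrix text into a nested {name: {other_name: distance}} dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = [line.split() for line in lines[1:]]  # first line is the sequence count
    names = [fields[0] for fields in rows]
    return {
        fields[0]: {other: float(value) for other, value in zip(names, fields[1:])}
        for fields in rows
    }


def select_representatives(distances, threshold):
    """Return the names to keep: cluster representatives and unclustered entries."""
    kept = []
    assigned = set()
    for name in distances:
        if name in assigned:
            continue  # already absorbed into an earlier cluster
        cluster = [name] + [
            other
            for other, distance in distances[name].items()
            if other != name and other not in assigned and distance < threshold
        ]
        # The representative has the lowest combined distance to the other
        # members; a cluster of one keeps its only entry.
        representative = min(
            cluster,
            key=lambda member: sum(
                distances[member][other] for other in cluster if other != member
            ),
        )
        kept.append(representative)
        assigned.update(cluster)
    return kept
```

The command list could be run with `subprocess.run(build_clustalo_command(...), check=True)`, after which the matrix file is read and passed to `parse_distance_matrix`. For a matrix in which `seqA`, `seqB`, and `seqC` all lie within 0.10 of each other and `seqD` is distant from all three, `select_representatives` keeps `seqD` plus whichever of the three has the smallest summed distance to the other two.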

Output

  • A FASTA file of all cluster representative sequences and non-clustered sequences.
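
Writing the output file could look like the sketch below. The function name `write_fasta`, its parameters, and the 60-character line width are assumptions made for illustration; `entries` is taken to be a dictionary mapping descriptions to sequences, and `kept` the list of descriptions retained after clustering.

```python
def write_fasta(entries, kept, handle, width=60):
    """Write the kept entries to an open text handle in FASTA format."""
    for description in kept:
        sequence = entries[description]
        handle.write(">" + description + "\n")
        # Wrap the sequence so that no line exceeds `width` characters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
```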

Dependencies

Usage

  • Edit the input_seqs_filename variable of the 'Initiating Variables' section to point to the input sequences.
  • Example run: 12,610 related protein sequences (retrieved by iterative BLAST search) were reduced to 2,048 sequences, using a distance threshold of 0.10, in 20 minutes on an i7 processor.
