1212
1313
1414###############################################################
15- # AlignmentProcessor0.11 Package
15+ # AlignmentProcessor0.20 Package
1616#
1717# Dependencies: Python 3
1818# Python 3 version of Biopython
1919# Perl (if using KaKs_Calculator)
2020# PAML (if using CodeML)
21- # R (if using CodeML)
22- # ape R package (if using CodeML)
21+ # PhyML (if using CodeML)
2322###############################################################
2423
2524### Contents ###
3736AlignmentProcessor is a pipeline meant to quickly convert a multi-fasta
3837alignment file into a format that can be read by KaKs_Calculator or PAML and
3938optionally run those programs. When running codeml, AlignmentProcessor will
40- use input control and tree files as templates to create unique control and
41- tree files for each gene. It will call the ape R package to dynamically trim
42- the given phylogenic tree so that only species which remain in each gene's
43- alignment after trimming are represented in that gene's tree.
39+ use an input control file as a template to create a unique control file for
40+ each gene. It will call PhyML to create a unique phylogenetic tree for each
41+ gene based on its sequences.
4442
4543You can run the AlignmentProcessor wrapper which will call all of the python
4644scripts in sequence, or you may call each script individually.
@@ -63,34 +61,27 @@ a terminal and Anaconda will install Biopython for you:
6361
6462# KaKs_Calculator
6563
66- AlignmentProcessor0.11 is packaged with KaKs_Calculator2.0 binaries for Linux
64+ AlignmentProcessor0.20 is packaged with KaKs_Calculator2.0 binaries for Linux
6765and Windows, and a KaKs_Calculator1.2 binary for Mac (there is no 2.0 binary
6866available for OSX). Before using, copy or move the appropriate binary for your
6967system into the AlignmentProcessor bin which contains the python scripts.
7068
71- # PAML 4.8
69+ # PAML
7270
7371If you plan to use CodeML, you must first download PAML
7472(http://abacus.gene.ucl.ac.uk/software/paml.html) and move the folder into the
7573AlignmentProcessor directory. Make sure that it is titled "paml".
7674
77- # Ape
78-
79- The most straightforward way to install ape, and most R packages, is through
80- Bioconductor. If you do not have Bioconductor installed, open R and paste:
81-
82- source("https://bioconductor.org/biocLite.R")
83- biocLite()
84-
85- To install ape, enter:
86-
87- library("BiocInstaller")
88- biocLite("ape")
89-
75+ # PhyML
76+ If you plan to use CodeML, you must also download PhyML
77+ (http://www.atgc-montpellier.fr/phyml/binaries.php). Similar to PAML, you must
78+ move the folder into the AlignmentProcessor directory and change the name of
79+ both the folder and the binary for your operating system to "PhyML".
9080
9181#-------------------------------
9282# 1. Obtaining a fasta alignment
9383#-------------------------------
84+
9485# UCSC Fasta Alignment
9586It is possible to download CDS fasta alignments from the UCSC Table browser.
9687This does, unfortunately, limit you to currently available alignments.
@@ -112,10 +103,16 @@ If you did not use a UCSC genome for the reference species in your alignment,
112103you may need to upload the reference genome that you used as a custom build.
113104Make sure that the genome, maf file, and BED file are all set to the custom
114105build, and that the reference species build name is identical in all three
115- files. Additionally, you may not be able to use the UCSC BED12 file if you did
106+ files.
107+
108+ Additionally, you may not be able to use the UCSC BED12 file if you did
116109not use a UCSC genome. If that is the case, you can either upload your own,
117110or, if you used an Ensembl genome, you can just remove the "chr_UN" and "chr"
118- chromosome prefixes from the file, and resubmit the file to Galaxy.
111+ chromosome prefixes from the file, and resubmit the file to Galaxy. For NCBI
112+ genomes, you may download the GFF file from NCBI Genome, convert it with the
113+ UCSC utility "gff3ToGenePred" using the -useName and -honorStartStopCodons
114+ options, and then convert the result with the UCSC utility "genePredToBed".
115+ This will return a BED12 file which may be submitted to Galaxy.
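
The two BED12 routes described above (stripping chromosome prefixes for Ensembl genomes, and converting an NCBI GFF with the UCSC utilities) can be sketched in Python as follows. This is a minimal illustration and not part of AlignmentProcessor: the function and file names are hypothetical, the UCSC utilities are assumed to be installed and on the PATH, and the prefix handling assumes that "chr_UN" or "chr" appears at the start of the first BED column.

```python
import subprocess


def ncbi_gff_to_bed12_commands(gff, genepred, bed):
    """Build the two UCSC utility calls that turn an NCBI GFF into BED12.

    File names are placeholders supplied by the caller.
    """
    return [
        # GFF3 -> genePred, using the options named in the README
        ["gff3ToGenePred", "-useName", "-honorStartStopCodons", gff, genepred],
        # genePred -> BED12
        ["genePredToBed", genepred, bed],
    ]


def run_conversion(gff, genepred, bed):
    """Run both commands in order, stopping if either one fails."""
    for cmd in ncbi_gff_to_bed12_commands(gff, genepred, bed):
        subprocess.run(cmd, check=True)


def strip_chrom_prefix(bed_line, prefixes=("chr_UN", "chr")):
    """Remove a leading "chr_UN" or "chr" from the chromosome column.

    Assumption: the BED file is tab-delimited and the longer prefix
    should be tried first so that "chr_UN" is not left as "_UN".
    """
    fields = bed_line.rstrip("\n").split("\t")
    for prefix in prefixes:
        if fields[0].startswith(prefix):
            fields[0] = fields[0][len(prefix):]
            break
    return "\t".join(fields)
```

For example, run_conversion("genome.gff", "genome.gp", "genome.bed") would produce a BED12 file, assuming both utilities are available on your system.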
119116
120117Upload your maf and BED files to Galaxy (usegalaxy.org) (or retrieve a BED
121118file of the genes for your reference species using the UCSC Main link under
@@ -134,7 +131,7 @@ output file.
134131
135132This process will take a few hours, so plan accordingly.
136133
137- Since this method offer the most flexibility for working with alignments,
134+ Since this method offers the most flexibility for working with alignments,
138135AlignmentProcessor was written with this output format in mind and no further
139136formatting is required.
140137
@@ -156,14 +153,15 @@ If you are running CodeML and the program is interrupted, you may call the
15615307_CodeMLonDir.py script and it will continue where CodeML left off. This
157154will save the time of having to run those genes through CodeML again (It will
158155do the same thing if you call the entire pipeline again, but there is no need
159- to rerun the previous steps). It will not do the same for KaKs_Calculator
156+ to re-run the previous steps). It will not do the same for KaKs_Calculator
160157since KaKs_Calculator is much faster, so it should not be a problem to just
161158invoke KaKs_Calculator on the whole directory again.
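
The resume behaviour described above can be pictured with a short Python sketch. It is not the code used by 07_CodeMLonDir.py: the function name, the ".phylip" input extension, and the rule that an existing ".mlc" file with the same base name marks a gene as finished are all assumptions made for illustration.

```python
import os


def genes_still_to_run(input_dir, output_dir, in_ext=".phylip", out_ext=".mlc"):
    """Return the input files that have no matching CodeML output yet.

    Assumption: an output named <gene><out_ext> means <gene><in_ext>
    has already been processed.
    """
    # Base names of genes that already have an output file
    done = {
        os.path.splitext(name)[0]
        for name in os.listdir(output_dir)
        if name.endswith(out_ext)
    }
    todo = []
    for name in sorted(os.listdir(input_dir)):
        stem, ext = os.path.splitext(name)
        if ext == in_ext and stem not in done:
            todo.append(name)
    return todo
```

A check like this would treat a partially written output file from an interrupted run as complete, so a real implementation may need an additional test on the file contents.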
162159
163160# Example Usage:
164161
165162 python AlignmentProcessor.py --ucsc --axt/phylip --kaks/codeml \
166- --retainStops -% <decimal> -r <reference species> -i <input fasta file> \
163+ --retainStops -% <decimal> -f <forward branch of codeml tree> \
164+ -r <reference species> -i <input fasta file> \
167165 -o <path to output directory>
168166
169167# Required Arguments:
@@ -214,6 +212,11 @@ invoke KaKs_Calculator on the whole directory again.
214212 AlignmentProcessor can call multiple instances of CodeML to
215213 shorten overall run time. (Default = 1)
216214
215+ -f the build or common name (if you use the --changeNames flag) of the
216+ species on the forward branch of the phylogenetic tree supplied to
217+ CodeML. This species does not have to be the same as the reference
218+ species.
219+
217220# Additional Commands
218221
219222 -h/--help will print the program's help dialogue
@@ -245,39 +248,19 @@ invoke KaKs_Calculator on the whole directory again.
245248 Codeml requires that all of its parameters be specified in one control
246249 file (http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf). Provide a
247250 control file with your desired parameters and AlignmentProcessor will
248- use it as template. AlignmentProcessor uses the Biopython codeml module
249- to more easily edit the names of the input and output files. Because of
250- this, however, you must make sure that the first three lines (starting
251- with "seqfile", "treefile", or "outfile" have been removed from the
252- control file. Otherwise, CodeML will presented only with the data
253- contained in the control file, and not the whole directory of alingments.
251+ use it as a template.
254252
255253 The control file must be titled “codeml.ctl”, and it must be
256254 located in the output directory.
257255
258- # The CodeML tree file
259-
260- If your CodeML analysis requires a phylogenic tree, provide your
261- desired tree, titled "codeml.tree" in the output directory. Specify
262- the tree as you want to appear to CodeML and be sure that the species
263- names are specified as the common names in the 02_nameList.txt file,
264- but trimmed to ten characters (some programs still set a ten character
265- limit on the length of names, so AlignmentProcessor trims the names).
266- The 07_CodeMLonDir.py script will save any nodes you have specified
267- with a "#" before sending a plain Newick tree to ape (which will not
268- work if there are PAML node symbols). It will then add any nodes back
269- into the tree after it has been trimmed. AlignmentProcessor will not
270- currently save nodes specified with "$" since it is difficult to
271- determine where a nested clade begins and ends.
272-
273256# Invoking the Ka/Ks pipeline with a UCSC alignment:
274257
275- python AlignmentProcessor0.11 .py --axt --kaks --ucsc -r anoCar2 \
258+ python AlignmentProcessor.py --axt --kaks --ucsc -r anoCar2 \
276259 -i anolis_gallus.fa -o pairwiseKaKs/
277260
278261# Invoking the CodeML pipeline with a de novo alignment:
279262
280- python AlignmentProcessor0.11 .py --phylip --codeml -% 0.6 \
263+ python AlignmentProcessor.py --phylip --codeml -% 0.6 \
281264 -r anoCar2 -i anolis_gallus.fa -o codemlOutput/
282265
283266#-------------------------------
@@ -388,26 +371,15 @@ Remember that the order of the arguments does matter for these scripts.
388371# 07_CodeMLonDir.py
389372
390373 This script will run codeml on every file in a directory. It requires
391- the codeml.ctl file, and likely a tree file which it will supply to
392- codeml. It will overwrite the "seqfile", "treefile", "outfile" lines
393- include the paths to the input phylip file, the output file, and the
394- tree file. It will also call the ape R package to trim the tree file
395- so that it only includes species which have not been filtered out. If you
396- are running CodeML and the program is interrupted, you may invoke this
374+ the codeml.ctl file. It will overwrite the "seqfile", "treefile", and
375+ "outfile" lines to include the paths to the input phylip file, the tree
376+ file, and the output file. It will also call PhyML to create the tree file. If
377+ you are running CodeML and the program is interrupted, you may invoke this
397378 script to pick up where you left off.
398379
399- python 07_CodeMLonDir.py <path to codeml control file> \
400- <path to input and output directories>
401-
402- # 07_pruneTree.py
403-
404- This script will dynamically trim input trees for CodeML if any sequences
405- have been removed. Species whose sequences were removed in steps 4 or 5
406- and are no longer in the phylip alignment will be removed from
407- the temporary tree given to CodeML.
380+ python 07_CodeMLonDir.py -t <# of threads> -f <name of forward branch> \
381+ -i <path to input and output directories>
408382
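
The per-gene control file described above can be illustrated with a small Python sketch. It is a hypothetical stand-in rather than the code in 07_CodeMLonDir.py: the function name and paths are invented, and it assumes the template uses the usual "key = value" layout of a codeml control file.

```python
def fill_control_template(template_text, seqfile, treefile, outfile):
    """Return control-file text with the three path lines set for one gene.

    Existing "seqfile", "treefile", and "outfile" lines are overwritten;
    any that are missing from the template are added at the top.
    """
    replacements = {"seqfile": seqfile, "treefile": treefile, "outfile": outfile}
    seen = set()
    body = []
    for line in template_text.splitlines():
        # The key is whatever precedes the first "=" on the line
        key = line.split("=", 1)[0].strip()
        if key in replacements:
            body.append("{} = {}".format(key, replacements[key]))
            seen.add(key)
        else:
            body.append(line)
    header = [
        "{} = {}".format(key, replacements[key])
        for key in ("seqfile", "treefile", "outfile")
        if key not in seen
    ]
    return "\n".join(header + body) + "\n"
```

Calling this once per phylip file, with paths built from the gene name, would give each CodeML run its own control file while leaving the remaining parameters in the template untouched.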
409- python 07_pruneTree.py <path to input directory> \
410- <list of species remaining in alignment> <path to tmep output directory>
411383
412384# 08_compileKaKs.py
413385
@@ -441,7 +413,7 @@ attempt to concatenate specific parts from the output files.
441413
442414If you wish to convert the files to both formats, specify both --axt
443415and --phylip, and the program will convert the fasta files to both formats. You may
444- also run one of the individual scripts on the 07_rmStops directory to convert
416+ also run one of the individual scripts on the 05_rmStops directory to convert
445417the files in a separate step. AlignmentProcessor will not, however, run
446418KaKs_Calculator and CodeML simultaneously, as this could require too much
447419memory.
@@ -460,11 +432,11 @@ python AlignmentProcessor.py --axt --kaks --ucsc -r anoCar2 \
460432This will return a text file with 11 lines.
461433
462434# To test CodeML:
463- The test directory already contains sample CodeML control and tree files , so
435+ The test directory already contains a sample CodeML control file, so
464436all you need to do is change into the AlignmentProcessor directory and paste
465437the following:
466438
467439python AlignmentProcessor.py --phylip --codeml --ucsc -t 2 -r anoCar2 \
468- -i codemlTest.fa -o test/
440+ -f anoCar2 -i codemlTest.fa -o test/
469441
470442There should be 8 .mlc files in the 07_codeml directory.