1818# Python 3 version of Biopython
1919# Perl
2020# PAML
21+ # R
22+ # ape R package
2123###############################################################
2224
2325### Contents ###
3234# 0. Introduction
3335#-------------------------------
3436
35- AlignmentProcessor is a pipeline mean to quickly convert a multi-fasta
37+ AlignmentProcessor is a pipeline meant to quickly convert a multi-fasta
3638alignment file into a format that can be read by KaKs_Calculator or PAML and
37- optionally run those programs. Each python script can be run individually, or
38- you can run the AlignmentProcessor wrapper which will call all of the files.
39+ optionally run those programs. When running codeml, alignment processor will
40+ use input control and tree files as templates to create unique control and
41+ tree files for each gene. It will call the ape R package to dynamically trim
42+ the given phylogenic tree so that only species which remain in each gene's
43+ alignment after trimming are represented in that gene's tree.
44+
45+ You can run the AlignmentProcessor wrapper which will call all of the python
46+ scripts in sequence, or you may call each script individually.
3947
4048# Installing Biopython
4149
@@ -66,12 +74,26 @@ If you plan to use CodeML, you must first download PAML
6674(http://abacus.gene.ucl.ac.uk/software/paml.html) and move the folder into the
6775AlignmentProcessor directory. Make sure that it is titled "paml".
6876
77+ # Ape
78+
79+ The most straightforward way to install ape, and most R packages, is through
80+ Bioconductor. If you do not have Bioconductor installed, open R and paste:
81+
82+ source("https://bioconductor.org/biocLite.R")
83+ biocLite()
84+
85+ To install ape, enter:
86+
87+ library("BiocInstaller")
88+ biocLite("ape")
89+
90+
6991#-------------------------------
7092# 1. Obtaining a fasta alignment
7193#-------------------------------
7294# UCSC Fasta Alignment
7395It is possible to download CDS fasta alignments from the UCSC Table browser.
74- This does, unfortunately, limit you to the alignments that they have available.
96+ This does, unfortunately, limit you to currently available alignments .
7597If they do have the alignment that you are interested in, however, it will
7698probably be faster to download it rather than generate a new one. Since this
7799precludes the use of user-generated alignments, AlignmentProcessor has been
@@ -125,18 +147,16 @@ run the substitutions quickly, so everything can be run with one command. Each
125147script can be run individually if necessary (each script's function and
126148options will be discussed later). The input order for AlignmentProcessor's
127149command line does not matter, but it does matter for the individual scripts.
128- AlignmentProcessor also comes packaged with KaKs_Calculator2.0 and PAML4.8 for
129- downstream analysis.
130150
131151To execute the AlignmentProcessor pipeline, you must first change into
132152the package directory. Otherwise it will not be able to locate the scripts
133153in the bin/ directory.
134154
135155# Example Usage:
136156
137- python AlignmentProcessor.py -% <decimal> --retainStops \
138- --axt/phylip --kaks/codeml --ucsc -i <input fasta file> \
139- -o <path to output directory> -r <reference species>
157+ python AlignmentProcessor.py --ucsc --axt/phylip --kaks/codeml \
158+ --retainStops -% <decimal> -r <reference species> -i <input fasta file> \
159+ -o <path to output directory>
140160
141161# Required Arguments:
142162
@@ -148,7 +168,8 @@ in the bin/ directory.
148168
149169 --ucsc This will invoke 00_convertHeader.py, which will convert the
150170 headers from UCSC fasta files so they only contain build
151- names and gene IDs.
171+ names and gene IDs. This does not need to be run on Stich Gene
172+ Blocks output.
152173
153174 --axt/phylip Specifies which format to convert the files into
154175 (axt format for KaKs_Calculator; phylip for CodeML
@@ -163,13 +184,12 @@ in the bin/ directory.
163184 control file for CodeML which must be located in the
164185 output directory you specified with the the -o option.
165186 This file must be titled "codeml.ctl" (the default
166- name given by PAML. The name is hard coded into
167- AlignmentProcessor to simplify the commands).
187+ name given by PAML).
168188
169189 --retainStops This will tell AlignmentProcessor to retain sequences
170190 that contain internal stop codons. By default,
171191 sequences with internal stop codons will be removed
172- from the analysis.
192+ from the analysis as they may bias the results .
173193
174194 -% a decimal value specifying the minimum percentage of reads
175195 that must remain after replacing unknown codons with gaps
@@ -184,10 +204,11 @@ in the bin/ directory.
184204
185205 --printNameList will print the contents of 02_nameList.txt which
186206 contains the list of genome builds and associated
187- common names.
207+ common names. Must be run without any other arguments
188208
189209 --addNameToList will add an entry to the 02_nameList.txt file.
190- e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
210+ e.g. python AlignmentProcessor.py -- addNameToList \
211+ <build> <common name>
191212
192213
193214# Genome Builds and Common Names
@@ -201,7 +222,7 @@ e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
201222 new row, followed by a tab, then the desired common name. Make sure
202223 there are no spaces in either name.
203224
204- # The codeml control file
225+ # The CodeML control file
205226
206227 Codeml requires that all of its parameters be specified in one control
207228 file (http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf). Provide a
@@ -211,7 +232,20 @@ e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
211232 tree file for codeml (see PAML manual above).
212233
213234 The control file must be titled titled “codeml.ctl”, and it must be
214- locatedin the output directory.
235+ located in the output directory.
236+
237+ # The CodeML tree file
238+
239+ If your CodeML analysis requires a phylogneic tree, provide your
240+ desired tree, titled "codeml.tree" in the output directory. Specify
241+ the tree as you want to appear to CodeML and be sure that the species
242+ names are specified as the common names in the 02_nameList.txt file,
243+ but trimmed to ten characters (some programs still set a ten character
244+ limit on the length of names, so AlignmentProcessor trims the names).
245+ The 07_CodeMLonDir.py script will save any nodes you have specified
246+ before sending a plain Newick tree to ape (which will not work if
247+ there are PAML node symbols). It will then add any nodes back into the
248+ tree after it has been trimmed.
215249
216250# Invoking the Ka/Ks pipeline with a UCSC alignment:
217251
@@ -231,8 +265,8 @@ Each script performs one or two functions on the input file or files, and
231265saves the output to a new subdirectory. The location of the working directory
232266must be specified, but the subdirectories are hard-coded in the programs.
233267Be sure to include the trailing "/" in the path. The AlignmentProcessor
234- wrapper will add it, but, to avoid redundancies, the individual scripts
235- will not.
268+ wrapper will add it, but to avoid redundancies the individual scripts will
269+ not.
236270
237271Remember that the order of the arguments does matter for these scripts.
238272
@@ -246,17 +280,17 @@ Remember that the order of the arguments does matter for these scripts.
246280# 01_SplitFastaFiles.py
247281
248282 This script will split the input mult-fasta alignment into one file
249- per gene.
283+ per gene. It will produce an output file for a gene if it has at least
284+ two sequences.
250285
251- python 01_splitFastaFiles.py <# of species in alignment> \
252- <input fasta alignment> < path to output directory>
286+ python 01_splitFastaFiles.py <input fasta alignment> \
287+ <path to output directory>
253288
254289# 02_RemoveHeader.py
255290
256291 This program will read through a directory that contains
257- aligned multiple FASTA files and remove FASTA header and all
258- the trailing information for each gene and change all the
259- genome build names to common names.
292+ aligned multiple FASTA files and replace FASTA headers with each
293+ species' common name.
260294
261295 python 02_RemoveHeaderOnDir.py <path to inut and output directories>
262296
@@ -269,21 +303,20 @@ Remember that the order of the arguments does matter for these scripts.
269303 frame. It will then replaces codons with missing nucleotides with gaps
270304 to remove unknown amino acids from the sequence.
271305
272- python 03_CheckFrameOnDir.py <# of species > \
273- <path to inut and output directories> < reference_species>
306+ python 03_CheckFrameOnDir.py <path to inut and output directories > \
307+ <reference_species>
274308
275309# 04_CountBases.py
276310
277311 This program will check a multiple FASTA file to see that each species
278312 retains a certain percentage of its nucleotide sequence. If not, it
279313 will remove that sequence.
280314
281- Note: the AlignmentProcessor wrapper specifies 50% as adefault value,
315+ Note: the AlignmentProcessor wrapper specifies 50% as a default value,
282316 but the script itself does not, so you must specify one if you invoke
283317 it on its own.
284318
285- python 05_CountBasesOnDir.py <number of species> \
286- <threshold percentage as a decimal> \
319+ python 05_CountBasesOnDir.py <threshold percentage as a decimal> \
287320 <path to inut and output directories>
288321
289322# 05_ReplaceStopCodons.py
@@ -294,16 +327,18 @@ Remember that the order of the arguments does matter for these scripts.
294327 codon.
295328
296329 Terminal stop codons will be replaced, while sequences with internal
297- stop codons will have their gene id, as well sequence name and
298- location of the first occuring stop codon, recorded in the
299- internalStops.txt file.
330+ stop codons will have their gene id and sequence name recorded in the
331+ internalStops.txt file. If --retainStops is specified, these sequences
332+ will be retained. If it is not, these sequences will be removed and
333+ any which does not have at least two remaining sequences will not be
334+ written to file.
300335
301- python 05_ReplaceStopCodonsOnDir.py <number of species> \
302- <path to inut and output directories>
336+ python 05_ReplaceStopCodonsOnDir.py \
337+ <path to inut and output directories> --retainStops(optional)
303338
304339# 06_FASTAtoAXT.py
305340
306- This program executes 06_parseFastaIntoAXT.pl on an entire direcotry ,
341+ This program executes 06_parseFastaIntoAXT.pl on an entire directory ,
307342 allowing all of the contents of the directory to be converted to
308343 axt files.
309344
@@ -322,7 +357,7 @@ Remember that the order of the arguments does matter for these scripts.
322357
323358# 07_KaKsonDir.py
324359
325- This program executes KaKs_Calculator on every file in a direcotry .
360+ This program executes KaKs_Calculator on every file in a directory .
326361
327362 python 07_KaKsonDirectory.py <path to inut and output directories> \
328363 <name of refernce species>
@@ -331,10 +366,22 @@ Remember that the order of the arguments does matter for these scripts.
331366
332367 This script will run codeml on every file in a directory. It requires
333368 the codeml.ctl file, and likely a tree file which it will supply to
334- codeml.
369+ codeml. It will overwrite the "seqfile", "treefile", "outfile" lines
370+ include the paths to the input phylip file, the output file, and the
371+ tree file. It will also call the ape R package to trim the tree file
372+ so that it only includes species which have not been filtered out.
335373
336374 python 07_CodeMLonDir.py <path to codeml control file> \
337- <path to input and output directorie>
375+ <path to input and output directories>
376+
377+ 3 07_pruneTree.R
378+
379+ This R script will call the ape package to dynamically trim input
380+ trees for CodeML if any sequences have been removed. Species
381+ whose sequences were removed in steps 4 or 5 will be removed from
382+ the temporary tree given to CodeML.
383+
384+ (Caled by 07_CodeMLonDir.py)
338385
339386# 08_compileKaKs_CSV.py
340387
@@ -361,26 +408,37 @@ printed to the output directory that you specified with the -o option.
361408
362409If --phylip is specified, the program will convert the fasta files to phylip
363410after the stop codons have been removed. If you also specified --codeml,
364- AlignmentProcessor will edit and submit the control file to codeml and the
411+ AlignmentProcessor will edit and submit the control file to CodeML and the
365412output files will be saved in 07_codeml. Since there are many different things
366413that can be done with the codeml output files, AlignmentProcessor does not
367414attmept to concatenate specific parts from the output files.
368415
369- If you wish to convert convert the files to both formats, with or without the
370- --kaks option, specify both --axt and --phylip the program will convert the
371- fasta files to both formats. You may also run one of the individual scripts on
372- the 07_rmStops directory to convert the files in a separate step.
373- AlignmentProcessor will not, however, run KaKs_Calculator and codeml
374- simultaneously, as this could require too much memory.
416+ If you wish to convert convert the files to both formats, specify both --axt
417+ and --phylip the program will convert the fasta files to both formats. You may
418+ also run one of the individual scripts on the 07_rmStops directory to convert
419+ the files in a separate step. AlignmentProcessor will not, however, run
420+ KaKs_Calculator and CodeML simultaneously, as this could require too much
421+ memory.
375422
376423#-------------------------------
377424# 5. Run AlignmentProcessor on test data
378425#-------------------------------
379426
427+ # To test KaKs_Calculator:
380428Change directory into the AlignmentProcessor folder. Paste the followig into
381- a terminal (change the output directory to the desired loaction) :
429+ a terminal:
382430
383431python AlignmentProcessor.py --axt --kaks --ucsc -r Green_anole \
384- -i test .fa -o test/
432+ -i kaksTest .fa -o test/
385433
386434This will return a text file with 11 lines.
435+
436+ # To test CodeML:
437+ The test directory already contains sample CodeML control and tree files, so
438+ all you need to do is change into the AlignmentProcessor direcotry and paste
439+ the following:
440+
441+ python AlignmentProcessor.py --phylip --codeml --ucsc -r Green_anole \
442+ -i codemlTest.fa -o test/
443+
444+ There should be 8 .mlc files in the 07_codeml directory.
0 commit comments