1212
1313
1414###############################################################
15- # AlignmentProcessor0.8 Package
15+ # AlignmentProcessor0.11 Package
1616#
1717# Dependencies: Python 3
1818# Python 3 version of Biopython
19- # Perl
20- # PAML
21- # R
22- # ape R package
19+ # Perl (if using KaKs_Calculator)
20+ # PAML (if using CodeML)
21+ # R (if using CodeML)
22+ # ape R package (if using CodeML)
2323###############################################################
2424
2525### Contents ###
@@ -63,10 +63,10 @@ a terminal and Anaconda will install Biopython for you:
6363
6464# KaKs_Calculator
6565
66- AlignmentProcessor0.8 is packaged with KaKs_Calculator2.0 binaries for Linux
66+ AlignmentProcessor0.11 is packaged with KaKs_Calculator2.0 binaries for Linux
6767and Windows, and a KaKs_Calculator1.2 binary for Mac (there is no 2.0 binary
6868available for OSX). Before using, copy or move the appropriate binary for your
69- system into the AlignmentProcessor bin which contains the python scipts .
69+ system into the AlignmentProcessor bin which contains the python scripts .
7070
7171# PAML 4.8
7272
@@ -97,14 +97,14 @@ This does, unfortunately, limit you to currently available alignments.
9797If they do have the alignment that you are interested in, however, it will
9898probably be faster to download it rather than generate a new one. Since this
9999precludes the use of user-generated alignments, AlignmentProcessor has been
100- written for Galaxy's Stitch Gene Blocks ouput . If you choose to use UCSC
100+ written for Galaxy's Stitch Gene Blocks output . If you choose to use UCSC
101101alignments, the sequence headers will have to be converted using the --ucsc
102102option.
103103
104104# User Generated Alignments
105105Since most alignments are in maf format, you will have to convert your
106106alignment from maf to fasta. There seem to be very few programs that can do
107- this; fortunately Galaxy's Stich Gene blocks not only converts a maf to fasta,
107+ this; fortunately Galaxy's Stitch Gene blocks not only converts a maf to fasta,
108108but it also separates sequences by genes, which is something we need to do
109109anyway.
110110
@@ -123,9 +123,9 @@ the Get Data tab on the left of the screen). Select your species from the UCSC
123123table browser, select Genes and Gene Predictions, ensembl genes, and the whole
124124genome. Select BED as the output format and check the send to Galaxy box.
125125
126- Once you have your data uploaded to Galaxy, select the Stich Gene Blocks tool
126+ Once you have your data uploaded to Galaxy, select the Stitch Gene Blocks tool
127127under the Fetch Alignments/Sequences tab. Select your reference species'
128- genome BED file in the Gene BED file dropdown menu. Change MAF Sourse to the
128+ genome BED file in the Gene BED file drop down menu. Change MAF Source to the
129129maf file you uploaded (you may also use a locally catched alignment if it is
130130available for all of your species of interest). Select the desired species
131131IDs, leave Split into Gapless MAF Blocks to "no" (we will deal with gaps
@@ -135,14 +135,14 @@ output file.
135135This process will take a few hours, so plan accordingly.
136136
137137Since this method offer the most flexibility for working with alignments,
138- AlignmentProcessor was wrtien with this output format in mind and no further
138+ AlignmentProcessor was written with this output format in mind and no further
139139formatting is required.
140140
141141#-------------------------------
142142# 2. Running AlignmentProcessor
143143#-------------------------------
144144
145- AlignmentProcessor is designed to convert the file into a useable format and
145+ AlignmentProcessor is designed to convert the file into a usable format and
146146run the substitutions quickly, so everything can be run with one command. Each
147147script can be run individually if necessary (each script's function and
148148options will be discussed later). The input order for AlignmentProcessor's
@@ -152,6 +152,14 @@ To execute the AlignmentProcessor pipeline, you must first change into
152152the package directory. Otherwise it will not be able to locate the scripts
153153in the bin/ directory.
154154
155+ If you are running CodeML and the program is interrupted, you may call the
156+ 07_CodeMLonDir.py script and it will continue where CodeML left off. This
157+ will save the time of having to run those genes through CodeML again (It will
158+ do the same thing if you call the entire pipeline again, but there is no need
159+ to rerun the previous steps). It will not do the same for KaKs_Calculator
160+ since KaKs_Calculator is much faster, so it should not be a problem to just
161+ invoke KaKs_Calculator on the whole directory again.
162+
155163# Example Usage:
156164
157165 python AlignmentProcessor.py --ucsc --axt/phylip --kaks/codeml \
@@ -168,7 +176,7 @@ in the bin/ directory.
168176
169177 --ucsc This will invoke 00_convertHeader.py, which will convert the
170178 headers from UCSC fasta files so they only contain build
171- names and gene IDs. This does not need to be run on Stich Gene
179+ names and gene IDs. This does not need to be run on Stitch Gene
172180 Blocks output.
173181
174182 --retainStops This will tell AlignmentProcessor to retain sequences
@@ -177,7 +185,7 @@ in the bin/ directory.
177185 from the analysis as they may bias the results.
178186
179187 --changeNames Tells the program to change genome build names to
180- commom names (more below).
188+ common names (more below).
181189
182190 -% a decimal value specifying the minimum percentage of reads
183191 that must remain after replacing unknown codons with gaps
@@ -201,7 +209,7 @@ in the bin/ directory.
201209 This file must be titled "codeml.ctl" (the default
202210 name given by PAML).
203211
204- -n if "--codeml" is selected, you may specify the number of CPUs
212+ -t if "--codeml" is selected, you may specify the number of CPUs
205213 to run CodeML. CodeML itself cannot be parallelized, but
206214 AlignmentProcessor can call multiple instances of CodeML to
207215 shorten overall run time. (Default = 1)
@@ -210,7 +218,7 @@ in the bin/ directory.
210218
211219 -h/--help will print the program's help dialogue
212220
213- -v/--version will print the program version and copywright info
221+ -v/--version will print the program version and copyright info
214222
215223 --printNameList will print the contents of 02_nameList.txt which
216224 contains the list of genome builds and associated
@@ -237,16 +245,19 @@ in the bin/ directory.
237245 Codeml requires that all of its parameters be specified in one control
238246 file (http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf). Provide a
239247 control file with your desired parameters and AlignmentProcessor will
240- use it as template. It will only alter the input and output files so
241- that they are unique for each file. You may also need to provide a
242- tree file for codeml (see PAML manual above).
248+ use it as template. AlignmentProcessor uses the Biopython codeml module
249+ to more easily edit the names of the input and output files. Because of
250+ this, however, you must make sure that the first three lines (starting
251+ with "seqfile", "treefile", or "outfile" have been removed from the
252+ control file. Otherwise, CodeML will presented only with the data
253+ contained in the control file, and not the whole directory of alingments.
243254
244255 The control file must be titled titled “codeml.ctl”, and it must be
245256 located in the output directory.
246257
247258# The CodeML tree file
248259
249- If your CodeML analysis requires a phylogneic tree, provide your
260+ If your CodeML analysis requires a phylogenic tree, provide your
250261 desired tree, titled "codeml.tree" in the output directory. Specify
251262 the tree as you want to appear to CodeML and be sure that the species
252263 names are specified as the common names in the 02_nameList.txt file,
@@ -261,13 +272,13 @@ in the bin/ directory.
261272
262273# Invoking the Ka/Ks pipeline with a UCSC alignment:
263274
264- python AlignmentProcessor0.8 .py --axt --kaks --ucsc -r green_anole \
275+ python AlignmentProcessor0.11 .py --axt --kaks --ucsc -r anoCar2 \
265276 -i anolis_gallus.fa -o pairwiseKaKs/
266277
267278# Invoking the CodeML pipeline with a de novo alignment:
268279
269- python AlignmentProcessor0.8 .py --phylip --codeml -% 0.6 \
270- -r green_anole -i anolis_gallus.fa -o codemlOutput/
280+ python AlignmentProcessor0.11 .py --phylip --codeml -% 0.6 \
281+ -r anoCar2 -i anolis_gallus.fa -o codemlOutput/
271282
272283#-------------------------------
273284# 3. Individual Scripts
@@ -291,7 +302,7 @@ Remember that the order of the arguments does matter for these scripts.
291302
292303# 01_SplitFastaFiles.py
293304
294- This script will split the input mult -fasta alignment into one file
305+ This script will split the input multi -fasta alignment into one file
295306 per gene. It will produce an output file for a gene if it has at least
296307 two sequences.
297308
@@ -304,18 +315,18 @@ Remember that the order of the arguments does matter for these scripts.
304315 aligned multiple FASTA files and replace FASTA headers with each
305316 species' common name.
306317
307- python 02_RemoveHeaderOnDir.py <path to inut and output directories>
318+ python 02_RemoveHeaderOnDir.py <path to input and output directories>
308319
309320# 03_CheckFrame.py
310321
311322 This script removes gaps introduced in the reference sequence by the
312323 alignment and removes corresponding sites in other species. It assumes
313- that the reference sequnce was in frame before any gaps were inserted,
324+ that the reference sequence was in frame before any gaps were inserted,
314325 and it returns the reference sequence to its original open reading
315326 frame. It will then replaces codons with missing nucleotides with gaps
316327 to remove unknown amino acids from the sequence.
317328
318- python 03_CheckFrameOnDir.py <path to inut and output directories> \
329+ python 03_CheckFrameOnDir.py <path to input and output directories> \
319330 <reference_species>
320331
321332# 04_CountBases.py
@@ -329,13 +340,13 @@ Remember that the order of the arguments does matter for these scripts.
329340 it on its own.
330341
331342 python 05_CountBasesOnDir.py <threshold percentage as a decimal> \
332- <path to inut and output directories>
343+ <path to input and output directories>
333344
334345# 05_ReplaceStopCodons.py
335346
336347 This program will remove the internal stop codons (TAA, TAG, TGA)
337348 and replace with gaps (---) from the nucleotide alignment. Some
338- programs will not run properly if they enounter a premature stop
349+ programs will not run properly if they encounter a premature stop
339350 codon.
340351
341352 Terminal stop codons will be replaced, while sequences with internal
@@ -346,7 +357,7 @@ Remember that the order of the arguments does matter for these scripts.
346357 written to file.
347358
348359 python 05_ReplaceStopCodonsOnDir.py \
349- <path to inut and output directories> --retainStops(optional)
360+ <path to input and output directories> --retainStops(optional)
350361
351362# 06_FASTAtoAXT.py
352363
@@ -355,24 +366,24 @@ Remember that the order of the arguments does matter for these scripts.
355366 axt files.
356367
357368 Note: parseFastaIntoAXT.pl was provided by the developers of
358- KaKs_Calculator and, as such, is the only perl script in this package.
369+ KaKs_Calculator and, as such, is the only Perl script in this package.
359370
360- python FASTAtoAXTonDirectory.py <path to inut and output directories>
371+ python FASTAtoAXTonDirectory.py <path to input and output directories>
361372
362373# 06_FASTAtoPhylip.py
363374
364375 This program will convert all files in an input directory
365376 from fasta format to a phylip format.
366377
367378 python 07_FASTAtoPhylip.py <number of species> \
368- <path to inut and output directories>
379+ <path to input and output directories>
369380
370381# 07_KaKsonDir.py
371382
372383 This program executes KaKs_Calculator on every file in a directory.
373384
374- python 07_KaKsonDirectory.py <path to inut and output directories> \
375- <name of refernce species>
385+ python 07_KaKsonDirectory.py <path to input and output directories> \
386+ <name of reference species>
376387
377388# 07_CodeMLonDir.py
378389
@@ -381,28 +392,30 @@ Remember that the order of the arguments does matter for these scripts.
381392 codeml. It will overwrite the "seqfile", "treefile", "outfile" lines
382393 include the paths to the input phylip file, the output file, and the
383394 tree file. It will also call the ape R package to trim the tree file
384- so that it only includes species which have not been filtered out.
395+ so that it only includes species which have not been filtered out. If you
396+ are running CodeML and the program is interrupted, you may invoke this
397+ script to pick up where you left off.
385398
386399 python 07_CodeMLonDir.py <path to codeml control file> \
387- <path to input and output directories> \
388- --retainStops(optional)
400+ <path to input and output directories>
389401
390- 3 07_pruneTree.R
402+ # 07_pruneTree.py
391403
392- This R script will call the ape package to dynamically trim input
393- trees for CodeML if any sequences have been removed. Species
394- whose sequences were removed in steps 4 or 5 will be removed from
404+ This script will dynamically trim input trees for CodeML if any sequences
405+ have been removed. Species whose sequences were removed in steps 4 or 5
406+ and are no longer in the phylip alignment will be removed from
395407 the temporary tree given to CodeML.
396408
397- (Caled by 07_CodeMLonDir.py)
409+ python 07_pruneTree.py <path to input directory> \
410+ <list of species remaining in alignment> <path to tmep output directory>
398411
399- # 08_compileKaKs_CSV .py
412+ # 08_compileKaKs .py
400413
401414 This script concatonates the output from KaKs_Calculator into a text
402415 file. It adds a column for gene (or sequence) IDs, and prints the gene
403416 ID from the filename.
404417
405- python compileCSV.py <path to inut and output directories>
418+ python compileCSV.py <path to input and output directories>
406419
407420#-------------------------------
408421# 4. Outputs
@@ -424,7 +437,7 @@ after the stop codons have been removed. If you also specified --codeml,
424437AlignmentProcessor will edit and submit the control file to CodeML and the
425438output files will be saved in 07_codeml. Since there are many different things
426439that can be done with the codeml output files, AlignmentProcessor does not
427- attmept to concatenate specific parts from the output files.
440+ attempt to concatenate specific parts from the output files.
428441
429442If you wish to convert convert the files to both formats, specify both --axt
430443and --phylip the program will convert the fasta files to both formats. You may
@@ -438,7 +451,7 @@ memory.
438451#-------------------------------
439452
440453# To test KaKs_Calculator:
441- Change directory into the AlignmentProcessor folder. Paste the followig into
454+ Change directory into the AlignmentProcessor folder. Paste the following into
442455a terminal:
443456
444457python AlignmentProcessor.py --axt --kaks --ucsc -r anoCar2 \
@@ -448,10 +461,10 @@ This will return a text file with 11 lines.
448461
449462# To test CodeML:
450463The test directory already contains sample CodeML control and tree files, so
451- all you need to do is change into the AlignmentProcessor direcotry and paste
464+ all you need to do is change into the AlignmentProcessor directory and paste
452465the following:
453466
454- python AlignmentProcessor.py --phylip --codeml --ucsc -n 2 -r anoCar2 \
467+ python AlignmentProcessor.py --phylip --codeml --ucsc -t 2 -r anoCar2 \
455468-i codemlTest.fa -o test/
456469
457470There should be 8 .mlc files in the 07_codeml directory.
0 commit comments