Skip to content

Commit bf800ea

Browse files
committed
updated, production version
1 parent 503e336 commit bf800ea

24 files changed

Lines changed: 1014 additions & 269 deletions

AlignmentProcessor.py

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ def makeDir(path, outdir, axt, phylip):
6969
'''Makes all sub-directories used by program.'''
7070
os.chdir(outdir)
7171
for i in ["01_splitFastaFiles", "02_rmHeader", "03_checkFrame",
72-
"04_countBasesPercent", "05_ReplaceStopCodons"]:
72+
"04_countBasesPercent", "05_ReplaceStopCodons", "Logs"]:
7373
try:
7474
os.mkdir(i)
7575
except FileExistsError:
@@ -142,8 +142,12 @@ def countBases(outdir, percent):
142142
def replaceStop(outdir, retainStops):
143143
'''Removes stop codons for downstream analysis.'''
144144
print("Removing stop codons...")
145-
rs = Popen(split("python bin/05_ReplaceStopCodons.py " + " "
146-
+ outdir + " " + str(retainStops)))
145+
if retainStops == False:
146+
rs = Popen(split("python bin/05_ReplaceStopCodons.py " + " "
147+
+ outdir))
148+
elif retainStops == True:
149+
rs = Popen(split("python bin/05_ReplaceStopCodons.py " + " "
150+
+ outdir + " retainStops"))
147151
rs.wait()
148152
if rs.returncode == 0:
149153
return True
@@ -192,8 +196,7 @@ def phylipConvert(outdir, starttime, codeml):
192196
def runcodeml(outdir, starttime):
193197
'''Runs codeml on a directory.'''
194198
print("Running codeml...")
195-
cm = Popen(split("python bin/07_CodeMLonDir.py " + outdir + "codeml.ctl "
196-
+ outdir))
199+
cm = Popen(split("python bin/07_CodeMLonDir.py " + outdir))
197200
cm.wait()
198201
if cm.returncode == 0:
199202
elapsedtime = datetime.now() - starttime
@@ -207,6 +210,10 @@ def helplist():
207210
print(" example usage: python AlignmentProcessor.py -% <decimal> \
208211
--axt/phylip --kaks/codeml -i <input fasta file> -o \
209212
<path to output directory> -r <reference species>")
213+
print(" --printNameList prints list of genome build names and\
214+
associated common names")
215+
print(" --addNameToList add new genome build name and\
216+
associated common name to list")
210217
print(" --axt Converts files to axt for use in KaKs_Calcuator.")
211218
print(" --phylip Converts files to phylip for use in PhyML.")
212219
print(" --kaks Runs KaKs_Calcuator if --axt is also specified")
@@ -236,6 +243,7 @@ def main():
236243
codeml = False
237244
conv = False
238245
percent = "0.5"
246+
ref = "void"
239247
# Extract values from command line:
240248
for i in argv:
241249
if i == "-h" or i == "--help":
@@ -274,6 +282,11 @@ def main():
274282
elif i == "--retainStops":
275283
retainStops = True
276284
# Check inout commands prior to running:
285+
if ref == "void":
286+
print()
287+
print("\tPlease pecify a reference species.")
288+
print()
289+
quit()
277290
checkInput(axt, kaks, phylip, codeml)
278291
# Save working directory to variable to call other scripts:
279292
path = os.getcwd()

AlignmentProcessorReadMe.txt

Lines changed: 106 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,8 @@
1818
# Python 3 version of Biopython
1919
# Perl
2020
# PAML
21+
# R
22+
# ape R package
2123
###############################################################
2224

2325
### Contents ###
@@ -32,10 +34,16 @@
3234
# 0. Introduction
3335
#-------------------------------
3436

35-
AlignmentProcessor is a pipeline mean to quickly convert a multi-fasta
37+
AlignmentProcessor is a pipeline meant to quickly convert a multi-fasta
3638
alignment file into a format that can be read by KaKs_Calculator or PAML and
37-
optionally run those programs. Each python script can be run individually, or
38-
you can run the AlignmentProcessor wrapper which will call all of the files.
39+
optionally run those programs. When running codeml, alignment processor will
40+
use input control and tree files as templates to create unique control and
41+
tree files for each gene. It will call the ape R package to dynamically trim
42+
the given phylogenic tree so that only species which remain in each gene's
43+
alignment after trimming are represented in that gene's tree.
44+
45+
You can run the AlignmentProcessor wrapper which will call all of the python
46+
scripts in sequence, or you may call each script individually.
3947

4048
# Installing Biopython
4149

@@ -66,12 +74,26 @@ If you plan to use CodeML, you must first download PAML
6674
(http://abacus.gene.ucl.ac.uk/software/paml.html) and move the folder into the
6775
AlignmentProcessor directory. Make sure that it is titled "paml".
6876

77+
# Ape
78+
79+
The most straightforward way to install ape, and most R packages, is through
80+
Bioconductor. If you do not have Bioconductor installed, open R and paste:
81+
82+
source("https://bioconductor.org/biocLite.R")
83+
biocLite()
84+
85+
To install ape, enter:
86+
87+
library("BiocInstaller")
88+
biocLite("ape")
89+
90+
6991
#-------------------------------
7092
# 1. Obtaining a fasta alignment
7193
#-------------------------------
7294
# UCSC Fasta Alignment
7395
It is possible to download CDS fasta alignments from the UCSC Table browser.
74-
This does, unfortunately, limit you to the alignments that they have available.
96+
This does, unfortunately, limit you to currently available alignments.
7597
If they do have the alignment that you are interested in, however, it will
7698
probably be faster to download it rather than generate a new one. Since this
7799
precludes the use of user-generated alignments, AlignmentProcessor has been
@@ -125,18 +147,16 @@ run the substitutions quickly, so everything can be run with one command. Each
125147
script can be run individually if necessary (each script's function and
126148
options will be discussed later). The input order for AlignmentProcessor's
127149
command line does not matter, but it does matter for the individual scripts.
128-
AlignmentProcessor also comes packaged with KaKs_Calculator2.0 and PAML4.8 for
129-
downstream analysis.
130150

131151
To execute the AlignmentProcessor pipeline, you must first change into
132152
the package directory. Otherwise it will not be able to locate the scripts
133153
in the bin/ directory.
134154

135155
# Example Usage:
136156

137-
python AlignmentProcessor.py -% <decimal> --retainStops \
138-
--axt/phylip --kaks/codeml --ucsc -i <input fasta file> \
139-
-o <path to output directory> -r <reference species>
157+
python AlignmentProcessor.py --ucsc --axt/phylip --kaks/codeml \
158+
--retainStops -% <decimal> -r <reference species> -i <input fasta file> \
159+
-o <path to output directory>
140160

141161
# Required Arguments:
142162

@@ -148,7 +168,8 @@ in the bin/ directory.
148168

149169
--ucsc This will invoke 00_convertHeader.py, which will convert the
150170
headers from UCSC fasta files so they only contain build
151-
names and gene IDs.
171+
names and gene IDs. This does not need to be run on Stich Gene
172+
Blocks output.
152173

153174
--axt/phylip Specifies which format to convert the files into
154175
(axt format for KaKs_Calculator; phylip for CodeML
@@ -163,13 +184,12 @@ in the bin/ directory.
163184
control file for CodeML which must be located in the
164185
output directory you specified with the the -o option.
165186
This file must be titled "codeml.ctl" (the default
166-
name given by PAML. The name is hard coded into
167-
AlignmentProcessor to simplify the commands).
187+
name given by PAML).
168188

169189
--retainStops This will tell AlignmentProcessor to retain sequences
170190
that contain internal stop codons. By default,
171191
sequences with internal stop codons will be removed
172-
from the analysis.
192+
from the analysis as they may bias the results.
173193

174194
-% a decimal value specifying the minimum percentage of reads
175195
that must remain after replacing unknown codons with gaps
@@ -184,10 +204,11 @@ in the bin/ directory.
184204

185205
--printNameList will print the contents of 02_nameList.txt which
186206
contains the list of genome builds and associated
187-
common names.
207+
common names. Must be run without any other arguments
188208

189209
--addNameToList will add an entry to the 02_nameList.txt file.
190-
e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
210+
e.g. python AlignmentProcessor.py -- addNameToList \
211+
<build> <common name>
191212

192213

193214
# Genome Builds and Common Names
@@ -201,7 +222,7 @@ e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
201222
new row, followed by a tab, then the desired common name. Make sure
202223
there are no spaces in either name.
203224

204-
# The codeml control file
225+
# The CodeML control file
205226

206227
Codeml requires that all of its parameters be specified in one control
207228
file (http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf). Provide a
@@ -211,7 +232,20 @@ e.g. python AlignmentProcessor.py -- addNameToList <build> <common name>
211232
tree file for codeml (see PAML manual above).
212233

213234
The control file must be titled titled “codeml.ctl”, and it must be
214-
locatedin the output directory.
235+
located in the output directory.
236+
237+
# The CodeML tree file
238+
239+
If your CodeML analysis requires a phylogneic tree, provide your
240+
desired tree, titled "codeml.tree" in the output directory. Specify
241+
the tree as you want to appear to CodeML and be sure that the species
242+
names are specified as the common names in the 02_nameList.txt file,
243+
but trimmed to ten characters (some programs still set a ten character
244+
limit on the length of names, so AlignmentProcessor trims the names).
245+
The 07_CodeMLonDir.py script will save any nodes you have specified
246+
before sending a plain Newick tree to ape (which will not work if
247+
there are PAML node symbols). It will then add any nodes back into the
248+
tree after it has been trimmed.
215249

216250
# Invoking the Ka/Ks pipeline with a UCSC alignment:
217251

@@ -231,8 +265,8 @@ Each script performs one or two functions on the input file or files, and
231265
saves the output to a new subdirectory. The location of the working directory
232266
must be specified, but the subdirectories are hard-coded in the programs.
233267
Be sure to include the trailing "/" in the path. The AlignmentProcessor
234-
wrapper will add it, but, to avoid redundancies, the individual scripts
235-
will not.
268+
wrapper will add it, but to avoid redundancies the individual scripts will
269+
not.
236270

237271
Remember that the order of the arguments does matter for these scripts.
238272

@@ -246,17 +280,17 @@ Remember that the order of the arguments does matter for these scripts.
246280
# 01_SplitFastaFiles.py
247281

248282
This script will split the input mult-fasta alignment into one file
249-
per gene.
283+
per gene. It will produce an output file for a gene if it has at least
284+
two sequences.
250285

251-
python 01_splitFastaFiles.py <# of species in alignment> \
252-
<input fasta alignment> <path to output directory>
286+
python 01_splitFastaFiles.py <input fasta alignment> \
287+
<path to output directory>
253288

254289
# 02_RemoveHeader.py
255290

256291
This program will read through a directory that contains
257-
aligned multiple FASTA files and remove FASTA header and all
258-
the trailing information for each gene and change all the
259-
genome build names to common names.
292+
aligned multiple FASTA files and replace FASTA headers with each
293+
species' common name.
260294

261295
python 02_RemoveHeaderOnDir.py <path to inut and output directories>
262296

@@ -269,21 +303,20 @@ Remember that the order of the arguments does matter for these scripts.
269303
frame. It will then replaces codons with missing nucleotides with gaps
270304
to remove unknown amino acids from the sequence.
271305

272-
python 03_CheckFrameOnDir.py <# of species> \
273-
<path to inut and output directories> <reference_species>
306+
python 03_CheckFrameOnDir.py <path to inut and output directories> \
307+
<reference_species>
274308

275309
# 04_CountBases.py
276310

277311
This program will check a multiple FASTA file to see that each species
278312
retains a certain percentage of its nucleotide sequence. If not, it
279313
will remove that sequence.
280314

281-
Note: the AlignmentProcessor wrapper specifies 50% as adefault value,
315+
Note: the AlignmentProcessor wrapper specifies 50% as a default value,
282316
but the script itself does not, so you must specify one if you invoke
283317
it on its own.
284318

285-
python 05_CountBasesOnDir.py <number of species> \
286-
<threshold percentage as a decimal> \
319+
python 05_CountBasesOnDir.py <threshold percentage as a decimal> \
287320
<path to inut and output directories>
288321

289322
# 05_ReplaceStopCodons.py
@@ -294,16 +327,18 @@ Remember that the order of the arguments does matter for these scripts.
294327
codon.
295328

296329
Terminal stop codons will be replaced, while sequences with internal
297-
stop codons will have their gene id, as well sequence name and
298-
location of the first occuring stop codon, recorded in the
299-
internalStops.txt file.
330+
stop codons will have their gene id and sequence name recorded in the
331+
internalStops.txt file. If --retainStops is specified, these sequences
332+
will be retained. If it is not, these sequences will be removed and
333+
any which does not have at least two remaining sequences will not be
334+
written to file.
300335

301-
python 05_ReplaceStopCodonsOnDir.py <number of species> \
302-
<path to inut and output directories>
336+
python 05_ReplaceStopCodonsOnDir.py \
337+
<path to inut and output directories> --retainStops(optional)
303338

304339
# 06_FASTAtoAXT.py
305340

306-
This program executes 06_parseFastaIntoAXT.pl on an entire direcotry,
341+
This program executes 06_parseFastaIntoAXT.pl on an entire directory,
307342
allowing all of the contents of the directory to be converted to
308343
axt files.
309344

@@ -322,7 +357,7 @@ Remember that the order of the arguments does matter for these scripts.
322357

323358
# 07_KaKsonDir.py
324359

325-
This program executes KaKs_Calculator on every file in a direcotry.
360+
This program executes KaKs_Calculator on every file in a directory.
326361

327362
python 07_KaKsonDirectory.py <path to inut and output directories> \
328363
<name of refernce species>
@@ -331,10 +366,22 @@ Remember that the order of the arguments does matter for these scripts.
331366

332367
This script will run codeml on every file in a directory. It requires
333368
the codeml.ctl file, and likely a tree file which it will supply to
334-
codeml.
369+
codeml. It will overwrite the "seqfile", "treefile", "outfile" lines
370+
include the paths to the input phylip file, the output file, and the
371+
tree file. It will also call the ape R package to trim the tree file
372+
so that it only includes species which have not been filtered out.
335373

336374
python 07_CodeMLonDir.py <path to codeml control file> \
337-
<path to input and output directorie>
375+
<path to input and output directories>
376+
377+
3 07_pruneTree.R
378+
379+
This R script will call the ape package to dynamically trim input
380+
trees for CodeML if any sequences have been removed. Species
381+
whose sequences were removed in steps 4 or 5 will be removed from
382+
the temporary tree given to CodeML.
383+
384+
(Caled by 07_CodeMLonDir.py)
338385

339386
# 08_compileKaKs_CSV.py
340387

@@ -361,26 +408,37 @@ printed to the output directory that you specified with the -o option.
361408

362409
If --phylip is specified, the program will convert the fasta files to phylip
363410
after the stop codons have been removed. If you also specified --codeml,
364-
AlignmentProcessor will edit and submit the control file to codeml and the
411+
AlignmentProcessor will edit and submit the control file to CodeML and the
365412
output files will be saved in 07_codeml. Since there are many different things
366413
that can be done with the codeml output files, AlignmentProcessor does not
367414
attmept to concatenate specific parts from the output files.
368415

369-
If you wish to convert convert the files to both formats, with or without the
370-
--kaks option, specify both --axt and --phylip the program will convert the
371-
fasta files to both formats. You may also run one of the individual scripts on
372-
the 07_rmStops directory to convert the files in a separate step.
373-
AlignmentProcessor will not, however, run KaKs_Calculator and codeml
374-
simultaneously, as this could require too much memory.
416+
If you wish to convert convert the files to both formats, specify both --axt
417+
and --phylip the program will convert the fasta files to both formats. You may
418+
also run one of the individual scripts on the 07_rmStops directory to convert
419+
the files in a separate step. AlignmentProcessor will not, however, run
420+
KaKs_Calculator and CodeML simultaneously, as this could require too much
421+
memory.
375422

376423
#-------------------------------
377424
# 5. Run AlignmentProcessor on test data
378425
#-------------------------------
379426

427+
# To test KaKs_Calculator:
380428
Change directory into the AlignmentProcessor folder. Paste the followig into
381-
a terminal (change the output directory to the desired loaction):
429+
a terminal:
382430

383431
python AlignmentProcessor.py --axt --kaks --ucsc -r Green_anole \
384-
-i test.fa -o test/
432+
-i kaksTest.fa -o test/
385433

386434
This will return a text file with 11 lines.
435+
436+
# To test CodeML:
437+
The test directory already contains sample CodeML control and tree files, so
438+
all you need to do is change into the AlignmentProcessor direcotry and paste
439+
the following:
440+
441+
python AlignmentProcessor.py --phylip --codeml --ucsc -r Green_anole \
442+
-i codemlTest.fa -o test/
443+
444+
There should be 8 .mlc files in the 07_codeml directory.

0 commit comments

Comments
 (0)