Skip to content

Commit e34dc24

Browse files
committed
Streamlined code
1 parent c7ee5bc commit e34dc24

9 files changed

Lines changed: 805 additions & 554 deletions

AlignmentProcessor.py

Lines changed: 282 additions & 302 deletions
Large diffs are not rendered by default.

AlignmentProcessorReadMe.txt

Lines changed: 63 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -12,14 +12,14 @@
1212

1313

1414
###############################################################
15-
# AlignmentProcessor0.8 Package
15+
# AlignmentProcessor0.11 Package
1616
#
1717
# Dependencies: Python 3
1818
# Python 3 version of Biopython
19-
# Perl
20-
# PAML
21-
# R
22-
# ape R package
19+
# Perl (if using KaKs_Calculator)
20+
# PAML (if using CodeML)
21+
# R (if using CodeML)
22+
# ape R package (if using CodeML)
2323
###############################################################
2424

2525
### Contents ###
@@ -63,10 +63,10 @@ a terminal and Anaconda will install Biopython for you:
6363

6464
# KaKs_Calculator
6565

66-
AlignmentProcessor0.8 is packaged with KaKs_Calculator2.0 binaries for Linux
66+
AlignmentProcessor0.11 is packaged with KaKs_Calculator2.0 binaries for Linux
6767
and Windows, and a KaKs_Calculator1.2 binary for Mac (there is no 2.0 binary
6868
available for OSX). Before using, copy or move the appropriate binary for your
69-
system into the AlignmentProcessor bin which contains the python scipts.
69+
system into the AlignmentProcessor bin which contains the python scripts.
7070

7171
# PAML 4.8
7272

@@ -97,14 +97,14 @@ This does, unfortunately, limit you to currently available alignments.
9797
If they do have the alignment that you are interested in, however, it will
9898
probably be faster to download it rather than generate a new one. Since this
9999
precludes the use of user-generated alignments, AlignmentProcessor has been
100-
written for Galaxy's Stitch Gene Blocks ouput. If you choose to use UCSC
100+
written for Galaxy's Stitch Gene Blocks output. If you choose to use UCSC
101101
alignments, the sequence headers will have to be converted using the --ucsc
102102
option.
103103

104104
# User Generated Alignments
105105
Since most alignments are in maf format, you will have to convert your
106106
alignment from maf to fasta. There seem to be very few programs that can do
107-
this; fortunately Galaxy's Stich Gene blocks not only converts a maf to fasta,
107+
this; fortunately Galaxy's Stitch Gene blocks not only converts a maf to fasta,
108108
but it also separates sequences by genes, which is something we need to do
109109
anyway.
110110

@@ -123,9 +123,9 @@ the Get Data tab on the left of the screen). Select your species from the UCSC
123123
table browser, select Genes and Gene Predictions, ensembl genes, and the whole
124124
genome. Select BED as the output format and check the send to Galaxy box.
125125

126-
Once you have your data uploaded to Galaxy, select the Stich Gene Blocks tool
126+
Once you have your data uploaded to Galaxy, select the Stitch Gene Blocks tool
127127
under the Fetch Alignments/Sequences tab. Select your reference species'
128-
genome BED file in the Gene BED file dropdown menu. Change MAF Sourse to the
128+
genome BED file in the Gene BED file drop down menu. Change MAF Source to the
129129
maf file you uploaded (you may also use a locally catched alignment if it is
130130
available for all of your species of interest). Select the desired species
131131
IDs, leave Split into Gapless MAF Blocks to "no" (we will deal with gaps
@@ -135,14 +135,14 @@ output file.
135135
This process will take a few hours, so plan accordingly.
136136

137137
Since this method offer the most flexibility for working with alignments,
138-
AlignmentProcessor was wrtien with this output format in mind and no further
138+
AlignmentProcessor was written with this output format in mind and no further
139139
formatting is required.
140140

141141
#-------------------------------
142142
# 2. Running AlignmentProcessor
143143
#-------------------------------
144144

145-
AlignmentProcessor is designed to convert the file into a useable format and
145+
AlignmentProcessor is designed to convert the file into a usable format and
146146
run the substitutions quickly, so everything can be run with one command. Each
147147
script can be run individually if necessary (each script's function and
148148
options will be discussed later). The input order for AlignmentProcessor's
@@ -152,6 +152,14 @@ To execute the AlignmentProcessor pipeline, you must first change into
152152
the package directory. Otherwise it will not be able to locate the scripts
153153
in the bin/ directory.
154154

155+
If you are running CodeML and the program is interrupted, you may call the
156+
07_CodeMLonDir.py script and it will continue where CodeML left off. This
157+
will save the time of having to run those genes through CodeML again (It will
158+
do the same thing if you call the entire pipeline again, but there is no need
159+
to rerun the previous steps). It will not do the same for KaKs_Calculator
160+
since KaKs_Calculator is much faster, so it should not be a problem to just
161+
invoke KaKs_Calculator on the whole directory again.
162+
155163
# Example Usage:
156164

157165
python AlignmentProcessor.py --ucsc --axt/phylip --kaks/codeml \
@@ -168,7 +176,7 @@ in the bin/ directory.
168176

169177
--ucsc This will invoke 00_convertHeader.py, which will convert the
170178
headers from UCSC fasta files so they only contain build
171-
names and gene IDs. This does not need to be run on Stich Gene
179+
names and gene IDs. This does not need to be run on Stitch Gene
172180
Blocks output.
173181

174182
--retainStops This will tell AlignmentProcessor to retain sequences
@@ -177,7 +185,7 @@ in the bin/ directory.
177185
from the analysis as they may bias the results.
178186

179187
--changeNames Tells the program to change genome build names to
180-
commom names (more below).
188+
common names (more below).
181189

182190
-% a decimal value specifying the minimum percentage of reads
183191
that must remain after replacing unknown codons with gaps
@@ -201,7 +209,7 @@ in the bin/ directory.
201209
This file must be titled "codeml.ctl" (the default
202210
name given by PAML).
203211

204-
-n if "--codeml" is selected, you may specify the number of CPUs
212+
-t if "--codeml" is selected, you may specify the number of CPUs
205213
to run CodeML. CodeML itself cannot be parallelized, but
206214
AlignmentProcessor can call multiple instances of CodeML to
207215
shorten overall run time. (Default = 1)
@@ -210,7 +218,7 @@ in the bin/ directory.
210218

211219
-h/--help will print the program's help dialogue
212220

213-
-v/--version will print the program version and copywright info
221+
-v/--version will print the program version and copyright info
214222

215223
--printNameList will print the contents of 02_nameList.txt which
216224
contains the list of genome builds and associated
@@ -237,16 +245,19 @@ in the bin/ directory.
237245
Codeml requires that all of its parameters be specified in one control
238246
file (http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf). Provide a
239247
control file with your desired parameters and AlignmentProcessor will
240-
use it as template. It will only alter the input and output files so
241-
that they are unique for each file. You may also need to provide a
242-
tree file for codeml (see PAML manual above).
248+
use it as template. AlignmentProcessor uses the Biopython codeml module
249+
to more easily edit the names of the input and output files. Because of
250+
this, however, you must make sure that the first three lines (starting
251+
with "seqfile", "treefile", or "outfile" have been removed from the
252+
control file. Otherwise, CodeML will presented only with the data
253+
contained in the control file, and not the whole directory of alingments.
243254

244255
The control file must be titled titled “codeml.ctl”, and it must be
245256
located in the output directory.
246257

247258
# The CodeML tree file
248259

249-
If your CodeML analysis requires a phylogneic tree, provide your
260+
If your CodeML analysis requires a phylogenic tree, provide your
250261
desired tree, titled "codeml.tree" in the output directory. Specify
251262
the tree as you want to appear to CodeML and be sure that the species
252263
names are specified as the common names in the 02_nameList.txt file,
@@ -261,13 +272,13 @@ in the bin/ directory.
261272

262273
# Invoking the Ka/Ks pipeline with a UCSC alignment:
263274

264-
python AlignmentProcessor0.8.py --axt --kaks --ucsc -r green_anole \
275+
python AlignmentProcessor0.11.py --axt --kaks --ucsc -r anoCar2 \
265276
-i anolis_gallus.fa -o pairwiseKaKs/
266277

267278
# Invoking the CodeML pipeline with a de novo alignment:
268279

269-
python AlignmentProcessor0.8.py --phylip --codeml -% 0.6 \
270-
-r green_anole -i anolis_gallus.fa -o codemlOutput/
280+
python AlignmentProcessor0.11.py --phylip --codeml -% 0.6 \
281+
-r anoCar2 -i anolis_gallus.fa -o codemlOutput/
271282

272283
#-------------------------------
273284
# 3. Individual Scripts
@@ -291,7 +302,7 @@ Remember that the order of the arguments does matter for these scripts.
291302

292303
# 01_SplitFastaFiles.py
293304

294-
This script will split the input mult-fasta alignment into one file
305+
This script will split the input multi-fasta alignment into one file
295306
per gene. It will produce an output file for a gene if it has at least
296307
two sequences.
297308

@@ -304,18 +315,18 @@ Remember that the order of the arguments does matter for these scripts.
304315
aligned multiple FASTA files and replace FASTA headers with each
305316
species' common name.
306317

307-
python 02_RemoveHeaderOnDir.py <path to inut and output directories>
318+
python 02_RemoveHeaderOnDir.py <path to input and output directories>
308319

309320
# 03_CheckFrame.py
310321

311322
This script removes gaps introduced in the reference sequence by the
312323
alignment and removes corresponding sites in other species. It assumes
313-
that the reference sequnce was in frame before any gaps were inserted,
324+
that the reference sequence was in frame before any gaps were inserted,
314325
and it returns the reference sequence to its original open reading
315326
frame. It will then replaces codons with missing nucleotides with gaps
316327
to remove unknown amino acids from the sequence.
317328

318-
python 03_CheckFrameOnDir.py <path to inut and output directories> \
329+
python 03_CheckFrameOnDir.py <path to input and output directories> \
319330
<reference_species>
320331

321332
# 04_CountBases.py
@@ -329,13 +340,13 @@ Remember that the order of the arguments does matter for these scripts.
329340
it on its own.
330341

331342
python 05_CountBasesOnDir.py <threshold percentage as a decimal> \
332-
<path to inut and output directories>
343+
<path to input and output directories>
333344

334345
# 05_ReplaceStopCodons.py
335346

336347
This program will remove the internal stop codons (TAA, TAG, TGA)
337348
and replace with gaps (---) from the nucleotide alignment. Some
338-
programs will not run properly if they enounter a premature stop
349+
programs will not run properly if they encounter a premature stop
339350
codon.
340351

341352
Terminal stop codons will be replaced, while sequences with internal
@@ -346,7 +357,7 @@ Remember that the order of the arguments does matter for these scripts.
346357
written to file.
347358

348359
python 05_ReplaceStopCodonsOnDir.py \
349-
<path to inut and output directories> --retainStops(optional)
360+
<path to input and output directories> --retainStops(optional)
350361

351362
# 06_FASTAtoAXT.py
352363

@@ -355,24 +366,24 @@ Remember that the order of the arguments does matter for these scripts.
355366
axt files.
356367

357368
Note: parseFastaIntoAXT.pl was provided by the developers of
358-
KaKs_Calculator and, as such, is the only perl script in this package.
369+
KaKs_Calculator and, as such, is the only Perl script in this package.
359370

360-
python FASTAtoAXTonDirectory.py <path to inut and output directories>
371+
python FASTAtoAXTonDirectory.py <path to input and output directories>
361372

362373
# 06_FASTAtoPhylip.py
363374

364375
This program will convert all files in an input directory
365376
from fasta format to a phylip format.
366377

367378
python 07_FASTAtoPhylip.py <number of species> \
368-
<path to inut and output directories>
379+
<path to input and output directories>
369380

370381
# 07_KaKsonDir.py
371382

372383
This program executes KaKs_Calculator on every file in a directory.
373384

374-
python 07_KaKsonDirectory.py <path to inut and output directories> \
375-
<name of refernce species>
385+
python 07_KaKsonDirectory.py <path to input and output directories> \
386+
<name of reference species>
376387

377388
# 07_CodeMLonDir.py
378389

@@ -381,28 +392,30 @@ Remember that the order of the arguments does matter for these scripts.
381392
codeml. It will overwrite the "seqfile", "treefile", "outfile" lines
382393
include the paths to the input phylip file, the output file, and the
383394
tree file. It will also call the ape R package to trim the tree file
384-
so that it only includes species which have not been filtered out.
395+
so that it only includes species which have not been filtered out. If you
396+
are running CodeML and the program is interrupted, you may invoke this
397+
script to pick up where you left off.
385398

386399
python 07_CodeMLonDir.py <path to codeml control file> \
387-
<path to input and output directories> \
388-
--retainStops(optional)
400+
<path to input and output directories>
389401

390-
3 07_pruneTree.R
402+
# 07_pruneTree.py
391403

392-
This R script will call the ape package to dynamically trim input
393-
trees for CodeML if any sequences have been removed. Species
394-
whose sequences were removed in steps 4 or 5 will be removed from
404+
This script will dynamically trim input trees for CodeML if any sequences
405+
have been removed. Species whose sequences were removed in steps 4 or 5
406+
and are no longer in the phylip alignment will be removed from
395407
the temporary tree given to CodeML.
396408

397-
(Caled by 07_CodeMLonDir.py)
409+
python 07_pruneTree.py <path to input directory> \
410+
<list of species remaining in alignment> <path to tmep output directory>
398411

399-
# 08_compileKaKs_CSV.py
412+
# 08_compileKaKs.py
400413

401414
This script concatonates the output from KaKs_Calculator into a text
402415
file. It adds a column for gene (or sequence) IDs, and prints the gene
403416
ID from the filename.
404417

405-
python compileCSV.py <path to inut and output directories>
418+
python compileCSV.py <path to input and output directories>
406419

407420
#-------------------------------
408421
# 4. Outputs
@@ -424,7 +437,7 @@ after the stop codons have been removed. If you also specified --codeml,
424437
AlignmentProcessor will edit and submit the control file to CodeML and the
425438
output files will be saved in 07_codeml. Since there are many different things
426439
that can be done with the codeml output files, AlignmentProcessor does not
427-
attmept to concatenate specific parts from the output files.
440+
attempt to concatenate specific parts from the output files.
428441

429442
If you wish to convert convert the files to both formats, specify both --axt
430443
and --phylip the program will convert the fasta files to both formats. You may
@@ -438,7 +451,7 @@ memory.
438451
#-------------------------------
439452

440453
# To test KaKs_Calculator:
441-
Change directory into the AlignmentProcessor folder. Paste the followig into
454+
Change directory into the AlignmentProcessor folder. Paste the following into
442455
a terminal:
443456

444457
python AlignmentProcessor.py --axt --kaks --ucsc -r anoCar2 \
@@ -448,10 +461,10 @@ This will return a text file with 11 lines.
448461

449462
# To test CodeML:
450463
The test directory already contains sample CodeML control and tree files, so
451-
all you need to do is change into the AlignmentProcessor direcotry and paste
464+
all you need to do is change into the AlignmentProcessor directory and paste
452465
the following:
453466

454-
python AlignmentProcessor.py --phylip --codeml --ucsc -n 2 -r anoCar2 \
467+
python AlignmentProcessor.py --phylip --codeml --ucsc -t 2 -r anoCar2 \
455468
-i codemlTest.fa -o test/
456469

457470
There should be 8 .mlc files in the 07_codeml directory.

bin/01_SplitFastaFiles.py

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -19,27 +19,32 @@ def splitFasta(infile, path):
1919
n = 0
2020
for line in fasta:
2121
if line != "\n":
22-
# Concatenate lines for each gene
22+
# Concatenate lines for all species for each gene
2323
seq += str(line)
2424
if line[0] == ">":
2525
# Determine number of sequences and species names
2626
n += 1
2727
if newid == True:
28-
filename = str(line.split(".")[1])
28+
try:
29+
filename = str(line.split(".")[1]).rstrip()
30+
except IndexError:
31+
print(line)
2932
newid = False
3033
elif line == "\n" and newid == False:
34+
# Use empty lines to determine where genes end
3135
if n >= 2:
3236
# Print gene sequences to file if there are at least two
3337
# species and reset for next gene
3438
outfile = (path + "01_splitFastaFiles/" + filename + "."
3539
+ str(n) + ".fa")
3640
with open(outfile, "w") as output:
37-
output.write(seq)
38-
written += 1
41+
output.write(seq)
42+
written += 1
3943
newid = True
4044
seq = ""
4145
n = 0
42-
else:
46+
elif n < 2:
47+
# Record genes with only one sequence
4348
runlog.write(filename + "\n")
4449
excluded += 1
4550
# Write out total number of genes written and excluded

0 commit comments

Comments
 (0)