@@ -5,19 +5,19 @@ FastOMA is a scalable software package to infer orthology relationship.
5
5
# Input and Output:
6
6
7
7
### Input:
8
-
1- Sets of protein sequences in FASTA format (with `.fa` extension) in the folder `proteome`.
9
-
The name of each fasta file is the name of species. Please make sure that the name of fasta records do not contain special characters including `||`.
8
+
1. Sets of protein sequences in FASTA format (with `.fa` extension) in the folder `proteome`.
9
+
The name of each FASTA file is the name of the species. Please make sure that the names of the FASTA records do not contain special characters, including `||`.
10
10
11
-
12
-
2- The omamer database which you can download [this](https://omabrowser.org/All/LUCA-v2.0.0.h5)
13
-
which is from [OMA browser](https://omabrowser.org/oma/current/).
14
-
This file is `13 Gb` containing all the gene families of the Tree of Life or you can download it for a subset of them, e.g. Primates (352MB).
15
-
16
-
3- Rooted Species tree in [newick format](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-newick-trees).
17
-
A rough species tree is enough and it does not need to be binary. Besides, we do not need branch lengths.
11
+
2. Rooted Species tree in [newick format](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-newick-trees).
12
+
A rough species tree is enough, and it does not need to be binary (fully resolved). Besides, we do not need branch lengths.
18
13
Note that the leaf names of the tree (species names) must match the FASTA file names (without the `.fa` extension) from item 1.
19
14
Leaf names and internal node names must not contain any duplicates, and the tree must not contain quotation marks.
20
15
16
+
3. The omamer database which is available for download from the [OMA browser](https://omabrowser.org/oma/current/).
17
+
The FastOMA workflow will automatically download the omamer database for LUCA if the argument `--omamer_db` is not
18
+
provided on the command line. The argument can be a local file (e.g. a previously downloaded omamer database file) or
19
+
a URL to an alternative omamer database, e.g. a subset of the LUCA database which is smaller. However, we recommend
20
+
using the LUCA database if possible.
21
21
22
22
23
23
You can see an example in the [testdata](https://github.com/sinamajidian/FastOMA/tree/master/testdata/in_folder) folder.
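The input requirements above can be checked before launching a run. The following is a minimal Python sketch and not part of FastOMA: the function names `leaf_names` and `check_inputs` are hypothetical, and the Newick handling is a simple regular expression that assumes a tree without quotes or branch lengths, as required above.

```python
import re
from pathlib import Path


def leaf_names(newick: str) -> list[str]:
    """Return leaf labels of a Newick string (assumes no quotes, no branch lengths)."""
    # A leaf label directly follows "(" or ","; internal labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def check_inputs(proteome_dir: str, newick: str) -> list[str]:
    """Collect human-readable problems; an empty list means none were found."""
    problems = []
    fasta_files = sorted(Path(proteome_dir).glob("*.fa"))
    species = {p.stem for p in fasta_files}
    leaves = leaf_names(newick)

    if len(leaves) != len(set(leaves)):
        problems.append("duplicate leaf names in species tree")
    for name in sorted(species - set(leaves)):
        problems.append(f"{name}.fa has no leaf in the species tree")
    for name in sorted(set(leaves) - species):
        problems.append(f"leaf {name} has no FASTA file")

    # Record names (header lines) must not contain "||".
    for fasta in fasta_files:
        for line in fasta.read_text().splitlines():
            if line.startswith(">") and "||" in line:
                problems.append(f"{fasta.name}: record name contains '||'")
                break
    return problems
```

Calling `check_inputs("in_folder/proteome", tree_text)` with the content of the species tree file returns a list of messages that can be printed or turned into an error.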
@@ -28,35 +28,40 @@ $ cat species_tree.nwk
28
28
((AQUAE,CHLTR)inter1,MYCGE)inter2;
29
29
```
30
30
31
-
Besides, the internal node should not contain any special character (e.g. `\``/` or space).
31
+
Besides, the names of internal nodes should not contain any special characters (e.g. `\`, `/` or `space`).
32
32
The reason is that FastOMA writes some files whose names contain the internal node's name.
33
33
If the species tree does not have labels for some or all internal nodes, FastOMA labels them sequentially.
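To illustrate the naming rule for internal nodes, here is a small Python sketch. It is not FastOMA's own check: the function names are hypothetical, and the set of "safe" characters (letters, digits, `_`, `-`, `.`) is an assumption chosen for the example.

```python
import re

# Assumption: anything outside letters, digits, "_", "-" and "." counts as special.
SAFE_LABEL = re.compile(r"^[A-Za-z0-9_.-]+$")


def internal_labels(newick: str) -> list[str]:
    """Labels that directly follow a closing parenthesis (empty if unlabelled)."""
    return [label.strip() for label in re.findall(r"\)([^(),:;]*)", newick)]


def unsafe_internal_labels(newick: str) -> list[str]:
    """Non-empty internal labels that would be risky inside a file name."""
    return [
        label
        for label in internal_labels(newick)
        if label and not SAFE_LABEL.match(label)
    ]
```

For the example tree `((AQUAE,CHLTR)inter1,MYCGE)inter2;`, `internal_labels` returns `inter1` and `inter2`, and `unsafe_internal_labels` returns an empty list. Unlabelled internal nodes show up as empty strings and are left for FastOMA to label.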
34
34
35
35
36
-
### Input check:
37
-
38
-
After installing FastOMA, you can have a initial check for your input dataset by running the following in the folder `in_folder`:
39
-
40
-
```
41
-
cd in_folder
42
-
check-fastoma-input
43
-
```
44
-
45
-
46
36
47
37
### Main output:
48
-
Orthology information as HOG strcutre in [OrthoXML](https://orthoxml.org/) format
38
+
Orthology information as HOG structure in [OrthoXML](https://orthoxml.org/) format
49
39
which can be used with [PyHAM](https://github.com/DessimozLab/pyham).
50
-
The details of output are described [below](https://github.com/DessimozLab/FastOMA#expected-output-structure-for-test-data).
40
+
The details of output are described [below](#expected-output-structure-for-test-data).
41
+
42
+
Additionally, FastOMA generates TSV files for rootlevel HOGs (deepest level) and
43
+
marker gene groups (at most one gene per species) together with dumps of FASTA
44
+
files (one per rootlevel HOG / marker gene).
51
45
52
46
53
47
# How to run FastOMA
54
-
In summary, you need to 1) install FastOMA and its prerequisites (below), and 2) put the input files in the folder `in_folder`
55
-
and 3) run FastOMA using the nextflow recipe `FastOMA.nf`.
56
-
```
57
-
nextflow run FastOMA.nf -profile docker --input_folder /path/to/in_folder --output_folder /path/to/out_folder
48
+
49
+
FastOMA is implemented as a [nextflow-workflow](https://www.nextflow.io/). As such, FastOMA can be run without
50
+
any installation steps, provided the system can run docker containers or singularity containers, or has conda
51
+
installed.
52
+
53
+
```bash
54
+
nextflow run dessimozlab/FastOMA -profile docker --input_folder /path/to/in_folder --output_folder /path/to/out_folder
58
55
```
59
-
The script `FastOMA.nf` is tailored for a few species. To run FastOMA with hundreds of species, please use `FastOMA.nf`.
56
+
57
+
Nextflow will automatically fetch the [dessimozlab/FastOMA](https://github.com/dessimozlab/FastOMA) repository and start
58
+
the `FastOMA.nf` workflow. The `-profile` argument must be used to specify the profile to use. We support `docker`,
59
+
`singularity` and `conda`, each of which automatically sets up the necessary tools by downloading the required containers or creating
60
+
a conda environment with the necessary dependencies.
61
+
62
+
See also [How to install FastOMA](#how-to-install-fastoma) for additional ways to install and run FastOMA. Note also the
63
+
section on the different [profiles](#using-different-nextflow-profiles).
64
+
60
65
61
66
## More details on how to run
62
67
For every commit of the repository, we provide a docker image for FastOMA on dockerhub. You can specify the container as
@@ -71,65 +76,101 @@ nextflow run FastOMA.nf -profile docker \
71
76
```
72
77
73
78
74
-
75
79
# How to install FastOMA
76
80
77
-
## prerequisites
81
+
## Running workflow directly
78
82
79
-
First, we create a fresh [conda](https://docs.conda.io/en/latest/miniconda.html) environment.
80
-
```
81
-
conda create --name FastOMA python=3.9
82
-
conda activate FastOMA
83
-
python -m pip install --upgrade pip
84
-
```
85
-
You may use conda to install [fasttree](http://www.microbesonline.org/fasttree/), [mafft](http://mafft.cbrc.jp/alignment/software/). and [openjdk](https://jdk.java.net/java-se-ri/17) (the alternative for Java 11< version <17 which is needed for nextflow).
86
-
```
87
-
conda install -c bioconda mafft fasttree
88
-
conda install -c conda-forge openjdk=16
89
-
```
83
+
The FastOMA workflow can be run directly using nextflow's ability to fetch a workflow from github. A specific version
84
+
can be selected with nextflow's `-r` option:
90
85
91
-
## How to install FastOMA
92
-
First, download the FastOMA package:
86
+
```bash
87
+
nextflow run dessimozlab/FastOMA -r 0.2.0 -profile conda
## Manual installation (for development) in python virtual environment
101
+
102
+
- install [mafft](https://mafft.cbrc.jp/alignment/software) and [FastTree](http://www.microbesonline.org/fasttree/) and ensure the software is accessible on the PATH.
103
+
- install python >= 3.9
104
+
- create a virtual environment, activate it and install FastOMA with additional extras inside it:
105
+
```bash
106
+
python3 -m venv .venv
107
+
source .venv/bin/activate
108
+
pip install FastOMA[report,nextflow]
109
+
```
110
+
You can also install FastOMA from a clone of the repository in editable mode with `pip install -e .[report,nextflow]`.
111
+
112
+
- run the pipeline on the provided test data:
113
+
```bash
114
+
nextflow run FastOMA.nf -profile standard --input_folder testdata/in_folder --output_folder output -with-report
115
+
```
116
+
117
+
118
+
## Manual installation in conda/mamba environment
119
+
In the FastOMA repository, we provide a conda environment file that can be used to generate a conda / mamba
You can always check that you are using the python you intended with `which python` and `which python3`.
130
-
If you face any difficulty during installation, feel free to create a [github issue](https://github.com/DessimozLab/FastOMA/issues), we'll try to solve it toghter.
153
+
This will use the container that is tagged with the current commit id. Similarly, one could also use
154
+
`--container_version "0.2.0"` to use the container with version `dessimozlab/fastoma:0.2.0` from dockerhub.
131
155
156
+
## Singularity
157
+
With `-profile singularity`, singularity containers will be used to run the workflow. This requires singularity to
158
+
be installed on your system. The containers are automatically pulled from dockerhub and converted to singularity
159
+
containers. The same options as for [Docker](#docker) will be available.
132
160
161
+
## Conda
162
+
With `-profile conda`, the FastOMA workflow will create a conda environment which contains the necessary
163
+
dependencies and use this environment to run the workflow steps. Note that this environment does not need
164
+
to be activated manually. If you prefer to install the dependencies inside a conda or mamba environment
165
+
yourself, this can be achieved as described in [Manual installation in conda/mamba environment](#manual-installation-in-condamamba-environment).
166
+
167
+
## Slurm (with singularity/conda)
168
+
On an HPC system you typically run processes using a scheduler system such as slurm or LSF. We provide
169
+
profiles `-profile slurm`, `-profile slurm_singularity` and `-profile slurm_conda` to run FastOMA with
170
+
the respective engine using [slurm](https://slurm.schedmd.com/overview.html) as a scheduler system.
171
+
If you need a different scheduler, it is quite straightforward to
172
+
set it up in `nextflow.config` based on the existing profiles and the documentation of
The script `FastOMA.nf` is tailored for a few species. In real case scenario, please use `FastOMA.nf`.
160
-
The only difference between these two scripts is the amount of CPU and memory assigned to each job.
161
-
162
205
163
206
Note that, to have a comprehensive test, the default number of required CPUs is set to 10.
164
207
@@ -177,54 +220,89 @@ After few minutes, the run for test data finishes.
177
220
The first step is to run [OMAmer](https://github.com/DessimozLab/omamer) to find the putative gene families (putative rootHOGs) based on k-mer similarity.
178
221
Next, we write them to FASTA files, which can be used to run the next steps in parallel, one FASTA file per gene family.
179
222
Then, to have jobs of similar size, we batch these FASTA files either as one big rootHOG per job (`hog_big`) or as a few hundred together in one job (`hog_rest`).
180
-
These are decided based on the FASTA file size. Finally once all jobs of `hog_big` and `hog_rest` are done, we `collect_subhog` and save all outputs.
181
-
223
+
These are decided based on the FASTA file size. Finally, once all jobs of `hog_big` and `hog_rest` are done, we `collect_subhog` and save all outputs.
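The batching idea described above can be sketched in a few lines of Python. This is only an illustration of size-based batching: the function name, the size threshold and the batch size are hypothetical and are not the values FastOMA uses in its nextflow workflow.

```python
def batch_roothogs(sizes: dict[str, int], big_threshold: int, rest_batch_size: int):
    """Split rootHOG FASTA files into one-file big jobs and grouped small jobs.

    sizes maps a FASTA file name to its size in bytes. Both big_threshold and
    rest_batch_size are illustrative parameters, not FastOMA defaults.
    """
    # Each large file becomes its own job (the hog_big case).
    big = [[name] for name, size in sorted(sizes.items()) if size >= big_threshold]
    # Small files are grouped into fixed-size batches (the hog_rest case).
    small = sorted(name for name, size in sizes.items() if size < big_threshold)
    rest = [
        small[i:i + rest_batch_size]
        for i in range(0, len(small), rest_batch_size)
    ]
    return big, rest


example_sizes = {
    "HOG_1.fa": 5_000_000,
    "HOG_2.fa": 1_200,
    "HOG_3.fa": 800,
    "HOG_4.fa": 300,
}
big_jobs, rest_jobs = batch_roothogs(
    example_sizes, big_threshold=1_000_000, rest_batch_size=2
)
```

Here `big_jobs` holds one job for `HOG_1.fa`, while the three small files are spread over two batches in `rest_jobs`.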
182
224
183
225
If the run is interrupted, you may be able to continue your previous nextflow job by adding `-resume` to the nextflow command line.
184
226
185
227
186
228
## expected output structure for test data
187
229
188
-
The output of FastOMA includes four files
189
-
(`OrthologousGroupsFasta.tsv`, `rootHOGs.tsv`, `output_hog.orthoxml` and `species_tree_checked.nwk`) and four folders
190
-
(`hogmap`, `OrthologousGroupsFasta`, `temp_pickles` and `temp_output`).
230
+
The output of FastOMA includes several output files regarding orthology inference
231
+
(`OrthologousGroups.tsv`, `RootHOGs.tsv`, `FastOMA_HOGs.orthoxml`, `orthologs.tsv.gz` and `species_tree_checked.nwk`),
232
+
a jupyter notebook based report about the dataset (`report.ipynb` and `report.html`) and four folders
233
+
(`hogmap`, `OrthologousGroupsFasta`, `RootHOGsFasta` and `stats`).
191
234
192
235
The `hogmap` folder includes the output of [OMAmer](https://github.com/DessimozLab/omamer); each file corresponds to an input proteome.
193
236
The folder `OrthologousGroupsFasta` includes FASTA files, and all proteins inside each FASTA file are orthologous to each other.
194
237
These could be used as gene markers for species tree inference with refined resolution, [more info](https://f1000research.com/articles/9-511).
195
-
Note that Orthologous Groups are groups of strict orthologs, with at most 1 representative per species.
196
-
Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.
238
+
Note that OrthologousGroups are groups of strict orthologs, with at most 1 representative per species.
239
+
Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level. The file
240
+
`FastOMA_HOGs.orthoxml` contains all the nested groups in orthoxml format. The `RootHOGs.tsv` and `RootHOGsFasta/` files contain
241
+
the groups at the deepest level.
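As a rough illustration of how the nested groups relate to the groups at the deepest level, the sketch below lists the top-level `orthologGroup` elements of an OrthoXML document together with all proteins nested inside them, using only the Python standard library. The embedded document is a hand-written miniature example, not real FastOMA output, and the function name `root_hogs` is hypothetical. For real analyses, PyHAM is the tool mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature OrthoXML document (assumed namespace and attributes).
EXAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.5" origin="example" originVersion="0">
  <species name="AQUAE" NCBITaxId="0">
    <database name="example" version="0">
      <genes><gene id="1" protId="AQUAE_1"/><gene id="2" protId="AQUAE_2"/></genes>
    </database>
  </species>
  <species name="MYCGE" NCBITaxId="0">
    <database name="example" version="0">
      <genes><gene id="3" protId="MYCGE_1"/></genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="HOG:1">
      <geneRef id="3"/>
      <paralogGroup><geneRef id="1"/><geneRef id="2"/></paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def root_hogs(orthoxml_text: str) -> dict[str, list[str]]:
    """Map each top-level orthologGroup id to the protIds nested anywhere inside it."""
    root = ET.fromstring(orthoxml_text)
    prot_ids = {
        el.get("id"): el.get("protId")
        for el in root.iter()
        if local(el.tag) == "gene"
    }
    hogs = {}
    for child in root:
        if local(child.tag) != "groups":
            continue
        for group in child:
            if local(group.tag) == "orthologGroup":
                hogs[group.get("id")] = [
                    prot_ids[ref.get("id")]
                    for ref in group.iter()
                    if local(ref.tag) == "geneRef"
                ]
    return hogs
```

For a file on disk, the same function can be called with the file's text, for example `root_hogs(open("FastOMA_HOGs.orthoxml").read())`, although very large files may call for a streaming parser instead.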
197
242
198
243
So, the following files and folders should appear in the folder `out_folder` that was passed as the `--output_folder` argument.
among which `output_hog.orthoxml` is the final output in [orthoXML format](https://orthoxml.org/0.4/orthoxml_doc_v0.4.html). Its content looks like this
among which `FastOMA_HOGs.orthoxml` is the final output in [orthoXML format](https://orthoxml.org/0.4/orthoxml_doc_v0.4.html). Its content looks like this