
Commit 4fd144b

[FIX] conda environment, better readme
1 parent c964c3f commit 4fd144b

3 files changed

Lines changed: 188 additions & 102 deletions

File tree

README.md

Lines changed: 179 additions & 100 deletions
@@ -5,19 +5,19 @@ FastOMA is a scalable software package to infer orthology relationship.
55
# Input and Output:
66

77
### Input:
8-
1- Sets of protein sequences in FASTA format (with `.fa` extension) in the folder `proteome`.
9-
The name of each fasta file is the name of species. Please make sure that the name of fasta records do not contain special characters including `||`.
8+
1. Sets of protein sequences in FASTA format (with `.fa` extension) in the folder `proteome`.
9+
The name of each fasta file is the name of species. Please make sure that the name of fasta records do not contain special characters including `||`.
1010

11-
12-
2- The omamer database which you can download [this](https://omabrowser.org/All/LUCA-v2.0.0.h5)
13-
which is from [OMA browser](https://omabrowser.org/oma/current/).
14-
This file is `13 Gb` containing all the gene families of the Tree of Life or you can download it for a subset of them, e.g. Primates (352MB).
15-
16-
3- Rooted Species tree in [newick format](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-newick-trees).
17-
A rough species tree is enough and it does not need to be binary. Besides, we do not need branch lengths.
11+
2. Rooted Species tree in [newick format](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-newick-trees).
12+
A rough species tree is enough, and it does not need to be binary (fully resolved). Besides, we do not need branch lengths.
1813
Note that the name of leaves of the tree (species name) should be the same as the file name of FASTAs (without `.fa` extension) (item 1).
1914
There must not be any repeated names among the leaf names and internal node names. The tree should not contain quotation marks.
2015

16+
3. The omamer database which is available for download from the [OMA browser](https://omabrowser.org/oma/current/).
17+
The FastOMA workflow will automatically download the omamer database for LUCA if the argument `--omamer_db` is not
18+
provided on the command line. The argument can be a local file (e.g. a previously downloaded omamer database file) or
19+
a URL to an alternative omamer database, e.g. a subset of the LUCA database which is smaller. However, we recommend
20+
using the LUCA database if possible.
2121

2222

2323
You can see an example in the [testdata](https://github.com/sinamajidian/FastOMA/tree/master/testdata/in_folder) folder.
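As a rough illustration of the record-name rule in item 1, the sketch below scans FASTA header lines for `||`. It is not part of FastOMA; the function name and the example records (including the placeholder sequences) are assumptions made for this illustration, and only the `||` case named above is checked.

```python
def records_with_double_pipe(fasta_lines):
    """Return FASTA record names (header text after '>') that contain '||'."""
    offending = []
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].strip()
            if "||" in name:
                offending.append(name)
    return offending


# Hypothetical input: one well-formed record name and one containing '||'.
example = [
    ">sp|P47500|RF1_MYCGE",
    "MKLSE",  # placeholder sequence, not real data
    ">bad||record",
    "MAAAK",  # placeholder sequence, not real data
]
print(records_with_double_pipe(example))
```

To check a real proteome, pass an open file handle for one of the `.fa` files in the `proteome` folder instead of the example list.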
@@ -28,35 +28,40 @@ $ cat species_tree.nwk
2828
((AQUAE,CHLTR)inter1,MYCGE)inter2;
2929
```
3030

31-
Besides, the internal node should not contain any special character (e.g. `\` `/` or space).
31+
Besides, the internal node should not contain any special character (e.g. `\` `/` or `space`).
3232
The reason is that FastOMA writes some files whose names contain the internal node's name.
3333
If the species tree does not have labels for some or all internal nodes, FastOMA labels them sequentially.
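The naming rules for the species tree can be checked before a run with a few lines of Python. The following is a minimal sketch, not FastOMA's own input check: it assumes an unquoted newick string, uses a simple regular expression instead of a full newick parser, and the function names and report keys are invented for this example.

```python
import re
from collections import Counter


def newick_labels(newick):
    """Split the labels of an unquoted newick string into (leaves, internal_nodes)."""
    leaves, internals = [], []
    # A label right after ')' names an internal node; after '(' or ',' it is a leaf.
    # Branch lengths follow ':' and are therefore never captured as labels.
    for delimiter, label in re.findall(r"([(),])([^(),:;]+)", newick):
        label = label.strip()
        if label:
            (internals if delimiter == ")" else leaves).append(label)
    return leaves, internals


def check_species_tree(newick, fasta_file_names):
    """Report violations of the naming rules described above."""
    leaves, internals = newick_labels(newick)
    species = {name[: -len(".fa")] for name in fasta_file_names if name.endswith(".fa")}
    counts = Counter(leaves + internals)
    return {
        "leaves_without_fasta": sorted(set(leaves) - species),
        "fasta_without_leaf": sorted(species - set(leaves)),
        "repeated_names": sorted(name for name, n in counts.items() if n > 1),
        "bad_internal_names": sorted(
            name for name in internals if any(ch in name for ch in "\\/ ")
        ),
    }


report = check_species_tree(
    "((AQUAE,CHLTR)inter1,MYCGE)inter2;",
    ["AQUAE.fa", "CHLTR.fa", "MYCGE.fa"],
)
print(report)
```

An empty list under every key means the tree and the FASTA file names agree. The regular expression does not handle quoted labels or a tree consisting of a single leaf.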
3434

3535

36-
### Input check:
37-
38-
After installing FastOMA, you can have a initial check for your input dataset by running the following in the folder `in_folder`:
39-
40-
```
41-
cd in_folder
42-
check-fastoma-input
43-
```
44-
45-
4636

4737
### Main output:
48-
Orthology information as HOG strcutre in [OrthoXML](https://orthoxml.org/) format
38+
Orthology information as HOG structure in [OrthoXML](https://orthoxml.org/) format
4939
which can be used with [PyHAM](https://github.com/DessimozLab/pyham).
50-
The details of output are described [below](https://github.com/DessimozLab/FastOMA#expected-output-structure-for-test-data).
40+
The details of output are described [below](#expected-output-structure-for-test-data).
41+
42+
Additionally, FastOMA generates TSV files for rootlevel HOGs (deepest level) and
43+
marker gene groups (at most one gene per species) together with dumps of fasta
44+
files (one per rootlevel HOG / marker gene).
5145

5246

5347
# How to run FastOMA
54-
In summary, you need to 1) install FastOMA and its prerequisites (below), and 2) put the input files in the folder `in_folder`
55-
and 3) run FastOMA using the nextflow recipe `FastOMA.nf`.
56-
```
57-
nextflow run FastOMA.nf -profile docker --input_folder /path/to/in_folder --output_folder /path/to/out_folder
48+
49+
FastOMA is implemented as a [nextflow-workflow](https://www.nextflow.io/). As such, FastOMA can be run without
50+
any installation steps, provided the system can run docker or singularity containers or has conda
51+
installed.
52+
53+
```bash
54+
nextflow run dessimozlab/FastOMA -profile docker --input_folder /path/to/in_folder --output_folder /path/to/out_folder
5855
```
59-
The script `FastOMA.nf` is tailored for a few species. To run FastOMA with hundreds of species, please use `FastOMA.nf`.
56+
57+
Nextflow will automatically fetch the [dessimozlab/FastOMA](https://github.com/dessimozlab/FastOMA) repository and start
58+
the `FastOMA.nf` workflow. The `-profile` argument must be used to specify the profile to use. We support `docker`,
59+
`singularity` and `conda` which then automatically set up the necessary tools by downloading the required containers or creating
60+
a conda environment with the necessary dependencies.
61+
62+
See also [How to install FastOMA](#how-to-install-FastOMA) for additional ways to install and run FastOMA. Note also the
63+
section on the different [profiles](#using-different-nextflow-profiles).
64+
6065

6166
## More details on how to run
6267
We provide for every commit of the repository a docker image for FastOMA on dockerhub. You can specify the container as
@@ -71,65 +76,101 @@ nextflow run FastOMA.nf -profile docker \
7176
```
7277

7378

74-
7579
# How to install FastOMA
7680

77-
## prerequisites
81+
## Running workflow directly
7882

79-
First, we create a fresh [conda](https://docs.conda.io/en/latest/miniconda.html) environment.
80-
```
81-
conda create --name FastOMA python=3.9
82-
conda activate FastOMA
83-
python -m pip install --upgrade pip
84-
```
85-
You may use conda to install [fasttree](http://www.microbesonline.org/fasttree/), [mafft](http://mafft.cbrc.jp/alignment/software/). and [openjdk](https://jdk.java.net/java-se-ri/17) (the alternative for Java 11< version <17 which is needed for nextflow).
86-
```
87-
conda install -c bioconda mafft fasttree
88-
conda install -c conda-forge openjdk=16
89-
```
83+
The FastOMA workflow can be run directly using nextflow's ability to fetch a workflow from github. A specific version
84+
can be selected with nextflow's `-r` option:
9085

91-
## How to install FastOMA
92-
First, download the FastOMA package:
86+
```bash
87+
nextflow run dessimozlab/FastOMA -r 0.2.0 -profile conda
9388
```
94-
wget https://github.com/DessimozLab/FastOMA/archive/refs/heads/main.zip
95-
unzip main.zip
96-
mv FastOMA-main FastOMA
89+
90+
This will fetch version 0.2.0 from github and run the FastOMA workflow using the conda profile.
91+
92+
## Cloning the FastOMA repo and running from there
93+
94+
```bash
95+
git clone https://github.com/DessimozLab/FastOMA.git
96+
cd FastOMA
97+
nextflow run FastOMA.nf -profile docker --container_version "sha-$(git rev-list --max-count=1 --abbrev-commit HEAD)" ...
9798
```
98-
Then install it
99+
100+
## Manual installation (for development) in python virtual environment
101+
102+
- install [mafft](https://mafft.cbrc.jp/alignment/software) and [FastTree](http://www.microbesonline.org/fasttree/) and ensure the software is accessible on the PATH.
103+
- install python >= 3.9
104+
- create virtual environment, activate it and install FastOMA with additional extras inside it:
105+
```bash
106+
python3 -m venv .venv
107+
source .venv/bin/activate
108+
pip install FastOMA[report,nextflow]
109+
```
110+
You can also install FastOMA from a clone of the repository in editable mode with `pip install -e .[report,nextflow]`.
111+
112+
- run the pipeline with some test data:
113+
```bash
114+
nextflow run FastOMA.nf -profile standard --input_folder testdata/in_folder --output_folder output -with-report
115+
```
116+
117+
118+
## Manual installation in conda/mamba environment
119+
In the FastOMA repository, we provide a conda environment file that can be used to generate a conda / mamba
120+
environment:
99121
```
100-
ls FastOMA/setup.py
101-
python -m pip install -e FastOMA
122+
git clone https://github.com/DessimozLab/FastOMA.git
123+
124+
mamba env create -n FastOMA -f environment-conda.yml
125+
mamba activate FastOMA
102126
```
103127

104-
The output would be
128+
Afterwards, you can run the workflow using nextflow (which is installed as part of the conda environment)
129+
105130
```
106-
...
107-
Running setup.py develop for FastOMA
108-
Successfully installed Cython-3.0.1 DendroPy-4.6.1 biopython-1.81 blosc2-2.0.0 ete3-3.1.3 future-0.18.3 humanfriendly-10.0 llvmlite-0.40.1 lxml-4.9.3 msgpack-1.0.5 nextflow-23.4.3 numba-0.57.1 numexpr-2.8.5 numpy-1.24.4 omamer-0.2.6 packaging-23.1 pandas-2.0.3 property-manager-3.0 py-cpuinfo-9.0.0 pyparsing-3.1.1 pysais-1.1.0 python-dateutil-2.8.2 pytz-2023.3 scipy-1.11.2 six-1.16.0 tables-3.8.0 tqdm-4.66.1 tzdata-2023.3 verboselogs-1.7
109-
FastOMA-0.0.6
131+
nextflow run FastOMA.nf -profile standard|slurm --input_folder /path/to/input_folder --output_folder /path/to/output
110132
```
111133

112-
You can check your installation with running one of submodules of FastOMa
113-
```
114-
fastoma-infer-roothogs --version
115-
```
134+
Note that you should use either the `standard` or the `slurm` profile so that the nextflow executor uses the activated environment.
116135

117-
You can make sure that omamer and nextflow is installed with running
118-
```
119-
omamer -h
120-
nextflow -h
121-
```
136+
# Using different nextflow profiles
137+
138+
Nextflow provides support to run a workflow on different infrastructures. Selection of this is done using the `-profile` argument.
139+
For FastOMA, we've implemented the following profiles below. Additional ones can also be created by specifying them in the `nextflow.config` file.
140+
141+
## Docker
142+
With `-profile docker` one can use docker as an execution platform. It requires docker to be installed on the system. The pipeline
143+
will automatically fetch missing containers from dockerhub (e.g. dessimozlab/fastoma) if not found locally. By default, the version
144+
`latest` is used by the pipeline, however we provide images for any branch and release as well; even for every recent commit.
145+
One can select the desired container via the `--container_version` argument
122146

123-
If it doesn't work, you may need to have the following for nextflow to work.
124147
```
125-
JAVA_HOME="/path/to/jdk-17"
126-
NXF_JAVA_HOME="/path/to/jdk-17"
127-
export PATH="/path/to/jdk-17/bin:$PATH"
148+
nextflow run FastOMA.nf -profile docker \
149+
--container_version "sha-$(git rev-list --max-count=1 --abbrev-commit HEAD)" \
150+
--input_folder testdata/in_folder \
151+
--output_folder myresult/
128152
```
129-
You can always make sure whether you are using the python that you intended to use with `which python` and `which python3`.
130-
If you face any difficulty during installation, feel free to create a [github issue](https://github.com/DessimozLab/FastOMA/issues), we'll try to solve it toghter.
153+
This will use the container that is tagged with the current commit id. Similarly, one could also use
154+
`--container_version "0.2.0"` to use the container with version `dessimozlab/fastoma:0.2.0` from dockerhub.
131155

156+
## Singularity
157+
With `-profile singularity` singularity containers will be used to run the workflow. It requires singularity to
158+
be installed on your system. The containers are automatically pulled from dockerhub and converted to singularity
159+
containers. The same options as for [Docker](#docker) will be available.
132160

161+
## Conda
162+
With `-profile conda`, the FastOMA workflow will create a conda environment which contains the necessary
163+
dependencies and use this environment to run the workflow steps. Note that this environment does not need
164+
to be activated manually. If you prefer to install the dependencies inside a conda or mamba environment
165+
yourself, this can be achieved as described in [Manual installation in conda/mamba environment](#manual-installation-in-condamamba-environment).
166+
167+
## Slurm (with singularity/conda)
168+
On an HPC system you typically run processes using a scheduler system such as slurm or LSF. We provide
169+
profiles `-profile slurm`, `-profile slurm_singularity` and `-profile slurm_conda` to run FastOMA with
170+
the respective engine using [slurm](https://slurm.schedmd.com/overview.html) as a scheduler system.
171+
If you need a different scheduler, it is quite straightforward to
172+
set it up in `nextflow.config` based on the existing profiles and the documentation of
173+
[nextflow executors](https://www.nextflow.io/docs/latest/executor.html).
133174

134175
# How to run FastOMA on the test data
135176
Then, cd to the `testdata` folder and download the omamer database and change its name to `omamerdb.h5`.
@@ -154,11 +195,13 @@ $ tree ../testdata/in_folder
154195
Finally, run the package using nextflow as below:
155196
```
156197
# cd FastOMA/testdata
157-
nextflow ../FastOMA.nf --input_folder in_folder --output_folder out_folder -with-report
198+
nextflow run ../FastOMA.nf \
199+
--input_folder in_folder \
200+
--omamer_db testdata/in_folder/omamer_db.h5 \
201+
--output_folder out_folder \
202+
--report \
203+
-profile standard
158204
```
159-
The script `FastOMA.nf` is tailored for a few species. In real case scenario, please use `FastOMA.nf`.
160-
The only difference between these two scripts is the amount of CPU and memory assigned to each job.
161-
162205

163206
Note that to have a comprehensive test, we set the default value of needed cpus as 10.
164207

@@ -177,54 +220,89 @@ After few minutes, the run for test data finishes.
177220
The first step is to run [OMAmer](https://github.com/DessimozLab/omamer) for finding the putative gene families (putative rootHOG) based on kmer similarity.
178221
Next, we write them to FASTA files, so that the next steps can run in parallel on each gene family.
179222
Then, to have similar size jobs, we batch these FASTA files either as one big roothog (per job `hog_big`) or a few hundreds together as one job `hog_rest`.
180-
These are decided based on the FASTA file size. Finally once all jobs of `hog_big` and `hog_rest` are done, we `collect_subhog` and save all outputs.
181-
223+
These are decided based on the FASTA file size. Finally, once all jobs of `hog_big` and `hog_rest` are done, we `collect_subhog` and save all outputs.
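The batching idea described here can be sketched as follows. This is an illustration only, not the code FastOMA runs: the function name, the size threshold, the batch size and the file names are all assumptions chosen for the example.

```python
def batch_roothog_fastas(file_sizes, big_threshold, rest_batch_size):
    """Group rootHOG FASTA files into jobs based on file size.

    file_sizes maps a file name to its size in bytes.
    Returns (hog_big_jobs, hog_rest_jobs), each a list of lists of file names.
    """
    names = sorted(file_sizes)
    # Every large file becomes a job of its own (the `hog_big` case).
    big_jobs = [[name] for name in names if file_sizes[name] >= big_threshold]
    # The remaining files are grouped into fixed-size batches (the `hog_rest` case).
    small = [name for name in names if file_sizes[name] < big_threshold]
    rest_jobs = [
        small[start : start + rest_batch_size]
        for start in range(0, len(small), rest_batch_size)
    ]
    return big_jobs, rest_jobs


# Hypothetical file names and sizes; the real threshold and batch size may differ.
sizes = {"a.fa": 5_000_000, "b.fa": 1200, "c.fa": 800, "d.fa": 950}
big, rest = batch_roothog_fastas(sizes, big_threshold=1_000_000, rest_batch_size=2)
print(big)
print(rest)
```

Giving each large family its own job while bundling the many small ones keeps the run time of the individual jobs roughly comparable.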
182224

183225
If the run is interrupted, adding `-resume` to the nextflow command line may let you continue your previous nextflow job.
184226

185227

186228
## expected output structure for test data
187229

188-
The output of FastOMA includes four files
189-
(`OrthologousGroupsFasta.tsv`, `rootHOGs.tsv`, `output_hog.orthoxml` and `species_tree_checked.nwk`) and four folders
190-
(`hogmap`, `OrthologousGroupsFasta`, `temp_pickles` and `temp_output`).
230+
The output of FastOMA includes several output files regarding orthology inference
231+
(`OrthologousGroups.tsv`, `RootHOGs.tsv`, `FastOMA_HOGs.orthoxml`, `orthologs.tsv.gz` and `species_tree_checked.nwk`),
232+
a jupyter notebook based report about the dataset (`report.ipynb` and `report.html`) and four folders
233+
(`hogmap`, `OrthologousGroupsFasta`, `RootHOGsFasta` and `stats`).
191234

192235
The `hogmap` folder includes the output of [OMAmer](https://github.com/DessimozLab/omamer); each file corresponds to an input proteome.
193236
The folder `OrthologousGroupsFasta` includes FASTA files, and all proteins inside each FASTA file are orthologous to each other.
194237
These could be used as gene markers for species tree inference with refined resolution, [more info](https://f1000research.com/articles/9-511).
195-
Note that Orthologous Groups are groups of strict orthologs, with at most 1 representative per species.
196-
Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.
238+
Note that OrthologousGroups are groups of strict orthologs, with at most 1 representative per species.
239+
Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level. The file
240+
`FastOMA_HOGs.orthoxml` contains all the nested groups in orthoxml format. The `RootHOGs.tsv` and `RootHOGsFasta/` files contain
241+
the groups at the deepest level.
197242

198243
So, the following files and folders should appear in `out_folder`, the folder passed as the output folder argument.
199244
```
200-
$ls out_folder
201-
hogmap OrthologousGroupsFasta OrthologousGroups.tsv output_hog.orthoxml rootHOGs.tsv species_tree_checked.nwk temp_output temp_pickles
202-
```
203-
among which `output_hog.orthoxml` is the final output in [orthoXML format](https://orthoxml.org/0.4/orthoxml_doc_v0.4.html). Its content looks like this
204-
205-
```
206-
<?xml version="1.0" ?>
207-
<orthoXML xmlns="http://orthoXML.org/2011/" origin="OMA" originVersion="Nov 2021" version="0.3">
208-
<species name="MYCGE" NCBITaxId="1">
209-
<database name="QFO database " version="2020">
210-
<genes>
211-
<gene id="1000000000" protId="sp|P47500|RF1_MYCGE"/>
212-
<gene id="1000000001" protId="sp|P13927|EFTU_MYCGE"/>
213-
<gene id="1000000002" protId="sp|P47639|ATPB_MYCGE"/>
245+
$tree out_folder
246+
├── FastOMA_HOGs.orthoxml
247+
├── hogmap
248+
│   ├── AQUAE.fa.hogmap
249+
│   ├── CHLTR.fa.hogmap
250+
│   └── MYCGE.fa.hogmap
251+
├── OrthologousGroupsFasta
252+
│   ├── OG_0000001.fa.gz
253+
│   ├── OG_0000002.fa.gz
254+
│   ├── OG_0000003.fa.gz
255+
│ ├ ...
256+
├── OrthologousGroups.tsv
257+
├── orthologs.tsv.gz
258+
├── phylostratigraphy.html
259+
├── report.html
260+
├── report.ipynb
261+
├── RootHOGsFasta
262+
│   ├── HOG:0000001.fa.gz
263+
│   ├── HOG:0000002.fa.gz
264+
│   ├── HOG:0000003.fa.gz
265+
│   ├ ...
266+
├── RootHOGs.tsv
267+
├── species_tree_checked.nwk
268+
└── stats
269+
├── pipeline_dag_<date>.html
270+
├── report_<date>.html
271+
├── timeline_<date>.html
272+
└── trace_<date>.txt
273+
```
274+
among which `FastOMA_HOGs.orthoxml` is the final output in [orthoXML format](https://orthoxml.org/0.4/orthoxml_doc_v0.4.html). Its content looks like this
275+
276+
```
277+
<?xml version='1.0' encoding='utf-8'?>
278+
<orthoXML xmlns="http://orthoXML.org/2011/" origin="FastOMA 0.1.6" originVersion="2024-01-10 17:36:45" version="0.5">
279+
<species name="MYCGE" taxonId="5" NCBITaxId="0">
280+
<database name="database" version="2023">
281+
<genes>
282+
<gene id="1000000001" protId="sp|P47500|RF1_MYCGE" />
283+
<gene id="1000000002" protId="sp|P13927|EFTU_MYCGE" />
284+
<gene id="1000000003" protId="sp|P47639|ATPB_MYCGE" />
214285
215286
...
216-
<orthologGroup id="HOG:B0885011_sub10003">
217-
<property name="TaxRange" value="inter1"/>
218-
<geneRef id="1002000004"/>
219-
<geneRef id="1001000004"/>
287+
<orthologGroup id="HOG:0000001_1" taxonId="1">
288+
<score id="CompletenessScore" value="1.0" />
289+
<property name="OMAmerRootHOG" value="HOG:D0900115" />
290+
<property name="TaxRange" value="inter2" />
291+
<geneRef id="1000000005" />
292+
<orthologGroup id="HOG:0000001_2" taxonId="2">
293+
<score id="CompletenessScore" value="1.0" />
294+
<property name="TaxRange" value="inter1" />
295+
<geneRef id="1002000010" />
296+
<geneRef id="1001000009" />
220297
</orthologGroup>
221-
</groups>
298+
</orthologGroup>
299+
</groups>
222300
</orthoXML>
223301
```
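For a quick look at the nested groups without PyHAM, the file can also be read with Python's standard `xml.etree.ElementTree`. The sketch below is an assumption-laden illustration: the embedded document is a shortened stand-in modelled on the excerpt above, and the species assignment of the gene ids as well as the `protId` values are placeholders, not real output.

```python
import xml.etree.ElementTree as ET

OX = "{http://orthoXML.org/2011/}"  # namespace used in the excerpt above

# Shortened stand-in document; protIds and species assignment are placeholders.
SAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" origin="FastOMA 0.1.6" version="0.5">
  <species name="MYCGE" taxonId="5" NCBITaxId="0">
    <database name="database" version="2023">
      <genes><gene id="1000000005" protId="placeholder_MYCGE" /></genes>
    </database>
  </species>
  <species name="AQUAE" taxonId="3" NCBITaxId="0">
    <database name="database" version="2023">
      <genes><gene id="1001000009" protId="placeholder_AQUAE" /></genes>
    </database>
  </species>
  <species name="CHLTR" taxonId="4" NCBITaxId="0">
    <database name="database" version="2023">
      <genes><gene id="1002000010" protId="placeholder_CHLTR" /></genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="HOG:0000001_1" taxonId="1">
      <property name="TaxRange" value="inter2" />
      <geneRef id="1000000005" />
      <orthologGroup id="HOG:0000001_2" taxonId="2">
        <property name="TaxRange" value="inter1" />
        <geneRef id="1002000010" />
        <geneRef id="1001000009" />
      </orthologGroup>
    </orthologGroup>
  </groups>
</orthoXML>
"""


def tax_range(group):
    """Return the TaxRange property set directly on a group, if any."""
    for prop in group.findall(OX + "property"):
        if prop.get("name") == "TaxRange":
            return prop.get("value")
    return None


def summarize_groups(xml_text):
    """List (group id, TaxRange, member protIds) for every orthologGroup."""
    root = ET.fromstring(xml_text)
    prot_ids = {gene.get("id"): gene.get("protId") for gene in root.iter(OX + "gene")}
    rows = []
    for group in root.iter(OX + "orthologGroup"):
        # iter() also descends into nested subgroups, so members are cumulative.
        members = [prot_ids[ref.get("id")] for ref in group.iter(OX + "geneRef")]
        rows.append((group.get("id"), tax_range(group), members))
    return rows


for row in summarize_groups(SAMPLE):
    print(row)
```

For a real run, read the text of `FastOMA_HOGs.orthoxml` from the output folder and pass it to `summarize_groups`; for large files PyHAM is likely the more convenient tool.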
224302

225303
If you are interested in a specific gene in a specific species and want to know the
226-
proteins that are in the gene family, you can find its protein ID in the file `rootHOGs.tsv` using grep.
227-
The first column of this file `rootHOGs.tsv` shows the rootHOG ID which could be searched on the [OMA browser](https://omabrowser.org/).
304+
proteins that are in the gene family, you can find its protein ID in the file `RootHOGs.tsv` using grep.
305+
The first column of this file `RootHOGs.tsv` shows the rootHOG ID which could be searched on the [OMA browser](https://omabrowser.org/).
228306
Note that some of the input genes might not appear in this file.
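The same lookup can be done in Python instead of grep. This is a minimal sketch under stated assumptions: only the fact that the first column holds the rootHOG ID comes from the text above, while the layout of the remaining columns and the example rows are invented for this illustration.

```python
def roothogs_containing(tsv_lines, protein_id):
    """Return the first-column rootHOG IDs of all lines that mention protein_id."""
    hits = []
    for line in tsv_lines:
        columns = line.rstrip("\n").split("\t")
        # Like grep, match the ID anywhere in the remaining columns.
        if any(protein_id in column for column in columns[1:]):
            hits.append(columns[0])
    return hits


# Hypothetical rows; the real column layout of RootHOGs.tsv may differ.
example_rows = [
    "HOG:0000001\tsp|P47500|RF1_MYCGE\n",
    "HOG:0000002\tsp|P13927|EFTU_MYCGE\n",
]
print(roothogs_containing(example_rows, "sp|P13927|EFTU_MYCGE"))
```

To search a real result, open `RootHOGs.tsv` from the output folder and pass the file handle in place of `example_rows`. An empty result is expected for genes that do not appear in the file.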
229307

230308
To find the list of genes that are orthologous to your gene of interest, you can search in the file `OrthologousGroups.tsv`
@@ -329,6 +407,7 @@ These are initial gene families that are used in `infer_subhogs` step, which cou
329407

330408

331409
## Change log
410+
- Update v0.1.6: adding dynamic resources, additional and improved output
332411
- Update v0.1.5: docker, add help, clean nextflow
333412
- Update v0.1.4: new gene families with linclust if mmseqs is installed, using quoted protein name to handle species chars, check input first
334413
- Update v0.1.3: merge rootHOGs and handle singleton using omamer multi-hits

environment-conda.yml

Lines changed: 8 additions & 1 deletion
@@ -7,6 +7,13 @@ dependencies:
77
- omamer
88
- mafft
99
- fasttree
10+
- nextflow
11+
- papermill
12+
- seaborn
13+
- matplotlib
14+
- pyparsing
15+
- networkx
16+
- jupyter
1017
- pip
1118
- pip:
12-
- .
19+
- .[report]
