Commit af12a55 (parent b56252b)
Author: Sven Twardziok
add ga4gh apis description

1 file changed: docs/general/ga4gh_cloud.md (148 additions, 0 deletions)
# GA4GH Cloud Workstream

The GA4GH (Global Alliance for Genomics and Health) cloud [APIs][ga4gh-cloud]
are a set of standard APIs that provide a common interface for accessing
genomic data and tools across different cloud providers. These APIs are
essential for enabling genomic data sharing and collaboration, and
implementations are available for major cloud platforms such as Google Cloud
Platform, Microsoft Azure, and Amazon Web Services. In this documentation,
we'll cover four main GA4GH APIs that you'll be using: the Workflow Execution
Service ([WES][ga4gh-wes]), the Task Execution Service ([TES][ga4gh-tes]),
the Data Repository Service ([DRS][ga4gh-drs]), and the Tool Registry Service
([TRS][ga4gh-trs]). The WES API allows you to define and execute workflows,
while the TES API allows you to execute individual tasks within those
workflows. The DRS API provides a way to access and download genomic data,
and the TRS API enables the discovery of genomic analysis tools.

Whether you are a bioinformatician or a data scientist, this site will
provide you with all the information you need to start using ELIXIR's GA4GH
cloud services ecosystem and harness the power of cloud computing for your
genomic data analysis needs. Let's get started!
## Task Execution Service (TES)

The GA4GH [TES][ga4gh-tes] specification is a standard interface that enables
interoperability between workflow management systems and execution engines.
The TES specification provides a uniform way to submit and monitor tasks on
any execution engine that implements the specification, allowing users to
easily switch between workflow management systems or execution engines
without rewriting their workflows. Typical use cases are:
- Scenario 1: A researcher wants to run a workflow locally. The workflow
  contains some resource-intensive steps, such as steps requiring GPUs or
  many cores. Using TES as a backend, the researcher can execute the workflow
  locally while sending the resource-intensive tasks to cloud servers for
  execution.
- Scenario 2: A researcher wants to run a workflow that processes data stored
  in several cloud locations. TES allows individual tasks to be sent to the
  compute locations associated with each storage location. This is relevant
  if the data provider does not allow files to be downloaded to a central
  location, or if downloading them is not technically feasible.
The TES specification defines an HTTP API for submitting and monitoring tasks
that includes endpoints for creating, querying, and canceling tasks. Tasks
are defined as JSON objects that include input and output files, a command to
execute, and any environment variables or resources required by the task.
Concerns such as task dependencies and retrying failed tasks are typically
handled by the workflow engine that submits tasks to TES. Popular TES
implementations are [Funnel][funnel] and [TESK][tesk].
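As a concrete illustration, a minimal TES task that runs a single command in a container could be assembled as in the following sketch (the container image, file URL, bucket, and endpoint are placeholders; the top-level field names follow the TES v1 task schema):

```python
import json

# Minimal TES task: run a single command in a container.
# Field names follow the GA4GH TES v1 task schema; the image,
# input URL, and bucket below are placeholders.
task = {
    "name": "md5sum-example",
    "inputs": [
        {"url": "s3://example-bucket/input.txt", "path": "/data/input.txt"}
    ],
    "executors": [
        {
            "image": "ubuntu:22.04",
            "command": ["md5sum", "/data/input.txt"],
        }
    ],
    "resources": {"cpu_cores": 1, "ram_gb": 1.0},
}

# The task would be submitted with an HTTP POST to the service, e.g.
#   POST https://tes.example.org/ga4gh/tes/v1/tasks
# which returns a JSON body containing the new task's "id".
payload = json.dumps(task)
print(payload)
```

The returned task ID can then be used with the query and cancel endpoints to monitor or stop the task.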

Several popular workflow management systems, including [cwl-tes][cwl-tes],
[Snakemake][snakemake], and [Nextflow][nextflow], have implemented the TES
specification, allowing users to easily run their workflows on any execution
engine that supports TES.

### Snakemake

Snakemake has supported TES v1.0 since version 5.28.0, as described in the
[Snakemake documentation][snakemake-docs]. Snakemake executes each TES task
as a separate Snakemake job that runs only one or a few rules. When using
TES, it is recommended to use additional remote storage for input and output
files. By default, Snakemake TES tasks are executed using the official
Snakemake container image in the same version as the original Snakemake call.
To use specific tools, conda environments should be specified in the rules. A
demo workflow is available [here][elixir-cloud-demo-smk].
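An invocation might look like the following sketch (the TES endpoint and the S3 bucket are placeholders; the `--tes` flag and the remote-storage flags are described in the Snakemake documentation linked above):

```bash
snakemake --tes https://tes.example.org \
    --use-conda \
    --default-remote-provider S3 \
    --default-remote-prefix example-bucket \
    --jobs 10
```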

### cwl-tes

A demo workflow is available [here][elixir-cloud-demo-cwl].
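Running a CWL workflow against a TES endpoint with cwl-tes is a single command (the endpoint URL and file names below are placeholders):

```bash
cwl-tes --tes https://tes.example.org workflow.cwl inputs.json
```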

### Nextflow

An article about Nextflow with GA4GH TES on Azure is available
[here](https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/introducing-nextflow-with-ga4gh-tes-a-new-era-of-scalable-data-processing-on-azu/4253160).

To use TES in your Nextflow config, enable the `nf-ga4gh` plugin:

```
plugins {
    id 'nf-ga4gh'
}
// select the TES executor and point it at a TES endpoint (placeholder URL)
process.executor = 'tes'
tes.endpoint = 'https://tes.example.org'
```

## Workflow Execution Service (WES)

The GA4GH [WES][ga4gh-wes] specification is a standard protocol for executing
and monitoring bioinformatics workflows. It allows researchers to easily
execute and manage complex analysis pipelines across multiple computing
platforms and institutions. The WES specification provides a unified API for
describing workflow inputs and outputs, monitoring job status and progress,
and managing data transfers. With this specification, users can build
scalable, reproducible, and interoperable genomics workflows, enabling
collaboration across institutions and improving data sharing. Two use cases
for the GA4GH WES specification are:
- Scenario 1: A researcher wants to analyze a large genomic dataset using a
  specific analysis pipeline. With the WES specification, the researcher can
  easily define the inputs and parameters for the pipeline, select a
  computing platform that meets their requirements, and submit the job for
  execution. They can then monitor the progress of the job and receive
  notifications when it is complete. This allows the researcher to focus on
  analyzing the results rather than managing the underlying infrastructure.

- Scenario 2: A clinical laboratory needs to process patient samples for
  genetic testing. The laboratory can use the WES specification to define the
  analysis pipeline and integrate it with its laboratory information
  management system (LIMS). This allows the laboratory to automate the
  processing of samples, reducing errors and turnaround time.
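Sketching scenario 1, the core of a WES run submission is a small set of form fields sent to the service's `runs` endpoint (the workflow URL and the input names inside `workflow_params` are placeholders; the top-level field names follow the WES v1 schema):

```python
import json

# Form fields for a WES "create run" request. WES expects these as
# multipart/form-data in a POST to <service>/ga4gh/wes/v1/runs.
# The workflow URL and input values are placeholders.
run_request = {
    "workflow_url": "https://example.org/workflows/variant-calling.cwl",
    "workflow_type": "CWL",
    "workflow_type_version": "v1.2",
    # workflow_params is itself a JSON document holding the inputs.
    "workflow_params": json.dumps(
        {"input_reads": "drs://drs.example.org/314159"}
    ),
}

# The service responds with a run_id; progress is then polled via
#   GET <service>/ga4gh/wes/v1/runs/<run_id>/status
print(sorted(run_request))
```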

## Data Repository Service (DRS)

The GA4GH [DRS][ga4gh-drs] API provides a standard set of data retrieval
methods to access genomic and related health data across different
repositories. It allows researchers to simplify and standardize data
retrieval in cloud-based environments. Key features include standardized data
access (a consistent API for retrieving datasets) and cloud-agnostic
operation (it works across different cloud infrastructures). Two use cases
for the GA4GH DRS are:
- Scenario 1: A researcher wants to run an analysis pipeline on a dataset
  without worrying about where the data physically resides. The researcher
  uses a DRS ID to request the dataset. DRS resolves the ID to the actual
  storage location and provides signed URLs or access tokens, and the
  pipeline retrieves the data seamlessly, regardless of the underlying cloud
  or storage system.

- Scenario 2: A pharmaceutical company is collaborating with hospitals to
  analyze patient genomic data. Due to privacy regulations, raw data cannot
  be moved outside the hospitals' secure environments. The hospitals can
  expose their datasets via DRS endpoints, and the pharmaceutical company's
  workflow engine queries DRS for metadata. Finally, the analysis is
  performed without violating data residency rules.
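The ID resolution in scenario 1 can be sketched for hostname-based DRS URIs, which the DRS specification maps to an HTTPS URL on the same host (the host and object ID below are placeholders):

```python
from urllib.parse import urlparse

# Resolve a hostname-based DRS URI, drs://<host>/<object_id>, to the
# corresponding DRS HTTP endpoint. The URI used below is a placeholder.
def drs_uri_to_url(drs_uri: str) -> str:
    parsed = urlparse(drs_uri)
    assert parsed.scheme == "drs"
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"

url = drs_uri_to_url("drs://drs.example.org/314159")
# A GET on this URL returns the object's metadata, including access
# methods from which a signed URL or access token can be obtained.
print(url)
```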

## Tool Registry Service (TRS)

The GA4GH [TRS][ga4gh-trs] API provides a standard mechanism to list, search,
and register tools and workflows across different platforms and cloud
environments. It supports workflows written in CWL, WDL, Nextflow, Galaxy,
and Snakemake. Here are examples of two use cases:
- Scenario 1: A bioinformatics researcher develops a workflow for variant
  calling using WDL and Docker containers. They want to share it with
  collaborators who use different platforms. With TRS, the researcher
  registers the workflow in a TRS-compliant registry like Dockstore. The
  collaborators can then discover the workflow via the TRS API and run it on
  their platform. TRS ensures that metadata, versioning, and containers are
  standardized and accessible.

- Scenario 2: A hospital's genomics lab uses an automated pipeline to analyze
  patient exome data for rare disease diagnosis. The pipeline queries a TRS
  registry to find the latest versions of tools (like VEP or GATK) and
  retrieves the workflow descriptors and container images. Finally, the
  pipeline executes the tools in a secure, compliant environment.
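The registry query in scenario 2 boils down to a GET on the TRS `tools` endpoint with search parameters (the registry host is a placeholder; `name` and `limit` are standard query parameters of the TRS v2 tools endpoint):

```python
from urllib.parse import urlencode

# Build a TRS v2 tool-search URL (GET /ga4gh/trs/v2/tools).
# The registry host is a placeholder; a real registry such as
# Dockstore exposes a compatible endpoint under its own domain.
base = "https://trs.example.org/ga4gh/trs/v2/tools"
params = {"name": "gatk", "limit": 10}
url = f"{base}?{urlencode(params)}"
# A GET on this URL returns a JSON list of matching tools, each with
# its versions and available descriptor types.
print(url)
```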
