|
| 1 | +# GA4GH Cloud Workstream |
| 2 | + |
| 3 | +The GA4GH (Global Alliance for Genomics and Health) cloud [APIs][ga4gh-cloud] |
| 4 | +are a set of standard APIs that provide a common interface for accessing |
| 5 | +genomic data and tools across different cloud providers. These APIs are |
| 6 | +essential for enabling genomic data sharing and collaboration, and they have |
| 7 | +been adopted by major cloud providers such as Google Cloud Platform, Microsoft |
| 8 | +Azure, and Amazon Web Services. In this documentation, we'll cover four main |
| 9 | +GA4GH APIs that you'll be using: the Workflow Execution Service |
| 10 | +([WES][ga4gh-wes]), the Task Execution Service ([TES][ga4gh-tes]), the Data |
| 11 | +Repository Service ([DRS][ga4gh-drs]), and the Tool Registry Service |
| 12 | +([TRS][ga4gh-trs]). The WES API allows you to define and execute workflows, |
| 13 | +while the TES API allows you to execute individual tasks within those |
| 14 | +workflows. The DRS API provides a way to access and download genomic data, and |
| 15 | +the TRS API enables the discovery of genomic analysis tools. |
| 16 | + |
| 17 | +Whether you are a bioinformatician or a data scientist, this site will |
| 18 | +provide you with all the information you need to start using ELIXIR's GA4GH |
| 19 | +cloud services ecosystem and harness the power of cloud computing for your |
| 20 | +genomic data analysis needs. Let's get started! |
| 21 | + |
| 22 | +## Task Execution Service (TES) |
| 23 | + |
| 24 | +The GA4GH [TES][ga4gh-tes] specification is a standard interface that enables |
| 25 | +interoperability between workflow management systems and execution engines. The |
| 26 | +TES specification provides a uniform way to submit and monitor tasks to any |
| 27 | +execution engine that implements the specification, allowing users to easily |
| 28 | +switch between workflow management systems or execution engines without |
| 29 | +rewriting their workflows. Typical use cases are: |
| 30 | + |
| 31 | +- Scenario 1: A researcher wants to run a workflow locally. The workflow |
| 32 | + contains some resource-intensive steps, such as requirements for GPUs or many |
| 33 | + cores. Using TES as a backend, the researcher can execute the workflow |
| 34 | + locally and also send the resource-intensive tasks to cloud servers for |
| 35 | + execution. |
| 36 | +- Scenario 2: A researcher wants to run a workflow that involves processing |
| 37 | + data that is stored in cloud locations. Using TES would allow individual |
| 38 | + tasks to be sent to the compute locations associated with each storage |
| 39 | + location. This may be relevant if the data provider does not allow files to |
| 40 | + be downloaded to a central location or if it is not technically feasible. |
| 41 | + |
| 42 | +The TES specification defines an HTTP API for submitting and monitoring tasks |
| 43 | +that includes endpoints for creating, listing, retrieving, and canceling tasks. |
| 44 | +Tasks are defined as JSON objects that include input and output files, a |
| 45 | +command to execute, and any environment variables or resources required by the |
| 46 | +task. Task dependencies and retries of failed tasks are not part of the TES |
| 47 | +specification; the submitting workflow management system handles them. Popular |
| 48 | +TES implementations are [Funnel][funnel] and [TESK][tesk]. |
| 49 | + |
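As a rough illustration of the task structure described above, the following Python sketch assembles a task as a plain dictionary and prints it as JSON. The helper name `build_tes_task`, the bucket URLs, and the container image are hypothetical placeholders; the field names (`inputs`, `outputs`, `resources`, `executors`) follow the TES task schema as we understand it, so check them against the specification version your server implements.

```python
import json


def build_tes_task(name, image, command, inputs=None, outputs=None,
                   cpu_cores=1, ram_gb=1.0, env=None, stdout=None):
    """Return a dict shaped like a TES task (hypothetical helper)."""
    executor = {"image": image, "command": list(command)}
    if env:
        executor["env"] = dict(env)
    if stdout:
        # Path inside the container where the command's stdout is written.
        executor["stdout"] = stdout
    return {
        "name": name,
        "inputs": list(inputs or []),
        "outputs": list(outputs or []),
        "resources": {"cpu_cores": cpu_cores, "ram_gb": ram_gb},
        "executors": [executor],
    }


# Placeholder storage URLs and image; replace with real values.
task = build_tes_task(
    name="md5sum-example",
    image="ubuntu:22.04",
    command=["md5sum", "/data/input.txt"],
    inputs=[{"url": "s3://example-bucket/input.txt", "path": "/data/input.txt"}],
    outputs=[{"url": "s3://example-bucket/md5.txt", "path": "/data/md5.txt"}],
    stdout="/data/md5.txt",
)

print(json.dumps(task, indent=2))
```

The resulting JSON document would be sent in the body of the task-creation request; the sketch deliberately stops before any network call so that it runs without a TES server.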
| 50 | +Several popular workflow management systems, including [cwl-tes][cwl-tes], |
| 51 | +[Snakemake][snakemake] and [Nextflow][nextflow], have implemented the TES |
| 52 | +specification, allowing users to easily run their workflows on any execution |
| 53 | +engine that supports TES. |
| 54 | + |
| 55 | +### Snakemake |
| 56 | + |
| 57 | +Snakemake has supported TES v1.0 since version 5.28.0, as described in the |
| 58 | +[Snakemake documentation][snakemake-docs]. Snakemake executes individual tasks |
| 59 | +as separate workflows that execute only one or a few rules. When using TES, it |
| 60 | +is recommended to use additional remote storage to store input and output |
| 61 | +files. By default, Snakemake TES tasks are executed using the official |
| 62 | +Snakemake container image in the same version as the original Snakemake call. |
| 63 | +To use specific tools, conda environments should be appended to the rules. A |
| 64 | +demo workflow is available |
| 65 | +[here][elixir-cloud-demo-smk]. |
| 66 | + |
| 67 | +### CWL-tes |
| 68 | + |
| 69 | +A demo workflow is available [here][elixir-cloud-demo-cwl]. |
| 70 | + |
| 71 | +### Nextflow |
| 72 | + |
| 73 | +You can find an article about Nextflow with GA4GH TES [here](https://techcommunity.microsoft.com/blog/healthcareandlifesciencesblog/introducing-nextflow-with-ga4gh-tes-a-new-era-of-scalable-data-processing-on-azu/4253160). |
| 74 | + |
| 75 | +To use TES in your Nextflow config, use the plugin `nf-ga4gh`: |
| 76 | + |
| 77 | +``` |
| 78 | +plugins { |
| 79 | + id 'nf-ga4gh' |
| 80 | +} |
| 81 | +``` |
| 82 | + |
| 83 | +## Workflow Execution Service (WES) |
| 84 | + |
| 85 | +The GA4GH [WES][ga4gh-wes] specification is a standard protocol for executing |
| 86 | +and monitoring bioinformatics workflows. It allows researchers to easily |
| 87 | +execute and manage complex analysis pipelines across multiple computing |
| 88 | +platforms and institutions. The WES specification provides a unified API for |
| 89 | +describing workflow inputs and outputs, monitoring job status and progress, and |
| 90 | +managing data transfers. With this specification, users can build scalable, |
| 91 | +reproducible, and interoperable genomics workflows, enabling collaboration |
| 92 | +across institutions and improving data sharing. Two use cases for the GA4GH WES |
| 93 | +specification are: |
| 94 | + |
| 95 | +- Scenario 1: A researcher wants to analyze a large dataset of genomic data |
| 96 | + using a specific analysis pipeline. With the WES specification, the |
| 97 | + researcher can easily define the inputs and parameters for the pipeline, |
| 98 | + select a computing platform that meets their requirements, and submit the job |
| 99 | + for execution. They can then monitor the progress of the job and receive |
| 100 | + notifications when the job is complete. This allows the researcher to focus |
| 101 | + on analyzing the results rather than managing the underlying infrastructure. |
| 102 | + |
| 103 | +- Scenario 2: A clinical laboratory needs to process patient samples for |
| 104 | + genetic testing. The laboratory can use the WES specification to define the |
| 105 | + analysis pipeline and integrate it with its laboratory information management |
| 106 | + system (LIMS). This lets the laboratory automate sample processing, reducing errors and turnaround time. |
| 107 | + |
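The submit-and-monitor flow in Scenario 1 can be sketched as follows. This is a minimal example that only prepares the request: it assumes the conventional `/ga4gh/wes/v1` path prefix and the form field names of the WES run request (`workflow_url`, `workflow_type`, `workflow_type_version`, `workflow_params`). The helper names, the server URL, and the workflow URL are hypothetical.

```python
import json


def build_wes_run_request(base_url, workflow_url, workflow_type,
                          workflow_type_version, workflow_params):
    """Return (url, form_fields) for submitting a run (hypothetical helper)."""
    url = base_url.rstrip("/") + "/ga4gh/wes/v1/runs"
    fields = {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # Workflow parameters are sent as a JSON-encoded string.
        "workflow_params": json.dumps(workflow_params),
    }
    return url, fields


def run_status_url(base_url, run_id):
    """Return the URL a client would poll for the state of a run."""
    return "{}/ga4gh/wes/v1/runs/{}/status".format(base_url.rstrip("/"), run_id)


# Placeholder server and workflow; replace with real values.
url, fields = build_wes_run_request(
    base_url="https://wes.example.org/",
    workflow_url="https://example.org/workflows/variant-calling.cwl",
    workflow_type="CWL",
    workflow_type_version="v1.0",
    workflow_params={"reads": "sample_1.fastq.gz"},
)

print(url)
print(fields)
print(run_status_url("https://wes.example.org/", "run-123"))
```

The fields would be sent as a multipart form in a `POST` request, and a client would then poll the status URL until the run reaches a terminal state.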
| 108 | +## Data Repository Service (DRS) |
| 109 | + |
| 110 | +The GA4GH [DRS][ga4gh-drs] API provides a standard set of data retrieval methods |
| 111 | +to access genomic and related health data across different repositories. |
| 112 | +It allows researchers to simplify and standardize data retrieval in cloud-based |
| 113 | +environments. Key features include standardized data access, which offers a |
| 114 | +consistent API for retrieving datasets, and cloud-agnostic operation across |
| 115 | +different cloud infrastructures. Two use cases for the GA4GH DRS API are: |
| 116 | + |
| 117 | +- Scenario 1: A researcher wants to run an analysis pipeline on a dataset without |
| 118 | + worrying about where the data physically resides. The researcher uses a DRS ID |
| 119 | + to request the dataset. DRS resolves the ID to the actual storage location and |
| 120 | + provides signed URLs or access tokens, so the pipeline retrieves the data |
| 121 | + seamlessly, regardless of the underlying cloud or storage system. |
| 122 | + |
| 123 | +- Scenario 2: A pharmaceutical company is collaborating with hospitals to analyze |
| 124 | + patient genomic data. Due to privacy regulations, raw data cannot be moved outside |
| 125 | + the hospital’s secure environment. The hospital can expose its datasets via DRS |
| 126 | + endpoints, and the pharmaceutical company's workflow engine queries DRS to get metadata. |
| 127 | + The analysis is then performed inside the hospital's environment, without violating data residency rules. |
| 128 | + |
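To make the ID resolution step in Scenario 1 more concrete, here is a small sketch that maps a hostname-based DRS URI to the HTTPS URL of the corresponding DRS object. It assumes the `drs://<hostname>/<id>` form and the `/ga4gh/drs/v1/objects/<id>` path; compact-identifier URIs, which need an external resolver, are out of scope. The function name and the example host are hypothetical.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_object_url(drs_uri):
    """Map a hostname-based DRS URI to its object URL (hypothetical helper)."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError("not a hostname-based DRS URI: {!r}".format(drs_uri))
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError("DRS URI has no object ID: {!r}".format(drs_uri))
    # Percent-encode the ID so it is safe to use as a single path segment.
    return "https://{}/ga4gh/drs/v1/objects/{}".format(
        parts.netloc, quote(object_id, safe="")
    )


print(drs_uri_to_object_url("drs://drs.example.org/314159"))
```

A client would fetch this URL to obtain the object's metadata, including the access methods from which a download URL can be derived.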
| 129 | +## Tool Registry Service (TRS) |
| 130 | + |
| 131 | +The GA4GH [TRS][ga4gh-trs] API provides a standard mechanism to list, search and |
| 132 | +register tools and workflows across different platforms and cloud environments. |
| 133 | +It supports workflows written in CWL, WDL, Nextflow, Galaxy, and Snakemake. |
| 134 | +Here are examples of two use cases: |
| 135 | + |
| 136 | +- Scenario 1: A bioinformatics researcher develops a workflow for variant calling |
| 137 | + using WDL and Docker containers. They want to share it with collaborators who use |
| 138 | + different platforms. With TRS, the researcher registers the workflow in a |
| 139 | + TRS-compliant registry like Dockstore. The collaborators can discover the workflow |
| 140 | + via the TRS API and run it on their platform. |
| 141 | + TRS ensures that metadata, versioning, and containers are standardized and |
| 142 | + accessible. |
| 143 | + |
| 144 | +- Scenario 2: A hospital’s genomics lab uses an automated pipeline to analyze patient |
| 145 | + exome data for rare disease diagnosis. The pipeline queries a TRS registry to find |
| 146 | + the latest version of tools (like VEP or GATK) and retrieves the workflow descriptor |
| 147 | + and container images. Finally, the pipeline executes the tools in a secure, |
| 148 | + compliant environment. |
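
Retrieving a workflow descriptor, as in the scenarios above, could look like the following sketch. It only builds the request URL, assuming the TRS v2 path layout (`/ga4gh/trs/v2/tools/{id}/versions/{version_id}/{type}/descriptor`) and percent-encoding of IDs that contain slashes. The function name, registry host, and tool ID are hypothetical.

```python
from urllib.parse import quote


def trs_descriptor_url(base_url, tool_id, version_id, descriptor_type):
    """Build the URL of a tool version's descriptor (hypothetical helper)."""
    return "{}/ga4gh/trs/v2/tools/{}/versions/{}/{}/descriptor".format(
        base_url.rstrip("/"),
        # IDs may contain '/' or '#', so encode them as single path segments.
        quote(tool_id, safe=""),
        quote(version_id, safe=""),
        quote(descriptor_type, safe=""),
    )


# Placeholder registry and tool ID; replace with real values.
print(
    trs_descriptor_url(
        "https://trs.example.org",
        "#workflow/github.com/example-org/variant-calling",
        "v1.0.0",
        "WDL",
    )
)
```

Fetching the resulting URL with an HTTP `GET` would return the descriptor, which a workflow engine can then execute.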