Commit 21d80f5

Update README.md
Updated README to be GCP-specific
1 parent 6bbddb2 commit 21d80f5

1 file changed

Lines changed: 39 additions & 32 deletions

GoogleCloud/README.md

Original file line number | Diff line number | Diff line change
@@ -1,8 +1,3 @@
1-
![course card](images/MDI-course-card-2.png)
2-
3-
# MDI Biological Laboratory RNA-seq Transcriptome Assembly Module
4-
---------------------------------
5-
61
## Contents
72

83
+ [Overview](#overview)
@@ -15,7 +10,7 @@
1510
+ [License for Data](#license-for-data)
1611

1712
## Overview
18-
This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with a Cloud Computing Platform using a Nextflow pipeline. In addition to the overview given in this README, you will find README related to each platform (AWS, Google Cloud) and Jupyter notebooks that teach you different components of RNA-seq in the cloud.
13+
This module teaches you how to perform a short-read RNA-seq transcriptome assembly on Google Cloud Platform (GCP) using a Nextflow pipeline. In addition to the overview given in this README, you will find a glossary and three Jupyter notebooks that take you from the basics of the workflow to running large datasets with Google Batch in the cloud. To use this module, clone the parent repository with `git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git` and then navigate to the directory for this project.
1914

2015
## Learning goals:
2116
1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data.
@@ -42,13 +37,6 @@ Image Source: https://github.com/PalMuc/TransPi/blob/master/README.md
4237

4338
Explanation of which notebooks execute which processes:
4439

45-
+ Notebooks labeled 0 ([Submodule_00_Background.ipynb](./Submodule_00_Background.ipynb) and [00_Glossary.md](./00_Glossary.md)) respectively cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing.
46-
+ Notebook 1 ([Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb)) is used for setting up the environment. It should only need to be run once per machine. (Note that our version of TransPi does not run the `precheck script`. To avoid the headache and wasted time, we have developed a workaround to skip that step.)
47-
+ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow TransPi assembly workflow on a modest sequence set, producing a small transcriptome.
48-
+ Notebook 3 ([Submodule_03_annotation_only.ipynb](./Submodule_03_annotation_only.ipynb)) carries out an annotation-only run using a prebuilt, but more complete transcriptome.
49-
+ Notebook 4 ([Submodule_04_google_batch_assembly.ipynb](./Submodule_04_google_batch_assembly.ipynb)) carries out the workflow using the Google Batch API.
50-
+ Notebook 5 ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) is a more hands-off notebook to test basic skills taught in this module.
51-
5240
## **Data**
5341
The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud Storage bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.
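The bucket path above can be split into a bucket name and an object prefix before listing its contents. The following is a minimal sketch, not part of the module itself: the helper names `split_gcs_uri` and `list_command` are hypothetical, and the sketch only builds the `gsutil ls` command rather than running it.

```python
from urllib.parse import urlparse

# Location of the sub-sampled data, as given in this README.
SEQ2_URI = "gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2"


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_command(uri: str) -> list[str]:
    """Build (but do not run) a gsutil command that lists the given path."""
    split_gcs_uri(uri)  # raises ValueError if the URI is malformed
    return ["gsutil", "ls", uri]


if __name__ == "__main__":
    bucket, prefix = split_gcs_uri(SEQ2_URI)
    print("bucket:", bucket)
    print("prefix:", prefix)
    print("command:", " ".join(list_command(SEQ2_URI)))
```

In a notebook code cell, the equivalent listing would be run with a leading `!`, for example `!gsutil ls gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.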
5442

@@ -61,12 +49,45 @@ Additional datasets for demonstration of the annotation features of TransPi were
6149
- Originally generated by **Wang J et al., 2016**, **Al-Tobasei R et al., 2016**, and **Salem M et al., 2015**.
6250
- Pseudacris regilla
6351
- Bioproject: [**PRJNA163143**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA163143)
64-
- Originally generated by **Laura Robertson, USGS**.
52+
- Originally generated by **Laura Robertson, USGS**.
53+
54+
## **Before Starting**
55+
These tutorials were designed to be used on Google Cloud Platform (GCP), with the aim of requiring nothing but the files within this GitHub repository. However, you do need to set up your Google account to access GCP and the Vertex AI Workbench to use the notebooks. Complete the following steps before getting started:
56+
- Set up a Google Cloud account
57+
- Create a project
58+
- Enable billing
59+
- Enable APIs (Compute Engine API, Cloud Storage API, Batch API)
60+
- Create a Nextflow service account (only needed for tutorial 4)
61+
- Create a Cloud Storage bucket ([details](https://cloud.google.com/storage/docs/creating-buckets))
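Some of the steps above can also be done from a terminal with the `gcloud` CLI. The sketch below only assembles the commands so you can review them before running anything. The function name `setup_commands`, the example project and bucket names, the default location, and the service identifiers are assumptions for illustration; check them against your own project and the linked instructions.

```python
import shlex

# Service identifiers assumed to correspond to the APIs listed above;
# verify them in the API Library of your project before enabling.
REQUIRED_SERVICES = [
    "compute.googleapis.com",
    "storage.googleapis.com",
    "batch.googleapis.com",
]


def setup_commands(project_id: str, bucket_name: str,
                   location: str = "us-central1") -> list[list[str]]:
    """Return gcloud invocations that select a project, enable the APIs,
    and create a Cloud Storage bucket. Nothing is executed here."""
    return [
        ["gcloud", "config", "set", "project", project_id],
        ["gcloud", "services", "enable", *REQUIRED_SERVICES],
        ["gcloud", "storage", "buckets", "create",
         f"gs://{bucket_name}", f"--location={location}"],
    ]


if __name__ == "__main__":
    # Placeholder names; replace with your own project ID and bucket name.
    for argv in setup_commands("my-project-id", "my-transcriptome-bucket"):
        print(shlex.join(argv))
```

Printing the commands first keeps the sketch free of side effects; billing setup and service account creation are not covered here and still follow the linked documents.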
62+
63+
More detailed instructions for the above steps can be found [here](docs/Before_beginning.md). You can also refer to the [NIH Cloud Lab README](https://github.com/STRIDES/NIHCloudLabGCP) for further instructions.
64+
65+
## **Getting Started**
66+
67+
This repository contains several notebook files that serve as RNA-seq transcriptome assembly workflow tutorials. The following steps guide you through setting up a virtual machine on Google Cloud Platform, downloading our tutorial files, and running those files.
68+
69+
### Optional: Creating a Nextflow Service Account
70+
If you are using Nextflow outside of NIH Cloud Lab, you must set up a service account and add it to your notebook permissions before creating the notebook. Follow section 2 of the accompanying [How To document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH Cloud Lab account, your default Compute Engine service account will have all required IAM roles to run the Nextflow portion.
71+
72+
### Creating a notebook instance
73+
74+
Follow the steps highlighted [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateVertexAINotebooks.md) to create a new notebook instance in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 8. In step 7, in the Machine type tab, select n1-standard-4 from the dropdown box.
75+
76+
### Download the tutorials
77+
78+
To clone this repository, run the Git command `git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git` using the dropdown menu option in Jupyter. Please make sure you only enter the link for the repository that you want to clone. Other bioinformatics-related learning modules are available in the [NIGMS Repository](https://github.com/NIGMS).
6579

66-
The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) uses an additional dataset pulled from the SRA database. We are using the RNA-seq reads only and have subsampled and merged them to a collective 2 million reads. This is not a good idea for real analysis, but was done to reduce the costs and runtime. These files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.
67-
- Apis mellifera
68-
- Bioproject: [**PRJNA274674**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674)
69-
- Originally generated by **Galbraith DA et al., 2015**.
80+
### Running Tutorial Files
81+
82+
All our tutorial workflows are in [Jupyter notebook](https://docs.jupyter.org/en/latest/ "Jupyter notebook documentation") format. To run these notebooks (.ipynb), double-click a tutorial file to open it in Jupyter. From there you can run each section, or 'cell', of the code one by one by pushing the 'Play' button in the menu above.
83+
84+
Some 'cells' of code take longer for the computer to process than others. You will know a cell is running when it has an asterisk next to it **[*]**. When the cell finishes running, that asterisk is replaced with a number that represents the order in which the cell was run.
85+
86+
You can now explore the tutorials by running the code in each, from top to bottom. Look at the [Overview](#overview) section for a short description of each tutorial.
87+
88+
### Stopping Your Virtual Machine
89+
90+
When you are finished running code, turn off your virtual machine to prevent unneeded billing or resource use by selecting your notebook instance and clicking the **Stop** button.
7091

7192
## **Troubleshooting**
7293
- If a quiz is not rendering:
@@ -78,17 +99,3 @@ The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_No
7899
- If you are unable to create a bucket using the `gsutil mb` command, check your `nextflow-service-account` roles. Make sure that you have `Storage Admin` added.
79100
- If you are trying to execute a terminal command in a Jupyter code cell and it is not working, make sure that you have an `!` before the command.
80101
- e.g., `mkdir example-1` -> `!mkdir example-1`
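The `!` is needed because a Jupyter code cell is interpreted as Python, and the prefix hands the rest of the line to the system shell. As a rough illustration of what that hand-off amounts to, the sketch below runs the same `mkdir` command from plain Python with the standard `subprocess` module. The helper name `run_shell_command` is hypothetical, and the sketch assumes a Unix-like machine where `mkdir` is available.

```python
import subprocess
import tempfile
from pathlib import Path


def run_shell_command(argv: list[str], cwd: str) -> int:
    """Run an external command in the given directory and return its exit code.

    This is roughly what a `!command` line in a notebook cell does, except
    that `!` passes the text through a shell first.
    """
    completed = subprocess.run(argv, cwd=cwd, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Work in a throwaway directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as workdir:
        exit_code = run_shell_command(["mkdir", "example-1"], cwd=workdir)
        print("exit code:", exit_code)
        print("directory created:", (Path(workdir) / "example-1").is_dir())
```

Inside the tutorials, the `!` form remains the simpler choice; the Python form is mainly useful when you need the exit code or want to reuse the command in a script.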
81-
82-
## **Funding**
83-
84-
MDIBL Computational Biology Core efforts are supported by two Institutional Development Awards (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant numbers P20GM103423 and P20GM104318.
85-
86-
## **License for Data**
87-
88-
Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available [here](https://tilburgsciencehub.com/about).
89-
90-
![Creative commons license](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
91-
92-
This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/)
93-
94-
The TransPi Nextflow workflow was developed and released by Ramon Rivera and can be obtained from its [GitHub repository](https://github.com/PalMuc/TransPi)
