Update README.md

jghanaim04 · web-flow · commit 21d80f58e5fa · 2025-05-06T14:51:49.000-04:00
Updated README to be GCP-specific
diff --git a/GoogleCloud/README.md b/GoogleCloud/README.md
@@ -1,8 +1,3 @@
-![course card](images/MDI-course-card-2.png)
-
-# MDI Biological Laboratory RNA-seq Transcriptome Assembly Module
----------------------------------
-
 ## Contents
 
 + [Overview](#overview)
@@ -15,7 +10,7 @@
 + [License for Data](#license-for-data)
 
 ## Overview
-This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly with a Cloud Computing Platform using a Nextflow pipeline. In addition to the overview given in this README, you will find README related to each platform (AWS, Google Cloud) and Jupyter notebooks that teach you different components of RNA-seq in the cloud. 
+This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly on Google Cloud Platform (GCP) using a Nextflow pipeline. In addition to the overview given in this README, you will find a glossary and three Jupyter notebooks that take you from the basics of the workflow to running large datasets using Google Batch in the cloud. To use this module, clone the parent repository with `git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git` and then navigate to the directory for this project.
 
 ## Learning goals:
 1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data.
@@ -42,13 +37,6 @@ Image Source: https://github.com/PalMuc/TransPi/blob/master/README.md
 
 Explanation of which notebooks execute which processes:
 
-+ Notebooks labeled 0 ([Submodule_00_Background.ipynb](./Submodule_00_Background.ipynb) and [00_Glossary.md](./00_Glossary.md)) respectively cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing.
-+ Notebook 1 ([Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb)) is used for setting up the environment. It should only need to be run once per machine. (Note that our version of TransPi does not run the `precheck script`. To avoid the headache and wasted time, we have developed a workaround to skip that step.)
-+ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow TransPi assembly workflow on a modest sequence set, producing a small transcriptome.
-+ Notebook 3 ([Submodule_03_annotation_only.ipynb](./Submodule_03_annotation_only.ipynb)) carries out an annotation-only run using a prebuilt, but more complete transcriptome.
-+ Notebook 4 ([Submodule_04_google_batch_assembly.ipynb](./Submodule_04_google_batch_assembly.ipynb)) carries out the workflow using the Google Batch API.
-+ Notebook 5 ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) is a more hands-off notebook to test basic skills taught in this module.
-
 ## **Data** 
 The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud Storage bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.
 
@@ -61,12 +49,45 @@ Additional datasets for demonstration of the annotation features of TransPi were
     - Originally generated by **Wang J et al., 2016**, **Al-Tobasei R et al., 2016**, and **Salem M et al., 2015**. 
 - Pseudacris regilla
     - Bioproject: [**PRJNA163143**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA163143)
-    - Originally generated by **Laura Robertson, USGS**. 
+    - Originally generated by **Laura Robertson, USGS**.
+ 
+## **Before Starting**
+These tutorials were designed to be used on Google Cloud Platform (GCP), with the aim of requiring nothing but the files within this GitHub repository. However, you do need to set up your Google account to access GCP and the Vertex AI Workbench to use the notebooks. Complete the following steps before getting started:
+- Set up a Google Cloud account
+- Create a project
+- Enable billing
+- Enable APIs (Compute Engine API, Cloud Storage API, Google Batch)
+- Create a Nextflow service account (only needed for tutorial 4)
+- Create a Cloud Storage bucket ([details](https://cloud.google.com/storage/docs/creating-buckets))
+
+More detailed instructions for the above steps can be found [here](docs/Before_beginning.md). You can also refer to the [NIH Cloud Lab README](https://github.com/STRIDES/NIHCloudLabGCP) for more instructions.
+
+## **Getting Started**
+
+This repository contains several notebook files which serve as bioinformatics transcriptome assembly workflow tutorials. To view these notebooks on GCP, follow the steps below to set up a virtual machine on Google Cloud Platform, download our tutorial files, and run those files.
+
+### Optional: Creating a Nextflow Service Account
+If you are using Nextflow outside of NIH Cloud Lab, you must set up a service account and add it to your notebook permissions before creating the notebook. Follow section 2 of the accompanying [How To document](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateNextflowServiceAccount.md) for instructions. If you are executing this tutorial with an NIH Cloud Lab account, your default Compute Engine service account will have all required IAM roles to run the Nextflow portion.
+
+### Creating a notebook instance 
+
+Follow the steps highlighted [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateVertexAINotebooks.md) to create a new notebook instance in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 8. In step 7, in the Machine type tab, select n1-standard-4 from the dropdown box.
+
+### Download the tutorials
+
+To clone this repository, use the Git command `git clone https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git` in the dropdown menu option in Jupyter notebook. Please make sure you enter only the link for the repository that you want to clone. Other bioinformatics-related learning modules are available in the [NIGMS Repository](https://github.com/NIGMS).
 
-The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) uses an additional dataset pulled from the SRA database. We are using the RNA-seq reads only and have subsampled and merged them to a collective 2 million reads. This is not a good idea for real analysis, but was done to reduce the costs and runtime. These files are avalible in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.
-- Apis mellifera
-    - Bioproject: [**PRJNA274674**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674)
-    - Originally generated by **Galbraith DA et al., 2015**.
+### Running Tutorial Files
+
+All our tutorial workflows are in [Jupyter notebook](https://docs.jupyter.org/en/latest/ "Jupyter notebook documentation") format. To run these notebooks (.ipynb), you need only double-click the tutorial files, which opens them in Jupyter notebook. From here you can run each section, or 'cell', of the code, one by one, by pushing the 'Play' button in the menu above.
+
+Some 'cells' of code take longer for the computer to process than others. You will know a cell is running when it has an asterisk next to it **[*]**. When the cell finishes running, that asterisk will be replaced with a number that represents the order in which the cell was run.
+
+You can now explore the tutorials by running the code in each, from top to bottom. Look at the [Overview](#overview) section for a short description of each tutorial.
+
+### Stopping Your Virtual Machine
+
+When you are finished running code, you can turn off your virtual machine to prevent unneeded billing or resource use by checking your notebook and clicking the **Stop** button.
 
 ## **Troubleshooting**
 - If a quiz is not rendering:
@@ -78,17 +99,3 @@ The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_No
 - If you are unable to create a bucket using the `gsutil mb` command, check your `nextflow-service-account` roles. Make sure that you have `Storage Admin` added.
 - If you are trying to execute a terminal command in a Jupyter code cell and it is not working, make sure that you have an `!` before the command.
     - e.g., `mkdir example-1` -> `!mkdir example-1`
-
-## **Funding** 
-
-MDIBL Computational Biology Core efforts are supported by two Institutional Development Awards (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant numbers P20GM103423 and P20GM104318.
-
-## **License for Data** 
-
-Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available [here](https://tilburgsciencehub.com/about).
-
-![Creative commons license](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
-
-This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/)  
-
-The TransPi Nextflow workflow was developed and released by Ramon Rivera and can be obtained from its [GitHub repository](https://github.com/PalMuc/TransPi)