Commit 90118c1

Update README.md
Generalized for AWS/GCP
1 parent 8e04ff2 commit 90118c1

1 file changed

Lines changed: 11 additions & 36 deletions

README.md

@@ -10,7 +10,7 @@
 + [License for Data](#license-for-data)
 
 ## Overview
-This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly on Amazon Web Services (AWS) using a Nextflow pipeline.
+This module teaches you how to perform a short-read RNA-seq Transcriptome Assembly using a Nextflow pipeline.
 
 ## Learning goals:
 1. From a *biological perspective*, demonstration of the **process of transcriptome assembly** from raw RNA-seq data.
@@ -37,17 +37,14 @@ Image Source: https://github.com/PalMuc/TransPi/blob/master/README.md
 
 Explanation of which notebooks execute which processes:
 
-+ Notebooks labeled 0 ([Submodule_00_Background.ipynb](./Submodule_00_Background.ipynb) and [00_Glossary.md](./00_Glossary.md)) respectively cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing.
-+ Notebook 1 ([Submodule_01_prog_setup.ipynb](./Submodule_01_prog_setup.ipynb)) is used for setting up the environment. It should only need to be run once per machine. (Note that our version of TransPi does not run the `precheck script`. To avoid the headache and wasted time, we have developed a workaround to skip that step.)
-+ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow TransPi assembly workflow on a modest sequence set, producing a small transcriptome.
++ Notebooks labeled 0 and 1 ([Submodule_0_Glossary.md](./Submodule_0_Glossary.md) and [Submodule_1_background.ipynb](./Submodule_1_background.ipynb)) cover background materials and provide a centralized glossary for both the biological problem of transcriptome assembly, as well as an introduction to workflows and container-based computing.
++ Notebook 2 ([Submodule_02_basic_assembly.ipynb](./Submodule_02_basic_assembly.ipynb)) carries out a complete run of the Nextflow assembly workflow on a modest sequence set, producing a small transcriptome.
 + Notebook 3 ([Submodule_03_annotation_only.ipynb](./Submodule_03_annotation_only.ipynb)) carries out an annotation-only run using a prebuilt, but more complete transcriptome.
-+ Notebook 4 ([Submodule_04_google_batch_assembly.ipynb](./Submodule_04_google_batch_assembly.ipynb)) carries out the workflow using the Google Batch API.
-+ Notebook 5 ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) is a more hands-off notebook to test basic skills taught in this module.
 
 ## **Data**
-The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. <mark>We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud Storage bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`</mark>.
+The test dataset used in the majority of this module is a downsampled version of a dataset that can be obtained in its complete form from the SRA database (Bioproject [**PRJNA318296**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA318296), GEO Accession [**GSE80221**](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80221)). The data was originally generated by **Hartig et al., 2016**. We downsampled the data files in order to streamline the performance of the tutorials and stored them in a Google Cloud/S3 bucket. The sub-sampled data, in individual sample files as well as a concatenated version of these files are available in our Google Cloud/S3 bucket.
 
-Additional datasets for demonstration of the annotation features of TransPi were obtained from the NCBI Transcriptome Shotgun Assembly archive. <mark>These files can be found in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/trans`</mark>.
+Additional datasets for demonstration of the annotation features were obtained from the NCBI Transcriptome Shotgun Assembly archive. These files can be found in our Google Cloud/S3 bucket.
 - Microcaecilia dermatophaga
 - Bioproject: [**PRJNA387587**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA387587)
 - Originally generated by **Torres-Sánchez M et al., 2019**.
@@ -57,39 +54,17 @@ Additional datasets for demonstration of the annotation features of TransPi were
 - Pseudacris regilla
 - Bioproject: [**PRJNA163143**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA163143)
 - Originally generated by **Laura Robertson, USGS**.
-
-The final submodule ([Submodule_05_Bonus_Notebook.ipynb](./Submodule_05_Bonus_Notebook.ipynb)) uses an additional dataset pulled from the SRA database. We are using the RNA-seq reads only and have subsampled and merged them to a collective 2 million reads. This is not a good idea for real analysis, but was done to reduce the costs and runtime. These files are avalible in our Google Cloud Storage bucket at `gs://nigms-sandbox/nosi-inbremaine-storage/resources/seq2`.
-- Apis mellifera
-- Bioproject: [**PRJNA274674**](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA274674)
-- Originally generated by **Galbraith DA et al., 2015**.
 
-## **Getting Started**
-
-This repository contains several Jupyter notebook files which serve as bioinformatics WGBS workflow tutorials. To view these notebooks on AWS, the following steps will guide you through setting up a notebook instance on SageMaker AI, downloading our tutorial files, and running those files.
-
-### Creating a notebook instance
-
-**1)** Follow the steps highlighted [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md) to create a new notebook instance in Amazon SageMaker. Follow steps and be especially careful to enable idle shutdown as highlighted. For this module, in [step 4](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md) in the "Notebook instance type" tab, select ml.m5.xlarge from the dropdown box. Select conda_python3 kernel in [step 8](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md).
+## **Funding**
 
-**2)** You will need to download the tutorial files from GitHub. The easiest way to do this would be to clone the repository from NIGMS into your Amazon SageMaker notebook. To clone this repository, use the Git symbole on left menu and then insert the link `https://github.com/NIGMS/Transcriptome-Assembly-Refinement-and-Applications.git` as it is illustrated in [step 7](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md). Please make sure you only enter the link for the repository that you want to clone. There are other bioinformatics related learning modules available in the [NIGMS Repository](https://github.com/NIGMS). This will download our tutorial files into a folder called `Transcriptome-Assembly-Refinement-and-Applications`.
+Funded by NIH/NIGMS P20GM103466.
 
-### Running Tutorial Files
+## **License for Data**
 
-All our tutorial workflows are in [Jupyter notebook](https://docs.jupyter.org/en/latest/ "Juypter notebook documentation") format. To run these notebooks (.ipynb) you need only to double-click the tutorial files and this will open the Jupyter file in Jupyter notebook. From here you can run each section, or 'cell', of the code, one by one, by pushing the 'Play' button on the above menu.
+Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available [here](https://tilburgsciencehub.com/about).
 
-Some 'cells' of code take longer for the computer to process than others. You will know a cell is running when a cell has an asterisk next to it **[*]**. When the cell finishes running, that asterisk will be replaced with a number which represents the order that cell was run in.
+![Creative commons license](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
 
-### Stopping Your Notebook
+This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/)
 
-Make sure that after you are done with the module, close the tab that appeared when you clicked **OPEN JUPYTERLAB**, then check the box next to the name of the notebook you created in [step 3](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToCreateAWSSagemakerNotebooks.md). Then click on **STOP** at the top of the Workbench menu. Wait and make sure that the icon next to your notebook is grayed out.
 
-## **Troubleshooting**
-- If a quiz is not rendering:
-- Make sure the `pip install` cell was executed in Submodule 00.
-- Try re-executing `from jupytercards import display_flashcards` or `from jupyterquiz import display_quiz` depending on the quiz type.
-- If a file/directory is not able to be found, make sure that you are in the right directory. If the notebook is idle for a long time, gets reloaded, or restarted, you will need to re-run Step 1 of the notebook. (`%cd /home/jupyter`)
-- Sometimes, Nextflow will print `WARN:` followed by the warning. These are okay and should not produce any errors.
-- Sometimes Nextflow will print `Waiting for file transfers to complete`. This may take a few minutes, but is nothing to worry about.
-- If you are unable to create a bucket using the `gsutil mb` command, check your `nextflow-service-account` roles. Make sure that you have `Storage Admin` added.
-- If you are trying to execute a terminal command in a Jupyter code cell and it is not working, make sure that you have an `!` before the command.
-- e.g., `mkdir example-1` -> `!mkdir example-1`
