|
208 | 208 | "The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output." |
209 | 209 | ] |
210 | 210 | }, |
211 | | - { |
212 | | - "cell_type": "markdown", |
213 | | - "id": "3c8c137c", |
214 | | - "metadata": {}, |
215 | | - "source": [ |
216 | | - "---\n", |
217 | | - "# Andrea, please update the rest for result" |
218 | | - ] |
219 | | - }, |
220 | 211 | { |
221 | 212 | "cell_type": "markdown", |
222 | 213 | "id": "92ed51eb", |
|
232 | 223 | "metadata": {}, |
233 | 224 | "outputs": [], |
234 | 225 | "source": [ |
235 | | - "! gsutil ls s3://<YOUR-BUCKET-NAME>/<Your-Output-Directory>/" |
| 226 | + "! gsutil ls gs://<YOUR-BUCKET-NAME>/<Your-Output-Directory>/" |
236 | 227 | ] |
237 | 228 | }, |
238 | 229 | { |
|
241 | 232 | "metadata": {}, |
242 | 233 | "source": [ |
243 | 234 | "## Investigation and Exploration: Assembly and Annotation Results\n", |
244 | | - "The use of an established and complex multi-step workflow (such as the TransPi workflow that you just ran) has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself. It also is highly reproducible, given the same input data and parameters.\n", |
| 235 | + "The use of an established and complex multi-step workflow (such as the de novo transcriptome assembly workflow that you just ran) has the benefit of saving you a lot of manual effort in setting up and carrying out the individual steps yourself. It also is highly reproducible, given the same input data and parameters.\n", |
245 | 236 | "\n", |
246 | 237 |     "It does, however, generate a lot of output, and it is beyond the scope of this training exercise to go through all of it in detail. We recommend that you download the complete results directory onto another machine or storage so that you can view it at your convenience, and on a less expensive machine than you are using to run this tutorial. *If you would like to proceed with the data in its current location, that also works; just bear in mind that it will cost roughly $0.72 per hour.*\n",
247 | 238 | "\n", |
|
252 | 243 | "\n", |
253 | 244 | ">Here are two possible options to access the results files outside of this expensive JupyterLab instance. \n", |
254 | 245 |     ">- If you instead have an external machine that accepts ssh connections, then you can use the secure copy scp command: `!scp -r ./basicRun/output YOUR_USERID@YOUR.MACHINE:/PATH/ON/REMOTE`\n",
255 | | - ">- If you have a Google Cloud Storage bucket available, you can use the gsutil command: `!gsutil -m cp -r ./basicRun/output gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output` to place all of your results into that bucket. \n", |
| 246 | + ">- If you have a Google Cloud Storage bucket available, you can use the gsutil command: `!gsutil -m cp -r ./basicRun/output gs://<YOUR-BUCKET-NAME-HERE>/<Your-Output-Directory>` to place all of your results into that bucket. \n", |
256 | 247 | "> - From there you have two options: \n", |
257 | | - "> 1. (Recommended) You could create a new (cheaper) Vertex AI instance (or use an old one) and copy the files down into that new directory using the following gsutil command:`!gsutil -m cp -r gs://YOUR-BUCKET-NAME-HERE/transpi-basicrun-output ./`\n", |
| 248 | +     ">    1. (Recommended) You could create a new (cheaper) Vertex AI instance (or use an old one) and copy the files down into that new directory using the following gsutil command: `!gsutil -m cp -r gs://<YOUR-BUCKET-NAME-HERE>/<Your-Output-Directory> ./`\n",
258 | 249 | "> 2. You could navigate to the bucket through the Google Cloud console and open the files through the links labeled `Authenticated URL`\n", |
259 | 250 | ">\n", |
260 | | - ">**In all of the commands above, you will need to edit the All-Caps part to match your own bucket or machine.**\n", |
| 251 | +     ">**In all of the commands above, you will need to edit the All-Caps placeholders, including those within `<>`, to match your own bucket or machine.**\n",
261 | 252 | "\n", |
262 | 253 | "<div class=\"alert alert-block alert-info\">\n", |
263 | 254 | " <i class=\"fa fa-lightbulb-o\" aria-hidden=\"true\"></i>\n", |
|
276 | 267 | "metadata": {}, |
277 | 268 | "source": [ |
278 | 269 | "## Output Overview\n", |
279 | | - "*These sub-directories will be mentioned in the order of their execution within TransPi.*\n", |
| 270 | + "*These sub-directories will be mentioned in the order of their execution within the de novo transcriptome assembly pipeline.*\n", |
280 | 271 | "\n", |
281 | 272 | "<div class=\"alert alert-block alert-success\">\n", |
282 | 273 | " <i class=\"fa fa-hand-paper-o\" aria-hidden=\"true\"></i>\n", |
|
290 | 281 | "\n", |
291 | 282 | "### Filter\n", |
292 | 283 | "> FastP is a bioinformatics tool that preprocesses the raw read data. It trims poor-quality reads, removes adapter sequences, and corrects errors noticed within the reads. The `joined.fastp.html` provides an overview of the processing done on both read files.\n", |
| 284 | +     "> SortMeRNA is an optional bioinformatics tool designed to efficiently remove ribosomal RNA (rRNA) and mitochondrial DNA from sequencing data. By filtering out these abundant and often unwanted sequences, SortMeRNA helps to enrich the dataset for more relevant RNA species, such as messenger RNA (mRNA) or other non-coding RNAs. This preprocessing step enhances downstream analyses, such as transcriptome profiling or differential expression studies, by reducing noise and improving the accuracy of the results.\n",
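A quick way to sanity-check the filtering step is to count how many reads survive it. This is a minimal sketch, not part of the pipeline: the file name is a placeholder, and it assumes gzipped FASTQ files (4 lines per record).

```shell
# Minimal sketch: count records in a gzipped FASTQ file (4 lines per read).
# The file name below is a placeholder, not a path produced by this workflow.
count_reads() {
  gzip -dc "$1" | wc -l | awk '{print $1 / 4}'
}

# Usage (placeholder file name):
#   count_reads trimmed_R1.fastq.gz
```

Comparing the count for the raw and trimmed files shows how many reads the filter discarded.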
293 | 285 | "\n", |
294 | 286 | "### Assemblies\n", |
295 | | - "> TransPi uses five different assembly tools. All of the assembly `.fa` files are placed within the assemblies directory. For all of the assemblies except for Trinity, there are four `.fa` files: one for each of the *k*-mer length plus a compilation of all three. Trinity does not have the option to customize the *k*-mer size. Instead, it runs at a default `k=25`, therefore only having one assembly.\n", |
| 287 | +     "> Transcriptome assembly is a critical step in RNA sequencing analysis, aimed at reconstructing the full set of transcripts from the sequenced reads. The de novo transcriptome assembly workflow uses several tools and approaches for this purpose:\n",
| 288 | + "\n", |
| 289 | + ">>\n", |
| 290 | +     ">> - Trinity with Normalized Reads (default = True): Trinity is a widely used tool for de novo transcriptome assembly. Using normalized reads helps to reduce computational burden and memory usage by adjusting the read coverage, leading to a more efficient and accurate assembly process. Trinity runs at a default k-mer size of 25, providing a single assembly output.\n",
| 291 | + ">> - Trinity with Non-Normalized Reads: This approach uses raw read data without normalization. While it may require more computational resources, it can potentially capture low-abundance transcripts that might be lost during normalization. Like the normalized option, it runs at a default k-mer size of 25, resulting in one assembly output.\n", |
| 292 | + ">> - rnaSPAdes Medium Filtered Transcripts (default=True): rnaSPAdes is another powerful tool for transcriptome assembly. The medium filtering option balances sensitivity and specificity, providing a reliable set of assembled transcripts by removing some low-confidence sequences while retaining most of the true transcripts. This option generates four .fa files: one for each k-mer length and a compilation of all three.\n", |
| 293 | + ">> - rnaSPAdes Soft Filtered Transcripts: This option applies less stringent filtering criteria, which can be useful for capturing a broader range of transcripts, including those with lower confidence. It may be beneficial in exploratory analyses where sensitivity is prioritized. It also generates four .fa files as described above.\n", |
| 294 | + ">> - rnaSPAdes Hard Filtered Transcripts: This option applies more stringent filtering criteria, resulting in a high-confidence set of assembled transcripts. It is ideal for downstream applications where specificity and accuracy are critical, such as functional annotation or differential expression analysis. Similar to the other rnaSPAdes options, it generates four .fa files.\n", |
296 | 295 | "\n", |
297 | 296 | "### EviGene\n", |
298 | 297 | "> At this point, we have a major overassembly of the transcriptome. We use a small piece of the EvidentialGene (EviGene) program known as tr2aacds which takes all of the assemblies and crunches them into a single, unified transcriptome. Within the evigene directory, there are two files: `joined.combined.fa` is all of the assemblies placed into the same file and`joined.combined.okay.fa` is the combined transcriptome after EviGene has reduced it down. In each header line, there is key information about the sequence.\n", |
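A simple way to see how aggressively EviGene reduced the overassembly is to count the FASTA headers in the two files it produces. This sketch assumes you are working inside the evigene results sub-directory; the file names come from the description above.

```shell
# Minimal sketch: compare sequence counts before and after EviGene's
# tr2aacds reduction by counting FASTA header lines ('>' at line start).
# Paths assume the evigene results sub-directory is the working directory.
grep -c '^>' joined.combined.fa       # all assemblies pooled together
grep -c '^>' joined.combined.okay.fa  # after reduction to the "okay" set
```

The first count should be much larger than the second, reflecting the redundancy that tr2aacds removed.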
|
305 | 304 | ">> - For more information on interpreting the headers from EviGene, reference the following [link](http://arthropods.eugenes.org/EvidentialGene/evigene/) in section 3.\n", |
306 | 305 | "\n", |
307 | 306 | "### BUSCO\n", |
308 | | - "> BUSCO uses a database of known universal single-copy orthologs under a specific lineage (vertebrata in this case) and checks our assembled transcriptome for those sequences which it expects to find. BUSCO was run on both the TransPi assembly along with the assembly just done by Trinity. To visualize BUSCO's results, refer to the `short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt` and `short_summary.specific.vertebrata_odb10.joined.Trinity.bus4.txt` files.\n", |
309 | | - "\n", |
310 | | - "### Mapping \n", |
311 | | - "> One way to verify the quality of the assembly is to map the original input reads to the assembly (using an alignment program called bowtie2). There are two output files, one for the TransPi assembly and one for the Trinity exclusive assembly. These files are named `log_joined.combined.okay.fa.txt` and `log_joined.Trinity.fa.txt`.\n", |
| 307 | + "> BUSCO uses a database of known universal single-copy orthologs under a specific lineage (vertebrata in this case) and checks our assembled transcriptome for those sequences which it expects to find. To visualize BUSCO's results, refer to the `short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt` and `short_summary.specific.vertebrata_odb10.joined.Trinity.bus4.txt` files.\n", |
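If you only want the headline completeness number, you can pull the one-line "C:" summary out of a BUSCO short_summary file without opening the whole report. The file name is the one listed above; the path is assumed to be relative to the BUSCO results sub-directory.

```shell
# Minimal sketch: print BUSCO's one-line completeness summary (the line
# containing "C:", e.g. complete/fragmented/missing percentages).
# Assumes the short_summary file is in the current directory.
grep -m1 'C:' short_summary.specific.vertebrata_odb10.joined.TransPi.bus4.txt
```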
312 | 308 | "\n", |
313 | 309 | "### rnaQUAST\n", |
314 | 310 | "> rnaQUAST is another assembly assessment program. It provides statistics about the transcripts that have been produced. For a brief overview of the transcript statistics, refer to `joined_rnaQUAST.csv`.\n", |
|
319 | 315 | "### Trinotate\n", |
320 | 316 | "> Trinotate uses the information regarding likely coding regions produced by TransDecoder to make predictions about potential protein function. It does this by cross-referencing the assembled transcripts to various databases such as pfam and hmmer. These annotations can be viewed in the `joined.trinotate_annotation_report.xls` file.\n", |
321 | 317 | "\n", |
322 | | - "### Report\n", |
323 | | - "> Within `report` is one file: `TransPi_Report_joined.html`. This is an HTML file that combines the results throughout TransPi into a series of visual tables and figures.\n", |
324 | | - ">> The sub-directories `stats` and `figures` are intermediary sub-directories that hold information to generate the report.\n", |
| 318 | + "### TransRate\n", |
| 319 | +     "> TransRate is a tool used for assessing the quality of transcriptome assemblies. It evaluates the accuracy and completeness of the assembled transcripts by mapping the original reads back to the assembly. This process helps identify misassemblies, incomplete transcripts, and other potential issues, providing a comprehensive quality score for the transcriptome. However, this step is not performed if the profile is set to conda or mamba. By using TransRate, researchers can ensure the reliability of their transcriptome data before proceeding to downstream analyses.\n",
| 320 | + "\n", |
| 321 | + "### Salmon\n", |
| 322 | +     "> Salmon is a highly efficient tool for transcript quantification that uses a technique called pseudo-alignment. This approach allows for the rapid and accurate estimation of transcript abundance by mapping sequencing reads to a reference transcriptome without the need for full alignment. Salmon's lightweight and fast algorithm significantly reduces computational time and resources while maintaining high accuracy. It provides detailed quantification of transcript levels, which is essential for downstream analyses such as differential expression studies and gene expression profiling.\n",
| 323 | + "\n", |
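Salmon writes its per-transcript abundances to a tab-separated `quant.sf` file (columns: Name, Length, EffectiveLength, TPM, NumReads). As a quick sanity check on a quantification run, you can sum the TPM column, which should come out near one million by construction of the TPM unit. The path is a placeholder for wherever your Salmon output landed.

```shell
# Minimal sketch: sum the TPM column (column 4) of Salmon's quant.sf,
# skipping the header row. TPM values are normalized to sum to ~1e6.
# The file path is a placeholder for your Salmon output directory.
awk 'NR > 1 { sum += $4 } END { print sum }' quant.sf
```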
| 324 | + "### MultiQC\n", |
| 325 | +     "> MultiQC is a versatile tool that aggregates results from multiple bioinformatics analyses into a single, comprehensive HTML report. For this workflow, MultiQC compiles and visualizes quality control metrics for raw reads, trimmed reads, BUSCO (Benchmarking Universal Single-Copy Orthologs) assessments, and Salmon quantification results. This consolidated report provides an easy-to-navigate overview of the data quality and processing steps, facilitating quick identification of any issues and ensuring that the dataset is ready for downstream analyses.\n",
325 | 326 | "\n", |
326 | 327 | "### pipeline_info\n", |
327 | | - "> One of the benefits of using Nexflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources. Two key files within this sub-directory are `transpi_timeline.html` and `transpi_report.html`. In the `transpi_timeline.html` file, you can see a graphic representation of the total execution time of the entire workflow, along with where in the process each of the programs was active. From this diagram, you can also infer the ***dependency*** ordering that is encoded into the TransPi workflow. For example, none of the assembly runs started until the process labeled **`normalize reads`** was complete because each of these is run on the normalized data, rather than the raw input. Similarly, **`evigene`**, the program that integrates and refines the output of all of the assembly runs doesn't start until all of the assembly processes are complete. Within the `transpi_report.html` file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, input/output, container used, and more.\n", |
| 328 | +     "> One of the benefits of using Nextflow and a well-defined pipeline/workflow is that when the run is completed, we get a high-level summary of the execution timeline and resources. Two key files within this sub-directory are `execution_timeline.html` and `execution_report.html`. In the `execution_timeline.html` file, you can see a graphic representation of the total execution time of the entire workflow, along with where in the process each of the programs was active. From this diagram, you can also infer the ***dependency*** ordering that is encoded into the de novo transcriptome workflow. For example, **`evigene`**, the program that integrates and refines the output of all of the assembly runs, doesn't start until all of the assembly processes are complete. Within the `execution_report.html` file, you can get a view of the resources used and activities carried out by each process, including CPUs, RAM, input/output, container used, and more.\n",
328 | 329 | "\n", |
329 | 330 | "### RUN-INFO.txt\n", |
330 | 331 | "> `RUN-INFO.txt` provides the specific details of the run such as where the directories are and the versions of the various programs used." |
331 | 332 | ] |
332 | 333 | }, |
| 334 | + { |
| 335 | + "cell_type": "code", |
| 336 | + "execution_count": null, |
| 337 | + "id": "e3dd8a45", |
| 338 | + "metadata": {}, |
| 339 | + "outputs": [], |
| 340 | + "source": [] |
| 341 | + }, |
333 | 342 | { |
334 | 343 | "cell_type": "markdown", |
335 | 344 | "id": "c372f902-6138-4217-86a8-4b0002f5f387", |
|