Updated Submodule 3 and fixed typo in submod 2

aolveraNIH · aolveraNIH · commit de60521d5df1 · 2025-06-09T11:53:48.000-05:00
diff --git a/AWS/Submodule_2_annotation_only.ipynb b/AWS/Submodule_2_annotation_only.ipynb
@@ -663,15 +663,70 @@
    "source": [
     "## Conclusion\n",
     "\n",
-    "This notebook provided a comprehensive hands-on experience in transcriptome annotation using the `denovoscript` pipeline in annotation-only mode, leveraging AWS Batch for serverless execution and Docker containers for BUSCO analysis. Through a guided workflow, users learned to set up AWS Batch, execute `denovoscript` to annotate a rainbow trout transcriptome, assess transcriptome completeness with BUSCO, and critically interpret the results from BUSCO, GO, and TransDecoder analyses. Furthermore, the notebook emphasized the importance of understanding data provenance and culminated in an independent BUSCO analysis exercise, challenging users to apply their newfound skills to different transcriptomes and critically evaluate the outcomes, thus solidifying their understanding of transcriptome assembly and annotation principles."
+    "This notebook provided a comprehensive hands-on experience in transcriptome annotation using the `denovoscript` pipeline in annotation-only mode, leveraging AWS Batch for serverless execution and Docker containers for BUSCO analysis. Through a guided workflow, users learned to set up AWS Batch, execute `denovoscript` to annotate a rainbow trout transcriptome, assess transcriptome completeness with BUSCO, and critically interpret the results from BUSCO, GO, and TransDecoder analyses. Furthermore, the notebook emphasized the importance of understanding data provenance and culminated in an independent BUSCO analysis exercise, challenging users to apply their newfound skills to different transcriptomes and critically evaluate the outcomes, thus solidifying their understanding of transcriptome assembly and annotation principles.\n",
+    "\n",
+    "\n",
+    "### Why Use AWS Batch?\n",
+    "<table border=\"1\" cellpadding=\"8\" cellspacing=\"0\">\n",
+    "  <thead>\n",
+    "    <tr>\n",
+    "      <th>Benefit</th>\n",
+    "      <th>Explanation</th>\n",
+    "    </tr>\n",
+    "  </thead>\n",
+    "  <tbody>\n",
+    "    <tr>\n",
+    "      <td><strong>Scalability</strong></td>\n",
+    "      <td>Process large RNA-seq datasets with multiple jobs in parallel</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Reproducibility</strong></td>\n",
+    "      <td>Ensures the exact same Docker containers and config are used every time</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Ease of Management</strong></td>\n",
+    "      <td>No need to manually manage EC2 instances or storage mounts</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Integration with S3</strong></td>\n",
+    "      <td>Input/output seamlessly handled via S3 buckets</td>\n",
+    "    </tr>\n",
+    "  </tbody>\n",
+    "</table>\n",
+    "\n",
+    "Running on AWS Batch is ideal when your dataset grows beyond what your local notebook or server can handle, or when you want reproducible, cloud-native workflows that are easier to scale, share, and manage."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "5bc80021",
    "metadata": {},
    "source": [
-    "## Clean Up\n",
+    "## Clean Up the AWS Environment\n",
+    "\n",
+    "Once you've successfully run your analysis and downloaded the results, it's a good idea to clean up unused resources to avoid unnecessary charges.\n",
+    "\n",
+    "#### Recommended Cleanup Steps:\n",
+    "\n",
+    "- **Delete Output Files from S3 (Optional)**  \n",
+    "  If you've downloaded your results locally and no longer need them stored in the cloud, delete them from the bucket.\n",
+    "- **Delete the S3 Bucket (Optional)**  \n",
+    "  Remove the entire bucket only if you are sure you no longer need anything stored in it.\n",
+    "- **Shut Down AWS Batch Resources (Optional but Recommended)**  \n",
+    "  If you used a CloudFormation stack to set up AWS Batch, you can delete all associated resources in one step by deleting the stack (⚠️ Note: deleting the stack also removes the IAM roles and compute environments created by the template):\n",
+    "  + Go to the <a href=\"https://console.aws.amazon.com/cloudformation/\">AWS CloudFormation Console</a>\n",
+    "  + Select your stack (e.g., <code>aws-batch-nigms-test1</code>)\n",
+    "  + Click <b>Delete</b>\n",
+    "  + Wait for all resources (compute environments, roles, queues) to be removed\n",
+    "  \n",
+    "<div style=\"border: 1px solid #659078; padding: 0px; border-radius: 4px;\">\n",
+    "  <div style=\"background-color: #d4edda; padding: 5px; font-weight: bold;\">\n",
+    "    <i class=\"fas fa-lightbulb\" style=\"color: #0e4628;margin-right: 5px;\"></i><a style=\"color: #0e4628\">Tips</a>\n",
+    "  </div>\n",
+    "  <p style=\"margin-left: 5px;\">\n",
+    "It’s always good practice to periodically review your <b>EC2 instances</b>, <b>ECR container images</b>, <b>S3 storage</b>, and <b>CloudWatch logs</b> to ensure no stray resources are incurring charges.\n",
+    "  </p>\n",
+    "</div>\n",
     "\n",
     "Remember to proceed to the next notebook [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb) or shut down your instance if you are finished."
    ]
diff --git a/AWS/Submodule_3_basic_assembly.ipynb b/AWS/Submodule_3_basic_assembly.ipynb
@@ -128,41 +128,63 @@
     "! aws s3 ls s3://nigms-sandbox/nosi-inbremaine-storage/resources/seq2/"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "5e8dd0c5",
+   "metadata": {},
+   "source": [
+    "***If you have not set up AWS Batch, proceed to Step 2; otherwise, skip to Step 3.***"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "7a87b0d2",
    "metadata": {},
    "source": [
-    "### **Step 2:** AWS Batch Setup\n",
-    "\n",
-    "AWS Batch will create the needed permissions, roles and resources to run Nextflow in a serverless manner. You can set up AWS Batch manually or deploy it **automatically** with a stack template. The Launch Stack button below will take you to the cloud formation create stack webpage with the template with required resources already linked. \n",
+    "### **Step 2:** Setting up AWS Batch \n",
     "\n",
-    "If you prefer to skip manual deployment and deploy automatically in the cloud, click the Launch Stack button below. For a walkthrough of the screens during automatic deployment please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take ~5 min and then the resources will be ready for use. \n",
+    "AWS Batch manages the provisioning of compute environments (EC2, Fargate), container orchestration, job queues, IAM roles, and permissions. We can deploy a full environment either:\n",
+    "- Automatically, using a preconfigured AWS CloudFormation stack (**recommended**)\n",
+    "- Manually, by setting up roles, queues, and buckets\n",
+    "\n",
+    "The **Launch Stack** button below opens the AWS CloudFormation stack-creation page with the template for the required resources already linked. \n",
     "\n",
-    "[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml)\n",
+    "To deploy automatically, click the **Launch Stack** button below. For a walkthrough of the screens during automatic deployment, see [this guide](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take about 5 minutes, after which the resources will be ready for use. \n",
     "\n",
-    "\n",
-    "Before beginning this tutorial, if you do not have required roles, policies, permissions or compute environment and would like to **manually** set those up please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md) to set that up."
+    "[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "413ac931",
+   "id": "1f55c633",
    "metadata": {},
    "source": [
-    "#### Change the parameters as desired in `aws` profile inside `../denovotrascript/nextflow.config` file:\n",
-    " - Name of your **AWS Batch Job Queue**\n",
-    " - AWS region \n",
-    " - Nextflow work directory\n",
-    " - Nextflow output directory"
+    "### **Step 3:** Install dependencies, update paths and create a new S3 Bucket to store input and output files\n",
+    "\n",
+    "After setting up the AWS CloudFormation stack, we need to tell the Nextflow workflow where those resources are by providing the configuration:\n",
+    "<div style=\"border: 1px solid #e57373; padding: 0px; border-radius: 4px;\">\n",
+    "  <div style=\"background-color: #ffcdd2; padding: 5px; \">\n",
+    "    <i class=\"fas fa-exclamation-triangle\" style=\"color: #b71c1c;margin-right: 5px;\"></i><a style=\"color: #b71c1c\"><b>Important</b> - Customize Required</a>\n",
+    "  </div>\n",
+    "  <p style=\"margin-left: 5px;\">\n",
+    "After successfully creating your stack, you must attach a new role to SageMaker to be able to submit Batch jobs. Please follow the steps below to change your SageMaker role:<br>\n",
+    "<ol> <li>Navigate to your SageMaker AI notebook dashboard (where you initially created and launched your VM)</li> <li>Locate your instance and click the <b>Stop</b> button</li> <li>Once the instance is stopped: <ul> <li>Click <b>Edit</b></li> <li>Scroll to the \"Permissions and encryption\" section</li> <li>Click the IAM role dropdown</li> <li>Select the new role created during stack formation (named something like <b>aws-batch-nigms-SageMakerExecutionRole</b>)</li> </ul> </li> \n",
+    "<li>Click <b>Update notebook instance</b> to save your changes</li> \n",
+    "<li>After the update completes: <ul> <li>Click <b>Start</b> to relaunch your instance</li> <li>Reconnect to your instance</li> <li>Resume your work from this point</li> </ul> </li> </ol>\n",
+    "\n",
+    "<b>Warning:</b> Make sure to replace the <b>stack name</b> with the name of the stack that you just created: <code>STACK_NAME = \"your-stack-name-here\"</code>\n",
+    "  </p>\n",
+    "</div>"
    ]
   },
   {
-   "cell_type": "markdown",
-   "id": "1f55c633",
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0da9939e",
    "metadata": {},
+   "outputs": [],
    "source": [
-    "### **Step 3:** Install Nextflow"
+    "# Define the stack name (replace with the name of the stack you created)\n",
+    "STACK_NAME = \"aws-batch-nigms-test1\""
    ]
   },
   {
@@ -172,55 +194,113 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "%%capture\n",
-    "! mamba create  -n nextflow -c bioconda nextflow -y\n",
-    "! mamba install -n nextflow ipykernel -y"
+    "import boto3\n",
+    "# Get account ID and region \n",
+    "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
+    "region = boto3.session.Session().region_name"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b52c37c5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Set variable names \n",
+    "# These variables should come from the Intro AWS Batch tutorial (or leave as-is if using the launch stack button)\n",
+    "BUCKET_NAME = f\"{STACK_NAME}-batch-bucket-{account_id}\"\n",
+    "AWS_QUEUE = f\"{STACK_NAME}-JobQueue\"\n",
+    "INPUT_FOLDER = 'nigms-sandbox/nosi-inbremaine-storage/'\n",
+    "AWS_REGION = region"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "bcb1fe5e",
+   "id": "8fce8e92",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-danger\">\n",
-    "    <i class=\"fa fa-exclamation-circle\" aria-hidden=\"true\"></i>\n",
-    "    <b>Alert: </b> Remember to change your kernel to <b>conda_nextflow</b> to run nextflow.\n",
-    "</div>"
+    "#### Install dependencies\n",
+    "Install Nextflow and Java, which are required to execute the pipeline. In environments like SageMaker, Java is usually pre-installed. If you're running outside SageMaker (e.g., on EC2 or locally), you’ll need to install it manually."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7625b33",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install Nextflow\n",
+    "! mamba install -y -c conda-forge -c bioconda nextflow --quiet"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "72d1a3b8",
+   "id": "80b91ef0",
    "metadata": {},
    "source": [
-    "### **Step 4:** Run `denovotranscript`"
+    "<details>\n",
+    "<summary>Install Java and Nextflow on other systems if needed</summary>\n",
+    "If you are using a system other than an AWS SageMaker notebook, you might need to install Java and Nextflow using the code below:\n",
+    "<br> <i># Install java</i><pre>\n",
+    "    sudo apt update\n",
+    "    sudo apt-get install default-jdk -y\n",
+    "    java -version\n",
+    "    </pre>\n",
+    "    <i># Install Nextflow</i><pre>\n",
+    "    curl https://get.nextflow.io | bash\n",
+    "    chmod +x nextflow\n",
+    "    ./nextflow self-update\n",
+    "    ./nextflow plugin update\n",
+    "    </pre>\n",
+    "</details>"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "ee5985e3-93df-4779-afe1-4464e13bf619",
+   "id": "61f28ac4",
    "metadata": {},
    "outputs": [],
    "source": [
-    "! nextflow run main.nf --input test_samplesheet.csv -profile aws --run_mode full"
+    "# Replace the batch bucket name in the Nextflow configuration file\n",
+    "! sed -i \"s/aws-batch-nigms-batch-bucket-/$BUCKET_NAME/g\" ../denovotranscript/nextflow.config\n",
+    "# Replace the job queue name in the configuration file\n",
+    "! sed -i \"s/aws-batch-nigms-JobQueue/$AWS_QUEUE/g\" ../denovotranscript/nextflow.config\n",
+    "# Replace the region placeholder with the region you are in\n",
+    "! sed -i \"s/aws-region/$AWS_REGION/g\" ../denovotranscript/nextflow.config"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "0117d994-0502-4a58-b07a-861d254f11e2",
+   "id": "72d1a3b8",
    "metadata": {},
    "source": [
-    "The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output."
+    "### **Step 4:** Enable AWS Batch for the Nextflow pipeline `denovotranscript`"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "5ad70acb",
+   "id": "92fb30de",
+   "metadata": {},
+   "source": [
+    "Run the pipeline in a cloud-native, serverless manner using AWS Batch, which offloads the burden of provisioning and managing compute resources. When you execute the command below:\n",
+    "- Nextflow submits tasks to AWS Batch. \n",
+    "- AWS Batch pulls the necessary containers.\n",
+    "- Each process/task in the pipeline runs as an isolated job in the cloud.\n",
+    "\n",
+    "The beauty and power of using a defined workflow in a management system (such as Nextflow) are that we not only get a defined set of steps that are carried out in the proper order, but we also get a well-structured and concise directory structure that holds all pertinent output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee5985e3-93df-4779-afe1-4464e13bf619",
    "metadata": {},
+   "outputs": [],
    "source": [
-    "---\n",
-    "# Andrea, please update the rest for result"
+    "! nextflow run main.nf --input test_samplesheet.csv -profile aws --run_mode full"
    ]
   },
   {
@@ -238,7 +318,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! aws s3 ls s3://<YOUR-BUCKET-NAME>/<Your-Output-Directory>/"
+    "! aws s3 ls s3://$BUCKET_NAME/nextflow_output/"
    ]
   },
   {
@@ -386,17 +466,67 @@
    "id": "909f6112",
    "metadata": {},
    "source": [
-    "## Conclusion"
+    "## Conclusion: Why Use AWS Batch?\n",
+    "<table border=\"1\" cellpadding=\"8\" cellspacing=\"0\">\n",
+    "  <thead>\n",
+    "    <tr>\n",
+    "      <th>Benefit</th>\n",
+    "      <th>Explanation</th>\n",
+    "    </tr>\n",
+    "  </thead>\n",
+    "  <tbody>\n",
+    "    <tr>\n",
+    "      <td><strong>Scalability</strong></td>\n",
+    "      <td>Process large RNA-seq datasets with multiple jobs in parallel</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Reproducibility</strong></td>\n",
+    "      <td>Ensures the exact same Docker containers and config are used every time</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Ease of Management</strong></td>\n",
+    "      <td>No need to manually manage EC2 instances or storage mounts</td>\n",
+    "    </tr>\n",
+    "    <tr>\n",
+    "      <td><strong>Integration with S3</strong></td>\n",
+    "      <td>Input/output seamlessly handled via S3 buckets</td>\n",
+    "    </tr>\n",
+    "  </tbody>\n",
+    "</table>\n",
+    "\n",
+    "Running on AWS Batch is ideal when your dataset grows beyond what your local notebook or server can handle, or when you want reproducible, cloud-native workflows that are easier to scale, share, and manage."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "b68484f3",
    "metadata": {},
    "source": [
-    "## Clean Up\n",
-    "\n",
-    "Shut down your instance if you are finished."
+    "## Clean Up the AWS Environment\n",
+    "\n",
+    "Once you've successfully run your analysis and downloaded the results, it's a good idea to clean up unused resources to avoid unnecessary charges.\n",
+    "\n",
+    "#### Recommended Cleanup Steps:\n",
+    "\n",
+    "- **Delete Output Files from S3 (Optional)**  \n",
+    "  If you've downloaded your results locally and no longer need them stored in the cloud, delete them from the bucket.\n",
+    "- **Delete the S3 Bucket (Optional)**  \n",
+    "  Remove the entire bucket only if you are sure you no longer need anything stored in it.\n",
+    "- **Shut Down AWS Batch Resources (Optional but Recommended)**  \n",
+    "  If you used a CloudFormation stack to set up AWS Batch, you can delete all associated resources in one step by deleting the stack (⚠️ Note: deleting the stack also removes the IAM roles and compute environments created by the template):\n",
+    "  + Go to the <a href=\"https://console.aws.amazon.com/cloudformation/\">AWS CloudFormation Console</a>\n",
+    "  + Select your stack (e.g., <code>aws-batch-nigms-test1</code>)\n",
+    "  + Click <b>Delete</b>\n",
+    "  + Wait for all resources (compute environments, roles, queues) to be removed\n",
+    "  \n",
+    "<div style=\"border: 1px solid #659078; padding: 0px; border-radius: 4px;\">\n",
+    "  <div style=\"background-color: #d4edda; padding: 5px; font-weight: bold;\">\n",
+    "    <i class=\"fas fa-lightbulb\" style=\"color: #0e4628;margin-right: 5px;\"></i><a style=\"color: #0e4628\">Tips</a>\n",
+    "  </div>\n",
+    "  <p style=\"margin-left: 5px;\">\n",
+    "It’s always good practice to periodically review your <b>EC2 instances</b>, <b>ECR container images</b>, <b>S3 storage</b>, and <b>CloudWatch logs</b> to ensure no stray resources are incurring charges.\n",
+    "  </p>\n",
+    "</div>"
    ]
   }
  ],