
Commit b55c074

Merge remote-tracking branch 'origin/main' into feature/no-ref/add-sp-sim-results

2 parents f6ae538 + 2fa5fab

16 files changed

Lines changed: 6167 additions & 18 deletions

.github/workflows/quarto_netlify.yml

Lines changed: 3 additions & 1 deletion
```diff
@@ -28,7 +28,9 @@ jobs:
         use-public-rspm: true

     - name: Install system dependencies for igraph
-      run: sudo apt-get update && sudo apt-get install -y libglpk40
+      run: |
+        sudo apt-get update && \
+        sudo apt-get install -y libglpk40 libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

     - name: Set up environment
       run: |
```

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -30,6 +30,12 @@

 ## NEW CONTENT

+* Add Predict Modality benchmark page (PR #320).
+
+# openproblems.bio v2.3.6
+
+## NEW CONTENT
+
 * Add an event page for the Weekly wednesday work meeting (PR #299).

 * Add `Advanced_topics` pages to documentation (PR #300).
```

documentation/create_task/create_workflow.qmd

Lines changed: 137 additions & 5 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -5,10 +5,142 @@ engine: knitr
 page-navigation: true
 ---

-Once dataset processor, method and metric components are created, it is time to combine them into a workflow. A workflow is a sequence of components that are executed in a specific order.
-
-:::{.callout-important}
-This page is still under construction.
-
-For now, this step is not an essential step of creating a new task before submitting it for review.
-:::
```

Up to this point, you've seen how OpenProblems uses modular components (dataset loaders, methods, and metrics) to define benchmarking tasks for single-cell analysis. Now we'll bring these components together into a complete, executable workflow using Viash. By creating a Viash Nextflow component, we can orchestrate the execution of these modules, making it simple to run a full benchmark, manage dependencies, and generate comprehensive results.

Let's break down how to build a Nextflow workflow component with Viash. The core idea is to define a `config.vsh.yaml` file that describes the workflow's inputs, outputs, dependencies, and execution logic, and then write a `main.nf` script containing the Nextflow workflow itself. Viash takes care of the rest, automatically generating the necessary boilerplate code and wiring everything together.
## Viash Config for Nextflow Workflows (`config.vsh.yaml`)

The `config.vsh.yaml` file is where we specify the blueprint of our workflow. Here's a breakdown of the key sections:

```{.yaml filename="src/workflows/run_benchmark/config.vsh.yaml"}
name: run_benchmark # <1>
namespace: workflows

argument_groups:
  - name: Inputs # <2>
    arguments:
      - name: "--input_train"
        __merge__: /src/api/file_train.yaml
        type: file
        direction: input
        required: true
      - ...
  - name: Outputs # <3>
    arguments:
      - name: "--output_scores"
        type: file
        required: true
        direction: output
        description: A yaml file containing the scores of each of the methods
        default: score_uns.yaml
      - ...

resources: # <4>
  - type: nextflow_script
    path: main.nf
    entrypoint: run_wf

dependencies: # <5>
  - name: h5ad/extract_uns_metadata
  - name: methods/logistic_regression
  - name: metrics/accuracy
  - name: control_methods/true_labels

runners: # <6>
  - type: nextflow
```

1. **`name` and `namespace`:** As with other component types, these fields uniquely identify your workflow component within the Viash ecosystem.
2. **Input arguments:** Here we define the input files our workflow will consume. These typically consist of all of the files created by the dataset processor defined on the previous page.
3. **Output arguments:** The outputs are where the results of your benchmarking workflow will be stored. For benchmarks, these should always include:
   * `output_scores`: a yaml containing the scores of the methods on the datasets.
   * `output_method_configs`: a yaml containing the Viash configs of the methods used in the benchmark.
   * `output_metric_configs`: a yaml containing the Viash configs of the metrics used in the benchmark.
   * `output_dataset_info`: a yaml containing the metadata of the datasets used in the benchmark.
   * `output_task_info`: a yaml containing the metadata of the benchmark task itself.
4. **`resources`:** This section points Viash to your Nextflow workflow definition, typically stored in a file named `main.nf` with an entry point named `run_wf`.
5. **`dependencies`:** This is what makes workflow components truly powerful. Here you declare the other Viash components your workflow depends on: methods, metrics, or other utility components. Specify each dependency by name, and optionally its repository if it is not a component from this repository.
6. **`runners`:** This tells Viash to use Nextflow to run this component.
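To illustrate the repository option mentioned above, here is a minimal sketch of declaring a dependency from another repository. The repository name, external component name, and tag below are placeholders for illustration, not taken from this task:

```yaml
repositories:
  - name: op_core              # hypothetical named repository
    type: github
    repo: openproblems-bio/core
    tag: main

dependencies:
  - name: h5ad/extract_uns_metadata   # component from this repository
  - name: utils/some_helper           # hypothetical external component
    repository: op_core
```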
### The Nextflow Workflow (`main.nf`)

The `main.nf` file contains the actual Nextflow workflow that orchestrates the execution of your components. Let's dissect the key parts of this file:

```{.groovy filename="src/workflows/run_benchmark/main.nf"}
// <1>

workflow run_wf { // <2>
  take:
  input_ch // <3>

  main:

  dataset_ch = input_ch
    | map{ id, state ->
      [id, state + ["_meta": [join_id: id]]]
    }
    | extract_uns_metadata.run( // <4>
      fromState: [input: "input_solution"],
      toState: { id, output, state ->
        state + [
          dataset_uns: readYaml(output.output).uns
        ]
      }
    )

  methods = [ // <5>
    true_labels,
    logistic_regression
  ]
  metrics = [
    accuracy
  ]

  score_ch = dataset_ch // <6>
    // run all methods
    | runEach(
      components: methods,
      // ...
    )
    // run all metrics
    | runEach(
      components: metrics,
      // ...
    )

  output_ch = ... // <7>

  emit:
  output_ch
}
```

1. **Auto-generated code (not shown):** When you build your workflow component with Viash, it automatically injects boilerplate code at the beginning of your `main.nf`. This code handles tasks like finding the components you declared as dependencies and setting up the initial data flow.
2. **Workflow structure:** Your main workflow logic is enclosed within a `workflow` block named `run_wf`, as specified in the `config.vsh.yaml`. This is the entry point for your workflow.
3. **Channels, the data flow backbone:** Nextflow uses channels to pass data between processes (in our case, the Viash components). Channels are like asynchronous queues, allowing components to operate independently and concurrently. Viash workflows use a specific convention for the data passed through channels: each element is a tuple `[id, state]`, where `id` is a unique identifier for a particular data instance and `state` is a key-value dictionary holding all of the data and metadata associated with that `id`.
4. **Component execution with `.run()`:** Viash enhances components with a `.run()` method. While a regular component can be executed directly as a process in a Nextflow workflow, `.run()` provides more fine-grained control: the `fromState` argument specifies which data gets passed to the component, and the `toState` argument specifies how the component's output is merged back into the state. This is a powerful way to customize how components interact with the workflow's data.
5. **Method and metric components:** Here we define the methods and metrics to use in the benchmark. These are the components declared as dependencies in the `config.vsh.yaml`. We can collect them in the `methods` and `metrics` variables and use them in the workflow as needed.
6. **Running components with `runEach()`:** The `runEach()` function takes a list of components and runs them in parallel, passing the data through the channels. This is where the benchmarking actually happens: each method is run on each dataset, each metric is run on the results, and finally the scores are collected.
7. **Boilerplate:** The rest of the workflow is typically boilerplate code, which will be moved to a separate helper file to avoid code duplication.
## Test Run

To test your workflow, you can use the provided `scripts/run_benchmark/run_test_local.sh` script, which is designed to run the `run_benchmark` workflow on a small test dataset.

1. **Build all Docker containers:** Before running the script, build all Docker containers for your components by running `scripts/project/build_all_docker_containers.sh`. This ensures that the necessary dependencies are available when the workflow runs.
2. **Edit `run_test_local.sh`:** Modify the script to match your workflow's inputs and outputs. The relevant lines are marked with TODO comments.
3. **Run the script:** Once you've edited the script, run it from the root of the repository with `./scripts/run_benchmark/run_test_local.sh`.

The script will execute the `run_benchmark` workflow using Nextflow and the Docker profile. After it finishes, examine the output files in the specified `publish_dir` to verify that your workflow is working correctly.

## Next Steps

Now that you've created a complete benchmarking workflow, you can run it on your own datasets and methods, and extend it with additional metrics, methods, or other components.

The workflow can also be run on the OpenProblems cloud infrastructure, which allows benchmarking methods at a larger scale and generating comprehensive results. Get in touch to get access to the platform for launching your workflows.

Once the workflow has finished running and the results have been reviewed by the results QC team, they will be published on the OpenProblems website.

results/_include/_baseline_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -6,18 +6,18 @@ lines <- pmap_chr(baselines, function(method_name, method_summary, method_descri
   image <- pluck(rest, "image", .default = NULL)
   documentation_url <- pluck(rest, "documentation_url", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```

results/_include/_method_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -6,18 +6,18 @@ lines <- pmap_chr(method_info %>% filter(!is_baseline), function(method_name, me
   image <- pluck(rest, "image", .default = NULL)
   documentation_url <- pluck(rest, "documentation_url", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```

results/_include/_metric_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -4,18 +4,18 @@ lines <- pmap_chr(metric_info, function(metric_name, metric_summary, metric_desc
   rest <- list(...)
   image <- pluck(rest, "image", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```
Lines changed: 68 additions & 0 deletions
```json
[
  {
    "dataset_id": "openproblems_neurips2021/bmmc_cite/normal",
    "dataset_name": "NeurIPS2021 CITE-Seq (GEX2ADT)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 704994,
    "common_dataset_id": "openproblems_neurips2021/bmmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_multiome/normal",
    "dataset_name": "NeurIPS2021 Multiome (GEX2ATAC)",
    "dataset_summary": "Single-cell Multiome (GEX+ATAC) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X Multiome Gene Expression and Chromatin Accessibility kit. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 31080807,
    "common_dataset_id": "openproblems_neurips2021/bmmc_multiome"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_multiome/swap",
    "dataset_name": "NeurIPS2021 Multiome (ATAC2GEX)",
    "dataset_summary": "Single-cell Multiome (GEX+ATAC) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X Multiome Gene Expression and Chromatin Accessibility kit. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 7883109,
    "common_dataset_id": "openproblems_neurips2021/bmmc_multiome"
  },
  {
    "dataset_id": "openproblems_neurips2022/pbmc_cite/normal",
    "dataset_name": "OpenProblems NeurIPS2022 CITE-Seq (GEX2ADT)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2022. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "lance2024predicting",
    "data_url": "https://www.kaggle.com/competitions/open-problems-multimodal/data",
    "date_created": "25-11-2024",
    "file_size": 591886,
    "common_dataset_id": "openproblems_neurips2022/pbmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2022/pbmc_cite/swap",
    "dataset_name": "OpenProblems NeurIPS2022 CITE-Seq (ADT2GEX)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2022. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "lance2024predicting",
    "data_url": "https://www.kaggle.com/competitions/open-problems-multimodal/data",
    "date_created": "25-11-2024",
    "file_size": 32551804,
    "common_dataset_id": "openproblems_neurips2022/pbmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_cite/swap",
    "dataset_name": "NeurIPS2021 CITE-Seq (ADT2GEX)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 13467880,
    "common_dataset_id": "openproblems_neurips2021/bmmc_cite"
  }
]
```
