
Commit b55c074

Merge remote-tracking branch 'origin/main' into feature/no-ref/add-sp-sim-results

2 parents f6ae538 + 2fa5fab

16 files changed

Lines changed: 6167 additions & 18 deletions

.github/workflows/quarto_netlify.yml

Lines changed: 3 additions & 1 deletion
```diff
@@ -28,7 +28,9 @@ jobs:
         use-public-rspm: true

     - name: Install system dependencies for igraph
-      run: sudo apt-get update && sudo apt-get install -y libglpk40
+      run: |
+        sudo apt-get update && \
+        sudo apt-get install -y libglpk40 libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

     - name: Set up environment
       run: |
```

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -30,6 +30,12 @@

 ## NEW CONTENT

+* Add Predict Modality benchmark page (PR #320).
+
+# openproblems.bio v2.3.6
+
+## NEW CONTENT
+
 * Add an event page for the Weekly wednesday work meeting (PR #299).

 * Add `Advanced_topics` pages to documentation (PR #300).
```

documentation/create_task/create_workflow.qmd

Lines changed: 137 additions & 5 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -5,10 +5,142 @@ engine: knitr
 page-navigation: true
 ---

-Once dataset processor, method and metric components are created, it is time to combine them into a workflow. A workflow is a sequence of components that are executed in a specific order.
-
-:::{.callout-important}
-This page is still under construction.
-
-For now, this step is not an essential step of creating a new task before submitting it for review.
-:::
```

Up to this point, you've seen how OpenProblems uses modular components (dataset loaders, methods, and metrics) to define benchmarking tasks for single-cell analysis. Now we'll bring these components together into a complete, executable workflow using Viash. By creating a Viash Nextflow component, we can orchestrate the execution of these modules, making it simple to run a full benchmark, manage dependencies, and generate comprehensive results.

Let's break down how to build a Nextflow workflow component with Viash. The core idea is to define a `config.vsh.yaml` file that describes the workflow's inputs, outputs, dependencies, and execution logic, and then write a `main.nf` script containing the Nextflow workflow itself. Viash takes care of the rest, automatically generating the necessary boilerplate code and wiring everything together.
## Viash Config for Nextflow Workflows (`config.vsh.yaml`)

The `config.vsh.yaml` file is where we specify the blueprint of our workflow. Here's a breakdown of the key sections:

```{.yaml filename="src/workflows/run_benchmark/config.vsh.yaml"}
name: run_benchmark # <1>
namespace: workflows

argument_groups:
  - name: Inputs # <2>
    arguments:
      - name: "--input_train"
        __merge__: /src/api/file_train.yaml
        type: file
        direction: input
        required: true
      - ...
  - name: Outputs # <3>
    arguments:
      - name: "--output_scores"
        type: file
        required: true
        direction: output
        description: A yaml file containing the scores of each of the methods
        default: score_uns.yaml
      - ...

resources: # <4>
  - type: nextflow_script
    path: main.nf
    entrypoint: run_wf

dependencies: # <5>
  - name: h5ad/extract_uns_metadata
  - name: methods/logistic_regression
  - name: metrics/accuracy
  - name: control_methods/true_labels

runners: # <6>
  - type: nextflow
```

1. **`name` and `namespace`:** As with other component types, these fields uniquely identify your workflow component within the Viash ecosystem.
2. **Input arguments:** Here we define the input files our workflow will consume. These typically consist of all of the files created by the dataset processor defined on the previous page.
3. **Output arguments:** The outputs are where the results of your benchmarking workflow will be stored. For benchmarks, these should always include:
   * `output_scores`: a yaml containing the scores of the methods on the datasets.
   * `output_method_configs`: a yaml containing the Viash configs of the methods used in the benchmark.
   * `output_metric_configs`: a yaml containing the Viash configs of the metrics used in the benchmark.
   * `output_dataset_info`: a yaml containing the metadata of the datasets used in the benchmark.
   * `output_task_info`: a yaml containing the metadata of the benchmark task itself.
4. **`resources`:** This section points Viash to your Nextflow workflow definition, typically stored in a file named `main.nf` with an entry point named `run_wf`.
5. **`dependencies`:** This is what makes workflow components truly powerful. Here you declare the other Viash components your workflow depends on: methods, metrics, or other utility components. Specify each dependency by name, and optionally its repository if it is not a component from this repository.
6. **`runners`:** This tells Viash to use Nextflow to run this component.
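To illustrate the repository option mentioned above, here is a minimal sketch of declaring a dependency from another repository. The repository name, external component name, and tag below are placeholders for illustration, not taken from this task:

```yaml
repositories:
  - name: op_core              # hypothetical named repository
    type: github
    repo: openproblems-bio/core
    tag: main

dependencies:
  - name: h5ad/extract_uns_metadata   # component from this repository
  - name: utils/some_helper           # hypothetical external component
    repository: op_core
```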
### The Nextflow Workflow (`main.nf`)

The `main.nf` file contains the actual Nextflow workflow that orchestrates the execution of your components. Let's dissect the key parts of this file:

```{.groovy filename="src/workflows/run_benchmark/main.nf"}
// <1>

workflow run_wf { // <2>
  take:
  input_ch // <3>

  main:

  dataset_ch = input_ch
    | map{ id, state ->
      [id, state + ["_meta": [join_id: id]]]
    }
    | extract_uns_metadata.run( // <4>
      fromState: [input: "input_solution"],
      toState: { id, output, state ->
        state + [
          dataset_uns: readYaml(output.output).uns
        ]
      }
    )

  methods = [ // <5>
    true_labels,
    logistic_regression
  ]
  metrics = [
    accuracy
  ]

  score_ch = dataset_ch // <6>
    // run all methods
    | runEach(
      components: methods,
      // ...
    )
    // run all metrics
    | runEach(
      components: metrics,
      // ...
    )

  output_ch = ... // <7>

  emit:
  output_ch
}
```

1. **Auto-generated code (not shown):** When you build your workflow component with Viash, it automatically injects boilerplate code at the beginning of your `main.nf`. This code handles tasks like finding the components you declared as dependencies and setting up the initial data flow.
2. **Workflow structure:** Your main workflow logic is enclosed within a `workflow` block named `run_wf`, as specified in the `config.vsh.yaml`. This is the entry point for your workflow.
3. **Channels, the data flow backbone:** Nextflow uses channels to pass data between processes (in our case, the Viash components). Channels are like asynchronous queues, allowing components to operate independently and concurrently. Viash workflows use a specific convention for the data passed through channels: each element is a tuple `[id, state]`, where `id` is a unique identifier for a particular data instance and `state` is a key-value dictionary holding all of the data and metadata associated with that `id`.
4. **Component execution with `.run()`:** Viash enhances components with a `.run()` method. While a regular component can be executed directly as a process in a Nextflow workflow, `.run()` provides more fine-grained control: the `fromState` argument specifies which data gets passed to the component, and the `toState` argument specifies how the component's output is merged back into the state. This is a powerful way to customize how components interact with the workflow's data.
5. **Method and metric components:** Here we define the methods and metrics to use in the benchmark. These are the components declared as dependencies in the `config.vsh.yaml`. We can collect them in the `methods` and `metrics` variables and use them in the workflow as needed.
6. **Running components with `runEach()`:** The `runEach()` function takes a list of components and runs them in parallel, passing the data through the channels. This is where the benchmarking actually happens: each method is run on each dataset, each metric is run on the results, and finally the scores are collected.
7. **Boilerplate:** The rest of the workflow is typically boilerplate code, which will be moved to a separate helper file to avoid code duplication.
## Test Run

To test your workflow, you can use the provided `scripts/run_benchmark/run_test_local.sh` script, which is designed to run the `run_benchmark` workflow on a small test dataset.

1. **Build all Docker containers:** Before running the script, build all Docker containers for your components by running `scripts/project/build_all_docker_containers.sh`. This ensures that the necessary dependencies are available when the workflow runs.
2. **Edit `run_test_local.sh`:** Modify the script to match your workflow's inputs and outputs. The relevant lines are marked with TODO comments.
3. **Run the script:** Once you've edited the script, run it from the root of the repository with `./scripts/run_benchmark/run_test_local.sh`.

The script will execute the `run_benchmark` workflow using Nextflow and the Docker profile. After it finishes, examine the output files in the specified `publish_dir` to verify that your workflow is working correctly.

## Next Steps

Now that you've created a complete benchmarking workflow, you can run it on your own datasets and methods, and extend it with additional metrics, methods, or other components.

The workflow can also be run on the OpenProblems cloud infrastructure, which allows benchmarking methods at a larger scale and generating comprehensive results. Get in touch to get access to the platform for launching your workflows.

Once the workflow has finished running and the results have been reviewed by the results QC team, they will be published on the OpenProblems website.

results/_include/_baseline_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -6,18 +6,18 @@ lines <- pmap_chr(baselines, function(method_name, method_summary, method_descri
   image <- pluck(rest, "image", .default = NULL)
   documentation_url <- pluck(rest, "documentation_url", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```

results/_include/_method_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -6,18 +6,18 @@ lines <- pmap_chr(method_info %>% filter(!is_baseline), function(method_name, me
   image <- pluck(rest, "image", .default = NULL)
   documentation_url <- pluck(rest, "documentation_url", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```

results/_include/_metric_descriptions.qmd

Lines changed: 4 additions & 4 deletions
```diff
@@ -4,18 +4,18 @@ lines <- pmap_chr(metric_info, function(metric_name, metric_summary, metric_desc
   rest <- list(...)
   image <- pluck(rest, "image", .default = NULL)
   code_version <- pluck(rest, "code_version", .default = NULL)
-  references_doi <- pluck(rest, "references_doi", .default = NULL)
-  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL)
+  references_doi <- pluck(rest, "references_doi", .default = NULL) |> na.omit()
+  references_bibtex <- pluck(rest, "references_bibtex", .default = NULL) |> na.omit()

   ref <-
     if ("paper_reference" %in% names(rest)) {
       split_cite_fun(rest$paper_reference)
     } else {
       bibs <- c()
-      if (!is.null(references_doi) && !is.na(references_doi)) {
+      if (!is.null(references_doi) && length(references_doi) != 0) {
         bibs <- get_bibtex_from_doi(references_doi)
       }
-      if (!is.null(references_bibtex) && !is.na(references_bibtex)) {
+      if (!is.null(references_bibtex) && length(references_bibtex) != 0) {
         bibs <- c(bibs, references_bibtex)
       }
       # Write new entries to library.bib
```
Lines changed: 68 additions & 0 deletions
```json
[
  {
    "dataset_id": "openproblems_neurips2021/bmmc_cite/normal",
    "dataset_name": "NeurIPS2021 CITE-Seq (GEX2ADT)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 704994,
    "common_dataset_id": "openproblems_neurips2021/bmmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_multiome/normal",
    "dataset_name": "NeurIPS2021 Multiome (GEX2ATAC)",
    "dataset_summary": "Single-cell Multiome (GEX+ATAC) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X Multiome Gene Expression and Chromatin Accessibility kit. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 31080807,
    "common_dataset_id": "openproblems_neurips2021/bmmc_multiome"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_multiome/swap",
    "dataset_name": "NeurIPS2021 Multiome (ATAC2GEX)",
    "dataset_summary": "Single-cell Multiome (GEX+ATAC) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X Multiome Gene Expression and Chromatin Accessibility kit. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 7883109,
    "common_dataset_id": "openproblems_neurips2021/bmmc_multiome"
  },
  {
    "dataset_id": "openproblems_neurips2022/pbmc_cite/normal",
    "dataset_name": "OpenProblems NeurIPS2022 CITE-Seq (GEX2ADT)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2022. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "lance2024predicting",
    "data_url": "https://www.kaggle.com/competitions/open-problems-multimodal/data",
    "date_created": "25-11-2024",
    "file_size": 591886,
    "common_dataset_id": "openproblems_neurips2022/pbmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2022/pbmc_cite/swap",
    "dataset_name": "OpenProblems NeurIPS2022 CITE-Seq (ADT2GEX)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2022. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "lance2024predicting",
    "data_url": "https://www.kaggle.com/competitions/open-problems-multimodal/data",
    "date_created": "25-11-2024",
    "file_size": 32551804,
    "common_dataset_id": "openproblems_neurips2022/pbmc_cite"
  },
  {
    "dataset_id": "openproblems_neurips2021/bmmc_cite/swap",
    "dataset_name": "NeurIPS2021 CITE-Seq (ADT2GEX)",
    "dataset_summary": "Single-cell CITE-Seq (GEX+ADT) data collected from bone marrow mononuclear cells of 12 healthy human donors.",
    "dataset_description": "Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data was then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout such that some donor samples were measured at multiple sites with some donors measured at a single site.",
    "data_reference": "luecken2021neurips",
    "data_url": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122",
    "date_created": "25-11-2024",
    "file_size": 13467880,
    "common_dataset_id": "openproblems_neurips2021/bmmc_cite"
  }
]
```
