---
title: Dataset processor
order: 50
engine: knitr
page-navigation: true
---

```{r setup-packages, include=FALSE}
library(tidyverse)
library(rlang)

strip_margin <- function(text, symbol = "\\|") {
  str_replace_all(text, paste0("(\n?)[ \t]*", symbol), "\\1")
}
```

{{< include /_include/_clone_repo_task_template.qmd >}}
{{< include /_include/_evaluate_code.qmd >}}

In this section, we'll create the {{< glossary "Dataset processor" >}} component and generate the first {{< glossary "test resources" >}}. The Dataset processor component uses one of the pre-defined {{< glossary "common dataset" label="Common datasets" >}} to create the AnnData files used in your task. These AnnData objects then serve as the first test resources for testing Method and Metric components.

:::{.callout-note}
In the task template, the Dataset processor component is already created. You can find the component in the `src/data_processors/process_dataset` directory. You can use this component as a reference to create your own Dataset processor component, or adjust it as needed.
:::

# Why?

The Dataset processor component splits the Common dataset into separate AnnData objects, so that Method and Metric components can only observe the information listed in their file format specification files. This not only safeguards against data leakage, but also ensures that the Method component can't accidentally change the data structure that the Metric component relies on.
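
To make the idea concrete, here is a minimal conceptual sketch in which plain Python dictionaries stand in for AnnData slots. The specification contents, the slot names, and the `subset_slots` function are all hypothetical and only illustrate the principle; they are not the OpenProblems helper functions or your task's actual API files.

```python
# Hypothetical format specifications: the slots each component is allowed to read.
METHOD_INPUT_SPEC = {"layers": ["counts"], "obs": ["batch"]}
SOLUTION_SPEC = {"obs": ["cell_type"]}


def subset_slots(data, spec):
    """Keep only the slots (and the fields within them) that the spec lists."""
    return {
        slot: {field: data[slot][field] for field in fields}
        for slot, fields in spec.items()
    }


# Toy stand-in for a common dataset (plain dicts instead of an AnnData object).
common = {
    "layers": {"counts": [[1, 0], [0, 3]]},
    "obs": {"batch": ["b1", "b2"], "cell_type": ["alpha", "beta"]},
}

method_input = subset_slots(common, METHOD_INPUT_SPEC)
solution = subset_slots(common, SOLUTION_SPEC)

print(sorted(method_input["obs"]))  # ['batch']
print(sorted(solution["obs"]))      # ['cell_type']
```

Because the Method input never contains `cell_type`, a method cannot read the ground truth even by accident.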

# How?

The Dataset processor, like any component in OpenProblems, is a {{< glossary "Viash component" >}}, which consists of a config and a script.

# Step 1: Create the Viash config

Start by creating the Viash config. Luckily, we already created its file format and argument list in the [previous section](design_api.qmd), so we can just import the interface using the `__merge__` field:


:::{.panel-tabset group="language"}

# Python

```{.yaml filename="src/data_processors/process_dataset/config.vsh.yaml"}
__merge__: ../../api/comp_data_processor.yaml #<1>
name: process_dataset #<2>
# arguments: #<3>
#   - name: "--method"
#     type: "string"
#     description: "The process method to assign train/test."
#     choices: ["batch", "random"]
#     default: "batch"
resources: #<4>
  - type: python_script #<5>
    path: script.py
  - path: /common/helper_functions/subset_h5ad_by_format.py #<6>

engines:
  - type: docker #<7>
    image: openproblems/base_python:1

runners:
  - type: executable
  - type: nextflow #<8>
    directives:
      label: [highmem, highcpu, hightime]
```

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to `process_dataset`.
3. (_Optional_) [Arguments](https://viash.io/versioned/0_9_0_RC/reference/config/arguments/) that can be passed to the component.
4. [Resources](https://viash.io/versioned/0_9_0_RC/reference/config/resources/) are files that provide the logic of the component.
5. We'll create the Python script in the next step.
6. (_Optional_) [Helper functions](https://viash.io/guide/component/use-helper-functions.html) you can use in your script.
7. A [Docker engine](https://viash.io/guide/component/add-dependencies.html) to make the component reproducible.
8. A [Nextflow runner](https://viash.io/guide/nextflow_vdsl3/create-a-module.html) to allow turning this component into a Nextflow module (and also standalone pipeline).

# R


```{.yaml filename="src/data_processors/process_dataset/config.vsh.yaml"}
__merge__: ../../api/comp_data_processor.yaml #<1>
name: process_dataset #<2>
# arguments: #<3>
#   - name: "--method"
#     type: "string"
#     description: "The process method to assign train/test."
#     choices: ["batch", "random"]
#     default: "batch"
resources: #<4>
  - type: r_script #<5>
    path: script.R
  - path: /common/helper_functions/subset_h5ad_by_format.py #<6>

engines:
  - type: docker #<7>
    image: openproblems/base_r:1

runners:
  - type: executable
  - type: nextflow #<8>
    directives:
      label: [highmem, highcpu, hightime]
```

1. The interface to use for this component.
2. The name of the component. In this case, this should always be set to `process_dataset`.
3. (_Optional_) [Arguments](https://viash.io/versioned/0_9_0_RC/reference/config/arguments/) that can be passed to the component.
4. [Resources](https://viash.io/versioned/0_9_0_RC/reference/config/resources/) are files that provide the logic of the component.
5. We'll create the R script in the next step.
6. (_Optional_) [Helper functions](https://viash.io/guide/component/use-helper-functions.html) you can use in your script.
7. A [Docker engine](https://viash.io/guide/component/add-dependencies.html) to make the component reproducible.
8. A [Nextflow runner](https://viash.io/guide/nextflow_vdsl3/create-a-module.html) to allow turning this component into a Nextflow module (and also standalone pipeline).

:::

# Step 2: Create the script

Next, create the script which will split the common dataset into a train and solution dataset.

:::{.panel-tabset group="language"}
# Python

```{.python filename="src/data_processors/process_dataset/script.py"}
import anndata as ad

## VIASH START
par = {
    "input": "resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad",
    "output_dataset": "dataset.h5ad",
    "output_solution": "solution.h5ad",
}
## VIASH END

print(">> Load common dataset", flush=True)
adata = ad.read_h5ad(par["input"])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

print(">> Create dataset for methods", flush=True)
output_dataset = ad.AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

print(">> Create solution object for metrics", flush=True)
output_solution = ad.AnnData(
  # <- add the data you wish to store in the output_solution object here
)

print(">> Write to disk", flush=True)
output_dataset.write_h5ad(par["output_dataset"])
output_solution.write_h5ad(par["output_solution"])
```

# R

```{.r filename="src/data_processors/process_dataset/script.R"}
library(anndata)

## VIASH START
par <- list(
  input = "resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad",
  output_dataset = "dataset.h5ad",
  output_solution = "solution.h5ad"
)
## VIASH END

cat(">> Load common dataset\n")
adata <- anndata::read_h5ad(par[["input"]])

# <- implement logic for splitting the common dataset
#    into a dataset and a solution object here.

cat(">> Create dataset for methods\n")
output_dataset <- anndata::AnnData(
  # <- add the data you wish to store in the output_dataset object here
)

cat(">> Create solution object for metrics\n")
output_solution <- anndata::AnnData(
  # <- add the data you wish to store in the output_solution object here
)

cat(">> Write to disk\n")
output_dataset$write_h5ad(par[["output_dataset"]])
output_solution$write_h5ad(par[["output_solution"]])
```

:::

:::{.callout-tip}
You can use the helper function `subset_h5ad_by_format`, defined in [`common/helper_functions/subset_h5ad_by_format.py`](https://github.com/openproblems-bio/common_resources/blob/main/helper_functions/subset_h5ad_by_format.py), to subset the AnnData object so that it contains only the slots specified by the corresponding API files. Examples of tasks where this is used: [Task template](https://github.com/openproblems-bio/task_template/blob/main/src/data_processors/process_dataset/script.py) and [Dimensionality reduction](https://github.com/openproblems-bio/task_dimensionality_reduction/blob/main/src/data_processors/process_dataset/script.py).
:::

# Step 3: Create test resources

Next, we'll create part of the test resources by creating a test resource script.
Create a Bash script which runs the `process_dataset` component.

```{.bash filename="scripts/create_datasets/test_resources.sh"}
#!/bin/bash

# get the root of the repository
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

# remove this when you have implemented the script #<1>
echo "TODO: replace the commands in this script with the sequence of components that you need to run to generate test_resources." #<1>
echo "  Inside this script, you will need to place commands to generate example files for each of the 'src/api/file_*.yaml' files." #<1>
exit 1 #<1>

set -e

RAW_DATA=resources_test/common
DATASET_DIR=resources_test/task_template

mkdir -p $DATASET_DIR

# process dataset
echo Running process_dataset
nextflow run . \
  -main-script target/nextflow/workflows/process_datasets/main.nf \
  -profile docker \
  --publish_dir "$DATASET_DIR" \
  --id "cxg_mouse_pancreas_atlas" \
  --input "$RAW_DATA/cxg_mouse_pancreas_atlas/dataset.h5ad" \
  --output_train '$id/train.h5ad' \
  --output_test '$id/test.h5ad' \
  --output_solution '$id/solution.h5ad' \
  --output_state '$id/state.yaml'

# run one method
viash run src/methods/knn/config.vsh.yaml -- \
    --input_train $DATASET_DIR/cxg_mouse_pancreas_atlas/train.h5ad \
    --input_test $DATASET_DIR/cxg_mouse_pancreas_atlas/test.h5ad \
    --output $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad

# run one metric
viash run src/metrics/accuracy/config.vsh.yaml -- \
    --input_prediction $DATASET_DIR/cxg_mouse_pancreas_atlas/prediction.h5ad \
    --input_solution $DATASET_DIR/cxg_mouse_pancreas_atlas/solution.h5ad \
    --output $DATASET_DIR/cxg_mouse_pancreas_atlas/score.h5ad

# only run this if you have access to the openproblems-data bucket #<2>
# aws s3 sync --profile op \
#   "$DATASET_DIR" s3://openproblems-data/resources_test/task_template \
#   --delete --dryrun
```

1. Remove this when you have implemented the script.
2. Only run this if you have access to the openproblems-data bucket.

:::{.callout-note}
The `--output_*` arguments in this script are examples. Update them so that they match the output arguments defined in your API file.
:::

Now run the script to generate the first couple of test resources:

```bash
scripts/create_datasets/test_resources.sh
```
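
After the script finishes, it can be useful to confirm that the expected files were written. The following is a small optional sketch using only the Python standard library. The `missing_outputs` function is a hypothetical helper written for this example, and the file names and directory simply mirror the example arguments in the script above, so adjust them to your own API.

```python
from pathlib import Path


def missing_outputs(dataset_dir, expected_files):
    """Return the expected files that are absent or empty under dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    for name in expected_files:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Assumed file names, taken from the example output arguments above.
EXPECTED = ["train.h5ad", "test.h5ad", "solution.h5ad", "state.yaml"]

if __name__ == "__main__":
    # Assumed location: the dataset directory used by the method and metric steps.
    dataset_dir = "resources_test/task_template/cxg_mouse_pancreas_atlas"
    problems = missing_outputs(dataset_dir, EXPECTED)
    print("missing or empty:", problems if problems else "none")
```

Run it from the root of the repository so that the relative path resolves correctly.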