Commit 7687b7b

Move datasets to separate repo (#933)
* move datasets to separate repo
* move datasets
* add readme
1 parent 6b9ff73 commit 7687b7b

119 files changed

Lines changed: 4 additions & 10375 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@
 
 - Moved `src/tasks/spatially_variable_genes` to [`task_spatially_variable_genes`](https://github.com/openproblems-bio/task_spatially_variable_genes) (PR #910).
 
+- Moved `src/datasets` to [`datasets`](https://github.com/openproblems-bio/datasets) (PR #933).
+
 ## Major changes
 
 - Update Viash to 0.9.0 (PR #911).

src/datasets/README.md

Lines changed: 2 additions & 218 deletions
@@ -1,219 +1,3 @@
+# Datasets
 
-- <a href="#common-datasets" id="toc-common-datasets">Common datasets</a>
-- <a href="#pipeline-topology" id="toc-pipeline-topology">Pipeline
-topology</a>
-- <a href="#file-format-api" id="toc-file-format-api">File format API</a>
-- <a href="#datasetpcahvg"
-id="toc-datasetpcahvg"><code>Dataset+Pca+Hvg</code></a>
-- <a href="#normalized-dataset"
-id="toc-normalized-dataset"><code>Normalized Dataset</code></a>
-- <a href="#datasetpca" id="toc-datasetpca"><code>Dataset+Pca</code></a>
-- <a href="#raw-dataset" id="toc-raw-dataset"><code>Raw Dataset</code></a>
-- <a href="#component-api" id="toc-component-api">Component API</a>
-- <a href="#dataset-loader"
-id="toc-dataset-loader"><code>Dataset Loader</code></a>
-- <a href="#normalization"
-id="toc-normalization"><code>Normalization</code></a>
-- <a href="#processor-hvg"
-id="toc-processor-hvg"><code>Processor Hvg</code></a>
-- <a href="#processor-pca"
-id="toc-processor-pca"><code>Processor Pca</code></a>
-
-# Common datasets
-
-## Pipeline topology
-
-``` mermaid
-%%| column: screen-inset-shaded
-flowchart LR
-file_dataset(Dataset+Pca+Hvg)
-file_normalized(Normalized Dataset)
-file_pca(Dataset+Pca)
-file_raw(Raw Dataset)
-comp_dataset_loader[/Dataset Loader/]
-comp_normalization[/Normalization/]
-comp_processor_hvg[/Processor Hvg/]
-comp_processor_pca[/Processor Pca/]
-file_raw---comp_normalization
-file_pca---comp_processor_hvg
-file_normalized---comp_processor_pca
-comp_dataset_loader-->file_raw
-comp_normalization-->file_normalized
-comp_processor_hvg-->file_dataset
-comp_processor_pca-->file_pca
-```
-
-## File format API
-
-### `Dataset+Pca+Hvg`
-
-A normalised data with a PCA embedding and HVG selection
-
-Used in:
-
-- [processor hvg](#processor%20hvg): output (as output)
-
-Slots:
-
-| struct | name | type | description |
-|:-------|:-----------------|:--------|:------------------------------------------------------------------------|
-| layers | counts | integer | Raw counts |
-| layers | normalized | double | Normalised expression values |
-| obs | celltype | string | Cell type information |
-| obs | batch | string | Batch information |
-| obs | tissue | string | Tissue information |
-| obs | size_factors | double | The size factors created by the normalisation method, if any. |
-| var | hvg | boolean | Whether or not the feature is considered to be a ‘highly variable gene’ |
-| var | hvg_score | integer | A ranking of the features by hvg. |
-| obsm | X_pca | double | The resulting PCA embedding. |
-| varm | pca_loadings | double | The PCA loadings matrix. |
-| uns | dataset_id | string | A unique identifier for the dataset |
-| uns | normalization_id | string | Which normalization was used |
-| uns | pca_variance | double | The PCA variance objects. |
-
-Example:
-
-AnnData object
-obs: 'celltype', 'batch', 'tissue', 'size_factors'
-var: 'hvg', 'hvg_score'
-uns: 'dataset_id', 'normalization_id', 'pca_variance'
-obsm: 'X_pca'
-varm: 'pca_loadings'
-layers: 'counts', 'normalized'
-
-### `Normalized Dataset`
-
-A normalized dataset
-
-Used in:
-
-- [normalization](#normalization): output (as output)
-- [processor pca](#processor%20pca): input (as input)
-
-Slots:
-
-| struct | name | type | description |
-|:-------|:-----------------|:--------|:--------------------------------------------------------------|
-| layers | counts | integer | Raw counts |
-| layers | normalized | double | Normalised expression values |
-| obs | celltype | string | Cell type information |
-| obs | batch | string | Batch information |
-| obs | tissue | string | Tissue information |
-| obs | size_factors | double | The size factors created by the normalisation method, if any. |
-| uns | dataset_id | string | A unique identifier for the dataset |
-| uns | normalization_id | string | Which normalization was used |
-
-Example:
-
-AnnData object
-obs: 'celltype', 'batch', 'tissue', 'size_factors'
-uns: 'dataset_id', 'normalization_id'
-layers: 'counts', 'normalized'
-
-### `Dataset+Pca`
-
-A normalised data with a PCA embedding
-
-Used in:
-
-- [processor hvg](#processor%20hvg): input (as input)
-- [processor pca](#processor%20pca): output (as output)
-
-Slots:
-
-| struct | name | type | description |
-|:-------|:-----------------|:--------|:--------------------------------------------------------------|
-| layers | counts | integer | Raw counts |
-| layers | normalized | double | Normalised expression values |
-| obs | celltype | string | Cell type information |
-| obs | batch | string | Batch information |
-| obs | tissue | string | Tissue information |
-| obs | size_factors | double | The size factors created by the normalisation method, if any. |
-| obsm | X_pca | double | The resulting PCA embedding. |
-| varm | pca_loadings | double | The PCA loadings matrix. |
-| uns | dataset_id | string | A unique identifier for the dataset |
-| uns | normalization_id | string | Which normalization was used |
-| uns | pca_variance | double | The PCA variance objects. |
-
-Example:
-
-AnnData object
-obs: 'celltype', 'batch', 'tissue', 'size_factors'
-uns: 'dataset_id', 'normalization_id', 'pca_variance'
-obsm: 'X_pca'
-varm: 'pca_loadings'
-layers: 'counts', 'normalized'
-
-### `Raw Dataset`
-
-An unprocessed dataset as output by a dataset loader.
-
-Used in:
-
-- [dataset loader](#dataset%20loader): output (as output)
-- [normalization](#normalization): input (as input)
-
-Slots:
-
-| struct | name | type | description |
-|:-------|:-----------|:--------|:------------------------------------|
-| layers | counts | integer | Raw counts |
-| obs | celltype | string | Cell type information |
-| obs | batch | string | Batch information |
-| obs | tissue | string | Tissue information |
-| uns | dataset_id | string | A unique identifier for the dataset |
-
-Example:
-
-AnnData object
-obs: 'celltype', 'batch', 'tissue'
-uns: 'dataset_id'
-layers: 'counts'
-
-## Component API
-
-### `Dataset Loader`
-
-Arguments:
-
-| Name | Type | Direction | Description |
-|:-----------|:------------------------------|:----------|:------------------------------------------------------|
-| `--output` | [Raw Dataset](#Raw%20dataset) | output | An unprocessed dataset as output by a dataset loader. |
-
-### `Normalization`
-
-Arguments:
-
-| Name | Type | Direction | Description |
-|:---------------------|:--------------------------------------------|:----------|:-------------------------------------------------------------|
-| `--input` | [Raw Dataset](#Raw%20dataset) | input | An unprocessed dataset as output by a dataset loader. |
-| `--output` | [Normalized Dataset](#Normalized%20dataset) | output | A normalized dataset |
-| `--layer_output` | `string` | input | The name of the layer in which to store the normalized data. |
-| `--obs_size_factors` | `string` | input | In which .obs slot to store the size factors (if any). |
-
-### `Processor Hvg`
-
-Arguments:
-
-| Name | Type | Direction | Description |
-|:------------------|:------------------------------------|:----------|:---------------------------------------------------------------------------|
-| `--input` | [Dataset+Pca](#Dataset+PCA) | input | A normalised data with a PCA embedding |
-| `--layer_input` | `string` | input | Which layer to use as input for the PCA. |
-| `--output` | [Dataset+Pca+Hvg](#Dataset+PCA+HVG) | output | A normalised data with a PCA embedding and HVG selection |
-| `--var_hvg` | `string` | input | In which .var slot to store whether a feature is considered to be hvg. |
-| `--var_hvg_score` | `string` | input | In which .var slot to store whether a ranking of the features by variance. |
-| `--num_features` | `integer` | input | The number of HVG to select |
-
-### `Processor Pca`
-
-Arguments:
-
-| Name | Type | Direction | Description |
-|:-------------------|:--------------------------------------------|:----------|:---------------------------------------------------------------------------------------------------------------------|
-| `--input` | [Normalized Dataset](#Normalized%20dataset) | input | A normalized dataset |
-| `--layer_input` | `string` | input | Which layer to use as input for the PCA. |
-| `--output` | [Dataset+Pca](#Dataset+PCA) | output | A normalised data with a PCA embedding |
-| `--obsm_embedding` | `string` | input | In which .obsm slot to store the resulting embedding. |
-| `--varm_loadings` | `string` | input | In which .varm slot to store the resulting loadings matrix. |
-| `--uns_variance` | `string` | input | In which .uns slot to store the resulting variance objects. |
-| `--num_components` | `integer` | input | Number of principal components to compute. Defaults to 50, or 1 - minimum dimension size of selected representation. |
+# This directory has been moved to [https://github.com/openproblems-bio/datasets](https://github.com/openproblems-bio/datasets)!
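
The slot tables in the removed README describe four file formats that build on one another: each pipeline stage keeps the slots of its input and adds a few more. As a reading aid, the following minimal sketch writes those tables down as plain Python sets and reports which slots a file would still be missing. The names `extend`, `missing_slots`, and the dictionary layout are hypothetical and are not part of the repository; only the slot names come from the tables.

```python
# Required slots per file format, transcribed from the slot tables in the
# removed README. The dict-of-sets layout and helper names are assumptions
# made for illustration only.

def extend(base, extra):
    """Return a new schema containing the slots of `base` plus `extra`."""
    merged = {struct: set(names) for struct, names in base.items()}
    for struct, names in extra.items():
        merged.setdefault(struct, set()).update(names)
    return merged


RAW_DATASET = {
    "layers": {"counts"},
    "obs": {"celltype", "batch", "tissue"},
    "uns": {"dataset_id"},
}

# Normalization adds a normalized layer, size factors and its own id.
NORMALIZED_DATASET = extend(RAW_DATASET, {
    "layers": {"normalized"},
    "obs": {"size_factors"},
    "uns": {"normalization_id"},
})

# Processor Pca adds the embedding, the loadings and the variance objects.
DATASET_PCA = extend(NORMALIZED_DATASET, {
    "obsm": {"X_pca"},
    "varm": {"pca_loadings"},
    "uns": {"pca_variance"},
})

# Processor Hvg adds the HVG flag and score per feature.
DATASET_PCA_HVG = extend(DATASET_PCA, {"var": {"hvg", "hvg_score"}})


def missing_slots(present, schema):
    """List 'struct/name' entries required by `schema` but absent in `present`."""
    return sorted(
        f"{struct}/{name}"
        for struct, names in schema.items()
        for name in names - set(present.get(struct, ()))
    )


if __name__ == "__main__":
    # A raw dataset lacks exactly what the Normalization component adds.
    print(missing_slots(RAW_DATASET, NORMALIZED_DATASET))
```

With a real AnnData object, `present` could presumably be built from the key sets of its `layers`, `obs`, `var`, `obsm`, `varm`, and `uns` attributes; that wiring is left out here to keep the sketch free of dependencies.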
