
Commit 634418d

docs: documentation for platform support and requirements (#68)
* docs: re-writing pipeline dev page to platform support
* docs: clean up some details
* docs: clarify pipeline/platform relationship
1 parent fc4f0af commit 634418d

3 files changed

Lines changed: 72 additions & 49 deletions

File tree

docs/source/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -66,7 +66,7 @@ policies_practices/data_organization.md
 policies_practices/data_governance.md
 
 policies_practices/software_practices.md
-policies_practices/pipeline_development.md
+policies_practices/platform_support.md
 policies_practices/docs
 policies_practices/developer_templates
 
```

docs/source/policies_practices/pipeline_development.md

Lines changed: 0 additions & 48 deletions
This file was deleted.
docs/source/policies_practices/platform_support.md

Lines changed: 71 additions & 0 deletions
# Platforms and pipelines

A ‘platform’ is an integrated ecosystem of standardized material preparation, data acquisition methods, and robust data pipelines. Platforms are characterized by efficient, hardened hardware and software, standardized operational processes, mature data models, and quality control, including real-time dashboards. Platforms receive specific support from the data and software teams at AIND and are subject to specific requirements.

Pipelines are the automated, per-modality processing performed after a platform uploads data. Pipelines have specific data and metadata requirements they must conform to.

All platforms and pipelines must follow the [data organization conventions](data_organization.md) for file organization in raw and derived assets.
## Platform support

[Todo] This section will be filled out in more detail as platform operations are finalized across teams. Expect information on:

- Quality assurance
- Asset tracking dashboards
## Platform requirements

### Logging

Platforms should log all events to the [Loki server](https://github.com/AllenNeuralDynamics/aind-log-utils) maintained by SIPE. Events should be discrete pieces of information, warnings, and errors that need to be made visible to users in a dashboard. Continuous metrics should instead be logged to a service specific to each tool and made visible in a dashboard attached to that tool.
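Discrete event logging along these lines can be sketched with the standard library. The `log_event` helper and the event names below are hypothetical; real platforms should emit records through aind-log-utils so they reach the Loki server:

```python
import logging

# Hypothetical sketch of discrete event logging. In practice, platforms should
# use aind-log-utils, which forwards records to the SIPE-maintained Loki server.
logger = logging.getLogger("platform.events")

def log_event(level: int, event: str, **context) -> None:
    """Emit a discrete event (info/warning/error) with key=value context."""
    details = " ".join(f"{k}={v}" for k, v in sorted(context.items()))
    logger.log(level, "%s %s", event, details)

logging.basicConfig(level=logging.INFO)
log_event(logging.WARNING, "acquisition_dropped_frames", session="ecephys_123", count=4)
```

Keeping events as structured key=value pairs makes them easy to surface in a dashboard, while high-rate continuous metrics go to a separate, tool-specific service.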
### Quality control

Platforms are required to generate and annotate (i.e. mark as passing or failing) quality control metrics that can be used to filter data assets. Only assets that pass the quality control metrics relevant to an analysis should be used. See the [quality control page](../explore_analyze/quality_control.md) for more details on metrics and the QC Portal.
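As a toy illustration of this filtering step (the asset records and metric names here are invented, not the real QC schema):

```python
# Hypothetical sketch: keep only assets whose annotated QC metrics pass.
# The record shape and metric names are illustrative, not the actual schema.
assets = [
    {"name": "ecephys_0001", "qc": {"drift": "pass", "noise": "pass"}},
    {"name": "ecephys_0002", "qc": {"drift": "fail", "noise": "pass"}},
]

def passes_qc(asset: dict, metrics: list[str]) -> bool:
    """True only if every QC metric relevant to the analysis is marked 'pass'."""
    return all(asset["qc"].get(m) == "pass" for m in metrics)

usable = [a["name"] for a in assets if passes_qc(a, ["drift", "noise"])]
# → ["ecephys_0001"]
```

Note that "relevant to an analysis" matters: an asset failing a drift metric may still be usable for an analysis that only depends on noise metrics.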
## Pipeline development

Pipelines are a standardized series of processing steps that take raw data from a single modality and typically produce an NWB file as output. Pipelines are organized as a Nextflow pipeline, with individual Code Ocean capsules performing the internal steps. Platforms should use established pipelines for processing. Pipelines should:

- Accept raw data from a single modality, organized according to [aind-file-standards](https://github.com/AllenNeuralDynamics/aind-file-standards) when applicable
- Produce an NWB file as output
- Assess data quality and produce QC metrics and references
### Pipeline metadata

Please see the [aind-metadata-manager](https://github.com/AllenNeuralDynamics/aind-metadata-manager/); many of the requirements described here can be met through the manager using simple functions.
#### data_description.json

All processing pipelines that create derived assets should upgrade the [data_description](https://aind-data-schema.readthedocs.io/en/latest/data_description.html) to a derived data description (changing the name and data_level).

Use the [`DataDescription.from_data_description()`](https://github.com/AllenNeuralDynamics/aind-data-schema/blob/e172cb06a63b722eaeaaf8933d0a17cbedf3feea/src/aind_data_schema/core/data_description.py#L334) function to create derived data_description objects. Pass the process name as a parameter, often just `process_name="processed"`. If more source data assets were used than the one being passed into the function, also pass the optional `source_data` parameter with the names of those data assets.
```python
from pathlib import Path
from aind_data_schema.core.data_description import DataDescription

# Load the original data_description.json
original_data_description = DataDescription.model_validate_json(
    Path("data_description.json").read_text()
)

# Upgrade to the new derived data_description
derived_data_description = DataDescription.from_data_description(
    data_description=original_data_description,
    process_name="processed",
)

# Write the derived data_description to the results directory
derived_data_description.write_standard_file(output_directory="/results")
```
#### processing.json

Processing pipelines need to track each [DataProcess](https://aind-data-schema.readthedocs.io/en/latest/processing.html#dataprocess) that was run to create the derived data asset.

If processing was performed as part of a Nextflow pipeline, that should be tracked in the `Processing.pipelines` field using a [Code](https://aind-data-schema.readthedocs.io/en/latest/components/identifiers.html#code) object pointing to the GitHub repository with the Nextflow configuration. Use the `DataProcess.pipeline_name` field to indicate that processes were run as part of a pipeline.
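The relationship between these fields can be sketched with plain dataclasses. These stand-ins only mirror the shape of the aind-data-schema `Processing`/`DataProcess`/`Code` models (the real classes have more required fields), and the repository URL and process names are made up:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins only -- not the real aind-data-schema classes.
@dataclass
class Code:
    url: str  # e.g. the GitHub repo holding the Nextflow configuration

@dataclass
class DataProcess:
    name: str
    pipeline_name: Optional[str] = None  # set when run as part of a pipeline

@dataclass
class Processing:
    pipelines: list = field(default_factory=list)       # list of Code
    data_processes: list = field(default_factory=list)  # list of DataProcess

# Track the Nextflow pipeline once, then point each step at it by name
processing = Processing(
    pipelines=[Code(url="https://github.com/example-org/example-nextflow-pipeline")],
    data_processes=[
        DataProcess(name="spike_sorting", pipeline_name="example-nextflow-pipeline"),
        DataProcess(name="nwb_packaging", pipeline_name="example-nextflow-pipeline"),
    ],
)
```

The pipeline itself is recorded once in `pipelines`, while each `DataProcess` refers back to it via `pipeline_name`, so individual steps stay attributable to the run that produced them.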
#### Other metadata

Core metadata `.json` files that are not modified should be copied to the derived asset unchanged. If present, do not copy the `metadata.nd.json` file -- this file is synchronized by the indexer and should not be moved manually.
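A minimal sketch of that copy step, assuming a flat directory of core metadata files (the helper name and directory layout are illustrative):

```python
import shutil
from pathlib import Path

def copy_core_metadata(raw_dir: Path, derived_dir: Path) -> list:
    """Copy unmodified core metadata .json files into the derived asset,
    skipping metadata.nd.json, which the indexer maintains. Illustrative only."""
    copied = []
    for json_file in sorted(raw_dir.glob("*.json")):
        if json_file.name == "metadata.nd.json":
            continue  # synchronized by the indexer; never copy manually
        shutil.copy2(json_file, derived_dir / json_file.name)
        copied.append(json_file.name)
    return copied
```

Any metadata file the pipeline does modify (e.g. data_description.json, processing.json) should be written fresh rather than copied through a step like this.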
