Commit 7f455e5 ("docs: QC"), 1 parent f747820

2 files changed, 115 additions & 19 deletions

Lines changed: 11 additions & 1 deletion

@@ -1 +1,11 @@
# Quality control

All data assets generated at AIND should undergo automated (and sometimes manual) quality control before being used in analysis. To make this as efficient as possible, we provide a standardized metadata schema for tracking quality control metrics, as well as a convenient portal for reviewing QC metadata.

## Preparing QC metadata

Please see the documentation on [QualityControl](https://aind-data-schema.readthedocs.io/en/latest/quality_control.html) for a comprehensive overview of QC.

## QC Portal

Please see the [QC Portal](https://github.com/AllenNeuralDynamics/aind-qc-portal?tab=readme-ov-file) documentation for more information.

docs/source/philosophy/data_organization.md

Lines changed: 104 additions & 18 deletions
@@ -66,23 +66,25 @@ A few points:

Primary data assets are organized as follows:

```
📦<asset name, as above>
┣ 📜data_description.json (administrative information, funders, licenses, projects, etc)
┣ 📜subject.json (species, sex, DOB, unique identifier, genotype, etc)
┣ 📜procedures.json (subject surgeries, tissue preparation, water restriction, training protocols, etc)
┣ 📜instrument.json (static hardware components)
┣ 📜acquisition.json (device settings that change acquisition-to-acquisition)
┣ 📂<modality-1>
┃ ┗ 📜<list of data files>
┣ 📂<modality-2>
┃ ┗ 📜<list of data files>
┣ 📂<modality-n>
┃ ┗ 📜<list of data files>
┣ 📂derivatives (processed data generated during acquisition)
┃ ┗ 📂<label> (e.g. MIP)
┃ ┃ ┗ 📜<list of files>
┗ 📂logs (log files generated by the instrument or rig)
┃ ┗ 📜<list of files>
```
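As a minimal sketch of working with this layout, a small check could verify that an asset folder carries the five core metadata files. This is a hypothetical helper, not part of any AIND tooling:

```python
from pathlib import Path

# Core metadata files every primary asset is expected to carry,
# per the folder layout above.
REQUIRED_METADATA = [
    "data_description.json",
    "subject.json",
    "procedures.json",
    "instrument.json",
    "acquisition.json",
]


def missing_metadata(asset_dir):
    """Return the required metadata files absent from an asset folder."""
    root = Path(asset_dir)
    return [name for name in REQUIRED_METADATA if not (root / name).is_file()]
```

A check like this could run before upload to catch incomplete assets early.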

Modality terms come from controlled vocabularies in aind-data-schema-models.

@@ -120,4 +122,88 @@ Example for exaSPIM data:
┗ 📂derivatives
┃ ┗ 📂MIP
┃ ┃ ┗ 📜<list of e.g. tiff files>
```

## Derived data conventions

Anything computed in a single run should be logically grouped in a folder. The folder should be named:

<primary-asset-name>_<process-label>_<process-date>_<process-time>

Examples:

- ANM457202_2022-07-11_22-11-32_processed_2022-08-11_22-11-32
- 595262_2022-02-21_15-18-07_processed_2022-08-11_22-11-32

Processed outputs are usually the result of a multi-stage pipeline handling a single data modality, so use a modality-specific <process-label>. Other common process labels include:

- "curation": tags assigned to input data (e.g. merge/split/noise calls for ephys units)
- ...

Overlong names are difficult to read, so do not daisy-chain process labels. The goal is to keep names as simple and readable as possible, not to encode all metadata or the entire provenance chain. If stages of processing are performed manually over extended periods of time, anchor each derived asset on the primary data asset.
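The naming convention above can be sketched as a small helper. The function name is illustrative, not part of any AIND library:

```python
from datetime import datetime


def derived_asset_name(primary_asset_name, process_label, process_datetime):
    """Build <primary-asset-name>_<process-label>_<process-date>_<process-time>.

    Dates use yyyy-mm-dd and times use hh-mm-ss, matching the file
    name guidelines for this repository's assets.
    """
    stamp = process_datetime.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{primary_asset_name}_{process_label}_{stamp}"


name = derived_asset_name(
    "ANM457202_2022-07-11_22-11-32",
    "processed",
    datetime(2022, 8, 11, 22, 11, 32),
)
# -> "ANM457202_2022-07-11_22-11-32_processed_2022-08-11_22-11-32"
```

Note how the derived name keeps the full primary asset name as its prefix, anchoring the derived asset on the primary one.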

Folder organization is as follows:

```
📦<asset name, as above>
┣ 📜data_description.json
┣ 📜processing.json (describes the code, input parameters, outputs)
┣ 📜subject.json (copied from primary asset)
┣ 📜procedures.json (copied from primary asset)
┣ 📜instrument.json (copied from primary asset)
┣ 📜acquisition.json (copied from primary asset)
┣ 📂<process-label-1>
┃ ┗ 📜<list of files>
┣ 📂<process-label-2>
┃ ┗ 📜<list of files>
┗ 📂<process-label-n>
┃ ┗ 📜<list of files>
```

## File name guidelines

When naming files, we should:

- use terms from vocabularies defined in aind-data-schema, e.g.
  - modalities, institutions
  - behavior video file names
- use "yyyy-mm-dd" and "hh-mm-ss" in the local time zone for dates and times
- separate tokens with underscores, and not include underscores within tokens, e.g.
  - do this: EFIP_655568_2022-04-26_11-48-09
  - not this: EFIP-655568-2022_04_26-11_48_09
- not include illegal filename characters in tokens

## Human-in-the-loop processing pipelines

During preliminary phases of processing pipeline development, it is common to defer downstream processing until upstream processing has been manually validated. This is particularly important for pipelines involving expensive processing steps that are sensitive to the quality of upstream results.

Pipelines that involve human intervention can be treated as regular (if slow-moving) pipelines. As researchers inspect and validate individual processing steps, they construct a processed data asset incrementally, one subfolder at a time.

Guidelines for constructing a processed data asset with a human in the loop:

1. It is okay to overwrite the results of the current processing step in place.
2. Always document provenance via an intermediate processing.json within the process subfolder.
3. Do not revisit previous processing steps. If necessary, delete the asset and start a new one.
4. Once complete, generate integrated top-level JSON metadata (processing.json, quality_control.json).
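Guideline 2 can be sketched as follows. The record fields here are illustrative placeholders; real records should follow the aind-data-schema Processing model:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_step(step_dir, code_url, parameters):
    """Write a minimal intermediate processing.json inside a process subfolder.

    Field names below are illustrative, not the actual
    aind-data-schema Processing fields.
    """
    record = {
        "code_url": code_url,
        "parameters": parameters,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    step_dir = Path(step_dir)
    step_dir.mkdir(parents=True, exist_ok=True)
    (step_dir / "processing.json").write_text(json.dumps(record, indent=2))
```

Because each subfolder carries its own record, the provenance of every validated step survives even while later steps are still pending, and the per-step records can be merged into the integrated top-level processing.json at the end.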
188+
189+
## FAQ
190+
191+
**My data files already contain some of this metadata. Why store this in additional JSON files?**
192+
193+
How acquisition formats represent metadata evolves over time and often does not capture everything we need to know to interpret data. These JSON files represent our ground truth viewpoint on what is essential to know about our data in a single location.
194+
195+
Additionally, JSON files are trivially both human- and machine-readable. They are viewable on any system without additional software to be installed (a text editor is fine). They are easy to parse from code without any heavy dependencies (IGOR, H5PY, pynwb, etc).
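To illustrate the "no heavy dependencies" point, parsing one of these files needs only the standard library. The toy file contents and field names below are illustrative, not schema-exact:

```python
import json
from pathlib import Path

# Write a toy subject.json; the field names are illustrative only.
Path("subject.json").write_text('{"subject_id": "655568", "sex": "Female"}')

# Any language's JSON parser can read it back; in Python, just the
# stdlib json module.
subject = json.loads(Path("subject.json").read_text())
assert subject["subject_id"] == "655568"
```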
196+
197+
**What are "Institution" and "Group" doing in data_description.json?**
198+
199+
In the future we may need to tag cloud resources based on the originating group, which may or may not be in AIND, in order to track usage and spending.
200+
201+
**Why are we replicating metadata that we are also tracking in PowerPlatform/LIMS/SLIMS/etc?**
202+
203+
Database systems such as these are very important for reliable acquisition, however they are also barriers to external interpretability and reproducibility. They have complex schema with extraneous information that make them difficult to interpret. They have query languages (e.g. SQL) that require training to use properly. Information becomes distributed across different locations. They may have security policies that make them difficult to share with the public.
204+
205+
Files, particularly in cloud storage, are reliable and more persistent. By storing metadata essential to interpreting an acquisition session alongside the acquisition in a human-and-machine-readable format, there will always be an interpretable record of what happened even if e.g. the database stops working.
206+
207+
**What happened to the "experiment type" and "platform" asset labels?**
208+
209+
Formerly we used a short label called “experiment type” in asset names.This concept was confusing because it was difficult to distinguish from a “modality”. Then we switched to “platform” and introduced a controlled vocabulary, but people did not understand the term and so defaulted to a “primary modality,” which was not helpful. Most of our data contains multiple modalities. A recording session may contain trained behavior event data (e.g. lick times), behavior videos (e.g. face camera), neuropixels recordings, and fiber photometry recordings.
