Commit 03d2763

docs: working on data governance (#15)
* docs: working on data governance
* docs: top-level links, last update to page
* chore: comment out test_and_lint
* chore: remove init.yml
* docs: refactor move some content from governance -> organization
* Delete test_and_lint.yml
1 parent e8f720b commit 03d2763

6 files changed

Lines changed: 76 additions & 80 deletions

.github/workflows/init.yml

Lines changed: 0 additions & 52 deletions
This file was deleted.

.github/workflows/test_and_lint.yml

Lines changed: 0 additions & 26 deletions
This file was deleted.

docs/source/philosophy.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ I want to learn about...
 :hidden:

 philosophy/data_organization.md
+philosophy/data_governance.md
 philosophy/scicomp-teams.md

 ```
docs/source/philosophy/data_governance.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Data governance (policy)

Primary data are precious, especially manually generated data annotations and recordings of unique specimens. Such data must be treated with care and respect. This document describes how primary data should be handled to:

- Minimize unintended changes
- Avoid unnecessary costs for storage and compute
- Maintain agility

Teams requiring exceptions to any of the policies below should make their requests via email to the AIND Leadership Team.
## Raw data must be compressed

Compression should be performed immediately after acquisition and prior to cloud upload. Compression algorithms should be as aggressive as possible:

- lossless at minimum, preferably lossy
- without compromising downstream analysis

We will migrate to lossy compression when feature extraction and evaluation algorithms are stable. Teams must plan for dedicated time and resources to investigate the effects of lossy compression.
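As a minimal sketch of the "compress before upload" step, the helper below applies lossless gzip compression from the Python standard library; real AIND pipelines may use domain-specific codecs, so the function name and layout are illustrative assumptions, not the actual tooling.

```python
import gzip
import shutil
from pathlib import Path

def compress_raw_file(src: Path, dst_dir: Path) -> Path:
    """Losslessly compress a raw data file prior to cloud upload.

    Uses gzip at maximum compression level; production pipelines may
    prefer codecs tuned to the data modality (e.g. imaging or ephys).
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Decompressing the output and comparing it byte-for-byte with the source is a cheap sanity check before deleting any original.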
## Code that can modify primary data must be reviewed and tested

This includes primary data in on-premise and cloud storage, as well as metadata in SharePoint, Smartsheet, Power Platform, or other databases. Reviewers must be familiar with the data and capable of reviewing the code. Testing must be done in a safe environment on example data.

Best practice is to keep scripts that can potentially modify data in a GitHub repository and review them in a pull request prior to use. See Scientific Computing's [aind-data-migration-scripts](https://github.com/AllenNeuralDynamics/aind-data-migration-scripts) repository and the Code Review section here: Software Engineering Practices.docx.

To test scripts, create folders in a separate location with a copy of the data you would like to update and run your script there. Good practices include:

- dry runs that log intended changes without making changes
- limited runs that only affect one test data asset before full runs
- test runs that write to an alternative "scratch" location
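The dry-run and limited-run practices above can be sketched as a migration-script skeleton; the function name and the `*.json` asset pattern are hypothetical illustrations, not an AIND API.

```python
import logging
from pathlib import Path
from typing import Optional

def migrate(root: Path, dry_run: bool = True, limit: Optional[int] = None) -> list[Path]:
    """Log every intended change; only modify data when dry_run is False.

    In a real script, dry_run and limit would be wired to CLI flags
    (e.g. --dry-run, --limit) so the default invocation is safe.
    """
    touched: list[Path] = []
    for n, asset in enumerate(sorted(root.glob("*.json"))):
        if limit is not None and n >= limit:
            break  # limited run: touch at most `limit` test assets first
        logging.info("would update %s", asset)
        if not dry_run:
            pass  # the actual modification would go here
        touched.append(asset)
    return touched
```

Defaulting to `dry_run=True` means the destructive path is only reached when a reviewer-approved flag is passed explicitly.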
## Users who modify primary data must only use write permissions when needed

Users who log into services that control primary data (e.g. VAST, AWS, Power Platform) should log in with read-only permissions unless write permissions are needed for the task at hand. Systems must have accounts configured in such a way that this is possible (e.g. individual read-only accounts and a single write-enabled account).

This includes acquisition workstations. Accounts that have permission to acquire and potentially change primary data should only be used for that purpose. At all other times, read-only user accounts should be used.

## Derived results should not be written to the same folder as primary data

This minimizes the chances of accidentally overwriting primary data due to e.g. spelling mistakes. The only exception is when derived results are computed as part of the acquisition process (e.g. the ExaSPIM acquisition software also saves maximum intensity projections of tiles).

docs/source/philosophy/data_organization.md

Lines changed: 30 additions & 1 deletion
@@ -1,4 +1,4 @@
-# Data organization
+# Data organization (policy)

This document describes how data and metadata should be organized before they are copied into cloud storage. It covers core concepts, file names, directory structures, and metadata conventions.
@@ -171,6 +171,35 @@ When naming files, we should:

- Do this: EFIP_655568_2022-04-26_11-48-09
- Do not include illegal filename characters in tokens
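The example session name above follows a `<platform>_<subject-id>_<date>_<time>` token pattern. The helpers below build and validate such names using only filesystem-safe characters; the token names are assumptions inferred from the example, not an official AIND specification.

```python
import re
from datetime import datetime

# Assumed token layout: platform, numeric subject ID, date, time,
# joined by underscores, with hyphens inside the timestamp tokens.
SESSION_NAME = re.compile(
    r"^(?P<platform>[A-Za-z0-9]+)_(?P<subject_id>\d+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})$"
)

def build_session_name(platform: str, subject_id: int, acquired: datetime) -> str:
    """Join tokens with underscores; no illegal filename characters."""
    return f"{platform}_{subject_id}_{acquired.strftime('%Y-%m-%d_%H-%M-%S')}"

def parse_session_name(name: str) -> dict:
    """Split a session name back into its tokens, or raise on a bad name."""
    match = SESSION_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid session name: {name!r}")
    return match.groupdict()
```

Validating names at upload time catches spelling mistakes and illegal characters before they reach shared storage.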

## AIND Implementation

### On-premise systems should not be used for persistent, long-term data storage

AIND's high-performance on-premise storage system (VAST) is sized to be a ~2-week transfer buffer that enables low-level computing (e.g. compression, format conversion) and rapid transfer to cloud storage systems. Any data stored in on-premise scratch space for more than two weeks is subject to requests for deletion at any time.

The VAST system has two partitions:

1. Stage (1600TB): an access-controlled buffer for raw data compression and upload. No individual user accounts have write access to this partition, only service accounts on acquisition workstations. The stage partition has daily snapshots that expire after 3 days. `\\allen\aind\stage` (Windows), `/allen/aind/stage` (Linux)
2. Scratch (200TB): an uncontrolled space for all AIND team members to use. This share can be read and modified by any account. Data stored here is considered transient, is not intended for public sharing, and is subject to requests for deletion. The recommended scratch share organization is top-level directories for each AIND group (ephys, ophys, etc.), with subfolders for individual users. The scratch partition has daily snapshots that expire after 2 weeks. `\\allen\aind\scratch` (Windows), `/allen/aind/scratch` (Linux)

Open a ServiceNow ticket to restore data from a snapshot.
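A sketch of how the ~2-week buffer policy could be checked, assuming nothing about AIND's actual tooling: list files in a scratch tree whose modification time is older than two weeks, and therefore subject to deletion requests.

```python
import time
from pathlib import Path
from typing import Optional

TWO_WEEKS_S = 14 * 24 * 60 * 60  # the ~2-week buffer, in seconds

def stale_scratch_files(scratch_root: Path, now: Optional[float] = None) -> list[Path]:
    """Return files whose last modification is older than two weeks.

    Illustrative sweep, not an official AIND tool; `now` is injectable
    for testing.
    """
    now = time.time() if now is None else now
    return [
        path for path in sorted(scratch_root.rglob("*"))
        if path.is_file() and now - path.stat().st_mtime > TWO_WEEKS_S
    ]
```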
### Raw data can only be uploaded to the cloud using the Data Transfer Service

When manually uploading data to cloud buckets, it is easy to make mistakes that can affect others' data. The Data Transfer Service is designed to organize data and metadata consistently and automatically, and to prevent accidentally overwriting data.

Cloud storage is organized as follows:

- aind-private-data: a read-only private S3 bucket organized by session
- aind-open-data: a read-only public S3 bucket organized by session
- aind-scratch-data: a private S3 bucket that is writable by all AIND staff

Because it is easy to delete large amounts of data by accident, very few AIND users have the ability to modify or delete data in aind-open-data and aind-private-data. If errors are detected in data in those buckets, contact Scientific Computing. Mitigate errors by testing upload jobs on aind-scratch-data.

aind-open-data and aind-private-data are organized according to [Data Organization Conventions](<data_organization>). These conventions enable us to have consistently organized data that can be shared rapidly and openly. The Data Transfer Service organizes data according to these conventions as it uploads data.
## Human-in-the-loop Processing Pipelines

During preliminary phases of processing pipeline development, it is common to defer downstream processing until upstream processing has been manually validated. This is particularly important for pipelines involving expensive processing steps that are sensitive to the quality of upstream results.

docs/source/philosophy/scicomp-teams.md

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,9 @@ You should set patch version floor `>=1.0.0` and major version ceiling `<2` for

## Data Infrastructure

-Data Infrastructure maintains the core services in AIND. Some of the major ones include:
+Data Infrastructure maintains the core services in AIND.
+
+Some of the major core services:

**aind-data-transfer-service**
