# Data organization (policy)

This document describes how data and metadata should be organized before being copied into cloud storage. It covers core concepts, file names, directory structures, and metadata conventions.

…

When naming files, we should:

- Do this: EFIP_655568_2022-04-26_11-48-09
- Do not include illegal filename characters in tokens

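A compliant name like the example above can be assembled from underscore-delimited tokens. The following is a minimal sketch, not official tooling; the token order (platform, subject, date, time) matches the example, and the illegal-character set is an assumed one for illustration (consult the full convention for the exact list):

```python
import re
from datetime import datetime

# Assumed set of characters that are illegal or unsafe in filename tokens.
# Underscore is included because it is the token delimiter.
ILLEGAL = re.compile(r'[<>:"/\\|?*\s_]')

def make_name(platform: str, subject_id: str, acquired: datetime) -> str:
    """Join underscore-delimited tokens: platform_subject_date_time."""
    tokens = [platform, subject_id,
              acquired.strftime("%Y-%m-%d"), acquired.strftime("%H-%M-%S")]
    for t in tokens:
        if ILLEGAL.search(t):
            raise ValueError(f"illegal character in token: {t!r}")
    return "_".join(tokens)
```

Validating tokens before joining them, rather than sanitizing afterwards, surfaces naming mistakes at acquisition time instead of silently altering names.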
## AIND Implementation

### On-premise systems should not be used for persistent, long-term data storage.

AIND’s high-performance on-premise storage system (VAST) is sized to be a ~2-week transfer buffer that enables low-level computing (e.g. compression, format conversion) and rapid transfer to cloud storage systems. Any data stored in on-premise scratch space for more than two weeks is subject to requests for deletion at any time.

The VAST system has two partitions:

1. Stage (1600 TB): an access-controlled buffer for raw data compression and upload. No individual user accounts have write access to this partition; only service accounts on acquisition workstations do. The stage partition has daily snapshots that expire after 3 days. Paths: `\\allen\aind\stage` (Windows), `/allen/aind/stage` (Linux).
2. Scratch (200 TB): an uncontrolled space for all AIND team members to use. This share can be read and modified by any account. Data stored here is considered transient, not intended for public sharing, and subject to requests for deletion. The recommended scratch organization is a top-level directory for each AIND group (ephys, ophys, etc.), with subfolders for individual users. The scratch partition has daily snapshots that expire after 2 weeks. Paths: `\\allen\aind\scratch` (Windows), `/allen/aind/scratch` (Linux).

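The recommended scratch layout (one directory per group, then one per user) can be expressed as a small path helper. This is a sketch only; the group and user names in the test are hypothetical:

```python
from pathlib import PurePosixPath

SCRATCH_ROOT = PurePosixPath("/allen/aind/scratch")

def scratch_dir(group: str, user: str) -> PurePosixPath:
    """Recommended layout: a top-level directory per AIND group
    (ephys, ophys, etc.), then a subfolder per individual user."""
    return SCRATCH_ROOT / group / user
```

Using `PurePosixPath` keeps the result stable regardless of the platform the helper runs on; Windows clients reach the same share via `\\allen\aind\scratch`.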
Open a ServiceNow ticket to restore data from a snapshot.

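The two-week buffer policy above amounts to a simple age test. A sketch, not part of any official tooling:

```python
from datetime import datetime, timedelta

# VAST is sized as a ~2-week transfer buffer.
BUFFER = timedelta(weeks=2)

def eligible_for_deletion(last_modified: datetime, now: datetime) -> bool:
    """Data older than the buffer window is subject to deletion requests."""
    return now - last_modified > BUFFER
```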
### Raw data can only be uploaded to the cloud using the Data Transfer Service.

When manually uploading data to cloud buckets, it is easy to make mistakes that can affect others’ data. The Data Transfer Service is designed to automatically organize data and metadata consistently and to prevent accidental overwrites.

Cloud storage is organized as follows:

- `aind-private-data`: a read-only private S3 bucket organized by session
- `aind-open-data`: a read-only public S3 bucket organized by session
- `aind-scratch-data`: a private S3 bucket that is writable by all AIND staff

Because it is easy to delete large amounts of data by accident, very few AIND users have the ability to modify or delete data in `aind-open-data` and `aind-private-data`. If errors are detected in data in those buckets, contact Scientific Computing. Mitigate errors by testing upload jobs on `aind-scratch-data` first.

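One way to avoid mistakes in the protected buckets is to default test jobs to the writable scratch bucket. This is a hypothetical helper illustrating that routing, not part of the Data Transfer Service:

```python
def target_bucket(public: bool, dry_run: bool = True) -> str:
    """Route uploads: dry runs go to the writable scratch bucket;
    only real jobs target the read-only session buckets."""
    if dry_run:
        return "aind-scratch-data"
    return "aind-open-data" if public else "aind-private-data"
```

Defaulting `dry_run` to `True` makes the safe path the easy path: a job only reaches the read-only buckets when explicitly promoted.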
`aind-open-data` and `aind-private-data` are organized according to [Data Organization Conventions](<data_organization>). These conventions give us consistently organized data that can be shared rapidly and openly. The Data Transfer Service organizes data according to these conventions as it uploads data.

## Human-in-the-loop Processing Pipelines

During preliminary phases of processing pipeline development, it is common to defer downstream processing until upstream processing has been manually validated. This is particularly important for pipelines involving expensive processing steps that are sensitive to the quality of upstream results.