@@ -10,8 +10,10 @@ levels.
1010repositories :
1111 # Each repository defines a "table" in the virtual database
1212 BrentLab/harbison_2004 :
13- # REQUIRED: Specify which field is the sample identifier. At this level, it means
14- # that all datasets have a field `sample_id` that uniquely identifies samples.
13+ # REQUIRED: Specify which column is the sample identifier. The `field`
14+ # value is the actual column name in the parquet data. At the repo level,
15+ # it applies to all datasets in this repository. If not specified at
16+ # either level, the default column name "sample_id" is assumed.
1517 sample_id :
1618 field : sample_id
1719 # Repository-wide properties (apply to all datasets in this repository)
@@ -47,8 +49,9 @@ repositories:
4749 kemmeren_2014 :
4850 # optional -- see the note for `db_name` in harbison above
4951 db_name : kemmeren
50- # REQUIRED: If `sample_id` isn't defined at the repo level, then it must be
51- # defined at the dataset level for each dataset in the repo
52+ # REQUIRED: If `sample_id` isn't defined at the repo level, it must be
53+ # defined at the dataset level. The `field` value is the actual column
54+ # name in the parquet data (does not need to be literally "sample_id").
5255 sample_id :
5356 field : sample_id
5457 # Same logical fields, different physical paths
@@ -144,6 +147,62 @@ during metadata extraction and query filtering.
1441472. **Type consistency**: When source data might be extracted with incorrect type
1451483. **Performance**: Helps with query optimization and prevents type mismatches
146149
150+ ## Tags
151+
152+ Tags are arbitrary string key/value pairs for annotating datasets. They follow
153+ the same hierarchy as property mappings: repo-level tags apply to all datasets
154+ in the repository, dataset-level tags apply only to that dataset, and
155+ dataset-level tags override repo-level tags with the same key.
156+
157+ ```yaml
158+ repositories:
159+ BrentLab/harbison_2004:
160+ # Repo-level tags apply to all datasets in this repository
161+ tags:
162+ assay: binding
163+ organism: yeast
164+ dataset:
165+ harbison_2004:
166+ sample_id:
167+ field: sample_id
168+ # Dataset-level tags override repo-level tags with the same key
169+ tags:
170+ assay: chip-chip
171+
172+ BrentLab/kemmeren_2014:
173+ tags:
174+ assay: perturbation
175+ organism: yeast
176+ dataset:
177+ kemmeren_2014:
178+ sample_id:
179+ field: sample_id
180+ ```
181+
182+ Access merged tags via `vdb.get_tags(db_name)`, identifying datasets by
183+ their name as it appears in `vdb.tables()`:
184+
185+ ```python
186+ from tfbpapi.virtual_db import VirtualDB
187+
188+ vdb = VirtualDB("datasets.yaml")
189+
190+ # Returns {"assay": "chip-chip", "organism": "yeast"}
191+ # (dataset-level assay overrides repo-level)
192+ vdb.get_tags("harbison")
193+
194+ # Returns {"assay": "perturbation", "organism": "yeast"}
195+ vdb.get_tags("kemmeren")
196+ ```
197+
198+ The underlying `MetadataConfig` (available as `vdb.config`) exposes the same
199+ data via `(repo_id, config_name)` pairs for programmatic or developer use:
200+
201+ ```python
202+ # Equivalent to vdb.get_tags("harbison") above
203+ vdb.config.get_tags("BrentLab/harbison_2004", "harbison_2004")
204+ ```
205+
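The override rule for tags can be pictured as a plain dictionary merge in which dataset-level keys win. The sketch below is illustrative only: `merge_tags` is a hypothetical helper, not part of `tfbpapi`, and the library's actual merge logic may differ.

```python
def merge_tags(repo_tags: dict[str, str], dataset_tags: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: combine repo-level and dataset-level tags.

    Keys present at both levels take the dataset-level value.
    """
    # Later unpacking wins, so dataset-level values override repo-level ones.
    return {**repo_tags, **dataset_tags}


# Tag values taken from the harbison_2004 YAML example in this section.
repo_tags = {"assay": "binding", "organism": "yeast"}
dataset_tags = {"assay": "chip-chip"}

merged = merge_tags(repo_tags, dataset_tags)
print(merged)  # {'assay': 'chip-chip', 'organism': 'yeast'}
```

A dataset with no tags of its own, such as `kemmeren_2014` in the example, simply inherits the repo-level tags unchanged.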
147206## Comparative Datasets
148207
149208Comparative datasets differ from other dataset types in that they represent
@@ -152,9 +211,10 @@ Each row relates 2+ samples from other datasets.
152211
153212### Structure
154213
155- Comparative datasets use `source_sample` fields instead of a single `sample_id`:
214+ Comparative datasets use `source_sample` fields instead of a single sample
215+ identifier column:
156216- Multiple fields with `role: source_sample`
157- - Each contains composite identifier: `"repo_id;config_name;sample_id"`
217+ - Each contains a composite identifier: `"repo_id;config_name;sample_id_value"`
158218- Example: `binding_id = "BrentLab/harbison_2004;harbison_2004;42"`
159219
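To make the composite identifier format concrete, here is a minimal sketch of splitting one into its three parts. `SourceSample` and `parse_source_sample` are hypothetical names introduced for illustration; they are not part of `tfbpapi`, and the library may parse these identifiers differently.

```python
from typing import NamedTuple


class SourceSample(NamedTuple):
    """Hypothetical container for the parts of a composite identifier."""

    repo_id: str
    config_name: str
    sample_id_value: str


def parse_source_sample(composite: str) -> SourceSample:
    """Split "repo_id;config_name;sample_id_value" into its components."""
    # maxsplit=2 keeps any further ";" inside the sample identifier value.
    parts = composite.split(";", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 ';'-separated parts, got {len(parts)}: {composite!r}")
    return SourceSample(*parts)


ref = parse_source_sample("BrentLab/harbison_2004;harbison_2004;42")
print(ref.repo_id)          # BrentLab/harbison_2004
print(ref.config_name)      # harbison_2004
print(ref.sample_id_value)  # 42
```

Note that the sample identifier value comes back as a string (`"42"`); converting it to the type of the source column is left to the caller.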
160220### Fields
@@ -206,10 +266,11 @@ build on each other. Using `harbison` as an example primary dataset and
206266
207267**1. Metadata view**
208268
209- One row per unique `sample_id`. Derived columns from the configuration
210- (e.g., `carbon_source`, `temperature_celsius`) are resolved here using
211- datacard definitions, factor aliases, and missing value labels. This is
212- the primary view for querying sample-level metadata.
269+ One row per unique sample identifier (the column configured via
270+ `sample_id: {field: <column_name>}`). Derived columns from the
271+ configuration (e.g., `carbon_source`, `temperature_celsius`) are resolved
272+ here using datacard definitions, factor aliases, and missing value labels.
273+ This is the primary view for querying sample-level metadata.
213274
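As a rough illustration of "one row per unique sample identifier", the sketch below first resolves which column holds the identifier and then keeps one row per value. Everything here is hypothetical: the helper names, the `gm_id` column, and the sample rows are made up for the example, and the precedence of dataset-level over repo-level settings is an assumption by analogy with tags. The real view also resolves derived columns, which this sketch omits.

```python
DEFAULT_SAMPLE_ID_FIELD = "sample_id"


def resolve_sample_id_field(repo_cfg: dict, dataset_cfg: dict) -> str:
    """Hypothetical: pick the sample identifier column name from the config."""
    # Assumption: a dataset-level setting is checked before the repo-level one.
    for level in (dataset_cfg, repo_cfg):
        field = (level.get("sample_id") or {}).get("field")
        if field:
            return field
    # Not specified at either level: fall back to the documented default.
    return DEFAULT_SAMPLE_ID_FIELD


def one_row_per_sample(rows: list[dict], id_field: str) -> list[dict]:
    """Keep the first row seen for each distinct value of id_field."""
    seen = set()
    unique = []
    for row in rows:
        key = row[id_field]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows; "gm_id" stands in for a column not literally named "sample_id".
rows = [
    {"gm_id": 1, "carbon_source": "glucose"},
    {"gm_id": 1, "carbon_source": "glucose"},
    {"gm_id": 2, "carbon_source": "galactose"},
]
id_field = resolve_sample_id_field({}, {"sample_id": {"field": "gm_id"}})
meta = one_row_per_sample(rows, id_field)
print(id_field, len(meta))  # gm_id 2
```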
214275**2. Raw data view**
215276
@@ -239,7 +300,7 @@ or filter by source dataset without parsing composite IDs in SQL.
239300```
240301__ harbison_parquet (raw parquet, not directly exposed)
241302 |
242- +-> harbison_meta (deduplicated, one row per sample_id,
303+ +-> harbison_meta (deduplicated, one row per sample identifier,
243304 | with derived columns from config)
244305 |
245306 +-> harbison (full parquet joined to harbison_meta)