This repository was archived by the owner on Mar 23, 2026. It is now read-only.

Commit ee655ad

Merge pull request #79 from cmatKhan/dealing_with_nonembedded
merge now, deal with code review later
2 parents 21f6511 + 3cb8cf7 commit ee655ad

8 files changed

Lines changed: 3330 additions & 1723 deletions

File tree

docs/tutorials/virtual_db_tutorial.ipynb

Lines changed: 1653 additions & 1642 deletions
Large diffs are not rendered by default.

docs/virtual_db.md

Lines changed: 60 additions & 0 deletions
@@ -23,6 +23,66 @@ For comparative analysis datasets, VirtualDB creates:
2323
See the [configuration guide](virtual_db_configuration.md) for setup details
2424
and the [tutorial](tutorials/virtual_db_tutorial.ipynb) for usage examples.
2525

26+
## Advanced Usage
27+
28+
After any public method is called (e.g. `vdb.tables()`), the underlying DuckDB
29+
connection is available as `vdb._db`. You can use `_db` to execute any SQL
30+
on the database, e.g. creating additional views or an in-memory table.
31+
32+
Custom **views** created this way appear in `tables()`, `describe()`, and
33+
`get_fields()` automatically because those methods query DuckDB's
34+
`information_schema`. Custom **tables** do not appear in `tables()` (which
35+
only lists views), but are fully queryable via `vdb.query()`.
36+
37+
Call at least one public method first to ensure the connection is initialized
38+
before accessing `_db` directly.
39+
40+
Example -- create a materialized analysis table::
41+
42+
# Trigger view registration
43+
vdb.tables()
44+
45+
# Materialize a complex query as an in-memory table.
46+
# This example selects one "best" Hackett-2020 sample per regulator
47+
# using a priority system: ZEV+P > GEV+P > GEV+M.
48+
vdb._db.execute("""
49+
CREATE OR REPLACE TABLE hackett_analysis_set AS
50+
WITH regulator_tiers AS (
51+
SELECT
52+
regulator_locus_tag,
53+
CASE
54+
WHEN BOOL_OR(mechanism = 'ZEV' AND restriction = 'P') THEN 1
55+
WHEN BOOL_OR(mechanism = 'GEV' AND restriction = 'P') THEN 2
56+
ELSE 3
57+
END AS tier
58+
FROM hackett_meta
59+
WHERE regulator_locus_tag NOT IN ('Z3EV', 'GEV')
60+
GROUP BY regulator_locus_tag
61+
),
62+
tier_filter AS (
63+
SELECT
64+
h.sample_id, h.regulator_locus_tag, h.regulator_symbol,
65+
h.mechanism, h.restriction, h.date, h.strain, t.tier
66+
FROM hackett_meta h
67+
JOIN regulator_tiers t USING (regulator_locus_tag)
68+
WHERE
69+
(t.tier = 1 AND h.mechanism = 'ZEV' AND h.restriction = 'P')
70+
OR (t.tier = 2 AND h.mechanism = 'GEV' AND h.restriction = 'P')
71+
OR (t.tier = 3 AND h.mechanism = 'GEV' AND h.restriction = 'M')
72+
)
73+
SELECT DISTINCT
74+
sample_id, regulator_locus_tag, regulator_symbol,
75+
mechanism, restriction, date, strain
76+
FROM tier_filter
77+
WHERE regulator_symbol NOT IN ('GCN4', 'RDS2', 'SWI1', 'MAC1')
78+
ORDER BY regulator_locus_tag, sample_id
79+
""")
80+
81+
df = vdb.query("SELECT * FROM hackett_analysis_set")
82+
83+
Tables and views created this way are in-memory only and do not persist across
84+
VirtualDB instances. They exist for the lifetime of the DuckDB connection.
85+
2686
## API Reference
2787

2888
::: tfbpapi.virtual_db.VirtualDB

docs/virtual_db_configuration.md

Lines changed: 72 additions & 11 deletions
@@ -10,8 +10,10 @@ levels.
1010
repositories:
1111
# Each repository defines a "table" in the virtual database
1212
BrentLab/harbison_2004:
13-
# REQUIRED: Specify which field is the sample identifier. At this level, it means
14-
# that all datasets have a field `sample_id` that uniquely identifies samples.
13+
# REQUIRED: Specify which column is the sample identifier. The `field`
14+
# value is the actual column name in the parquet data. At the repo level,
15+
# it applies to all datasets in this repository. If not specified at
16+
# either level, the default column name "sample_id" is assumed.
1517
sample_id:
1618
field: sample_id
1719
# Repository-wide properties (apply to all datasets in this repository)
@@ -47,8 +49,9 @@ repositories:
4749
kemmeren_2014:
4850
# optional -- see the note for `db_name` in harbison above
4951
db_name: kemmeren
50-
# REQUIRED: If `sample_id` isn't defined at the repo level, then it must be
51-
# defined at the dataset level for each dataset in the repo
52+
# REQUIRED: If `sample_id` isn't defined at the repo level, it must be
53+
# defined at the dataset level. The `field` value is the actual column
54+
# name in the parquet data (does not need to be literally "sample_id").
5255
sample_id:
5356
field: sample_id
5457
# Same logical fields, different physical paths
@@ -144,6 +147,62 @@ during metadata extraction and query filtering.
144147
2. **Type consistency**: When source data might be extracted with incorrect type
145148
3. **Performance**: Helps with query optimization and prevents type mismatches
146149
150+
## Tags
151+
152+
Tags are arbitrary string key/value pairs for annotating datasets. They follow
153+
the same hierarchy as property mappings: repo-level tags apply to all datasets
154+
in the repository, dataset-level tags apply only to that dataset, and
155+
dataset-level tags override repo-level tags with the same key.
156+
157+
```yaml
158+
repositories:
159+
BrentLab/harbison_2004:
160+
# Repo-level tags apply to all datasets in this repository
161+
tags:
162+
assay: binding
163+
organism: yeast
164+
dataset:
165+
harbison_2004:
166+
sample_id:
167+
field: sample_id
168+
# Dataset-level tags override repo-level tags with the same key
169+
tags:
170+
assay: chip-chip
171+
172+
BrentLab/kemmeren_2014:
173+
tags:
174+
assay: perturbation
175+
organism: yeast
176+
dataset:
177+
kemmeren_2014:
178+
sample_id:
179+
field: sample_id
180+
```
181+
182+
Access merged tags via `vdb.get_tags(db_name)`, identifying datasets by
183+
their name as it appears in `vdb.tables()`:
184+
185+
```python
186+
from tfbpapi.virtual_db import VirtualDB
187+
188+
vdb = VirtualDB("datasets.yaml")
189+
190+
# Returns {"assay": "chip-chip", "organism": "yeast"}
191+
# (dataset-level assay overrides repo-level)
192+
vdb.get_tags("harbison")
193+
194+
# Returns {"assay": "perturbation", "organism": "yeast"}
195+
vdb.get_tags("kemmeren")
196+
```
197+
198+
The underlying `MetadataConfig` (available as `vdb.config`) exposes the same
199+
data via `(repo_id, config_name)` pairs for programmatic or developer use:
200+
201+
```python
202+
# Equivalent to vdb.get_tags("harbison") above
203+
vdb.config.get_tags("BrentLab/harbison_2004", "harbison_2004")
204+
```
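
The override rule for tags amounts to a dictionary merge in which dataset-level keys win over repo-level keys. The following is a minimal sketch of that rule only, assuming the tags have already been parsed from YAML into plain dicts. `merge_tags` is a hypothetical helper written for illustration; it is not part of `tfbpapi` and may differ from how `MetadataConfig` merges tags internally.

```python
def merge_tags(repo_tags, dataset_tags):
    """Hypothetical helper: combine repo-level and dataset-level tags.

    Dataset-level values replace repo-level values that share a key.
    Either argument may be None when that level defines no tags.
    """
    merged = dict(repo_tags or {})
    merged.update(dataset_tags or {})
    return merged


# Tag values copied from the YAML example in this section.
harbison_tags = merge_tags(
    {"assay": "binding", "organism": "yeast"},  # repo level
    {"assay": "chip-chip"},                     # dataset level
)
kemmeren_tags = merge_tags(
    {"assay": "perturbation", "organism": "yeast"},  # repo level
    None,                                            # no dataset-level tags
)
```

Copying the repo-level dict before updating it keeps the repo-level tags unchanged, so the same repo-level dict can be reused for every dataset in the repository.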
205+
147206
## Comparative Datasets
148207
149208
Comparative datasets differ from other dataset types in that they represent
@@ -152,9 +211,10 @@ Each row relates 2+ samples from other datasets.
152211
153212
### Structure
154213
155-
Comparative datasets use `source_sample` fields instead of a single `sample_id`:
214+
Comparative datasets use `source_sample` fields instead of a single sample
215+
identifier column:
156216
- Multiple fields with `role: source_sample`
157-
- Each contains composite identifier: `"repo_id;config_name;sample_id"`
217+
- Each contains a composite identifier: `"repo_id;config_name;sample_id_value"`
158218
- Example: `binding_id = "BrentLab/harbison_2004;harbison_2004;42"`
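
A composite identifier of this shape can be split back into its three parts outside of SQL. The sketch below assumes that none of the three components contains a semicolon; `SourceSample` and `parse_source_sample` are hypothetical names used for illustration and are not part of `tfbpapi`.

```python
from typing import NamedTuple


class SourceSample(NamedTuple):
    """Hypothetical container for the parts of a composite identifier."""

    repo_id: str
    config_name: str
    sample_id_value: str


def parse_source_sample(composite: str) -> SourceSample:
    """Split "repo_id;config_name;sample_id_value" into its three parts."""
    parts = composite.split(";")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 ';'-separated parts, got {len(parts)}: {composite!r}"
        )
    return SourceSample(*parts)


parsed = parse_source_sample("BrentLab/harbison_2004;harbison_2004;42")
```

The sample identifier is kept as a string here because the composite format does not say what type the underlying column has.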
159219
160220
### Fields
@@ -206,10 +266,11 @@ build on each other. Using `harbison` as an example primary dataset and
206266

207267
**1. Metadata view**
208268

209-
One row per unique `sample_id`. Derived columns from the configuration
210-
(e.g., `carbon_source`, `temperature_celsius`) are resolved here using
211-
datacard definitions, factor aliases, and missing value labels. This is
212-
the primary view for querying sample-level metadata.
269+
One row per unique sample identifier (the column configured via
270+
`sample_id: {field: <column_name>}`). Derived columns from the
271+
configuration (e.g., `carbon_source`, `temperature_celsius`) are resolved
272+
here using datacard definitions, factor aliases, and missing value labels.
273+
This is the primary view for querying sample-level metadata.
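
The "one row per sample identifier" idea can be illustrated with a small deduplicating view. This sketch uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies. The table name, column names, and rows are toy values invented for the example, and the real metadata view also resolves derived columns, which is not shown here.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Toy stand-in for the raw data: several measurement rows per sample.
con.execute(
    "CREATE TABLE demo_raw ("
    "sample_id INTEGER, regulator TEXT, target TEXT, score REAL)"
)
con.executemany(
    "INSERT INTO demo_raw VALUES (?, ?, ?, ?)",
    [
        (1, "REG_A", "TARGET_1", 0.5),
        (1, "REG_A", "TARGET_2", 1.2),
        (2, "REG_B", "TARGET_1", 0.1),
    ],
)

# Metadata-style view: keep only sample-level columns and deduplicate,
# which leaves one row per sample identifier.
con.execute(
    "CREATE VIEW demo_meta AS "
    "SELECT DISTINCT sample_id, regulator FROM demo_raw"
)

meta_rows = con.execute(
    "SELECT sample_id, regulator FROM demo_meta ORDER BY sample_id"
).fetchall()
```

`SELECT DISTINCT` yields one row per sample only when every selected column is constant within a sample, which is the property that makes a column sample-level metadata in the first place.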
213274

214275
**2. Raw data view**
215276

@@ -239,7 +300,7 @@ or filter by source dataset without parsing composite IDs in SQL.
239300
```
240301
__harbison_parquet (raw parquet, not directly exposed)
241302
|
242-
+-> harbison_meta (deduplicated, one row per sample_id,
303+
+-> harbison_meta (deduplicated, one row per sample identifier,
243304
| with derived columns from config)
244305
|
245306
+-> harbison (full parquet joined to harbison_meta)
