This repository was archived by the owner on Mar 23, 2026. It is now read-only.

Commit 21f6511

Merge pull request #74 from cmatKhan/touch_up_docs
cleaning up docs
2 parents 2724b59 + 809594a commit 21f6511

2 files changed

Lines changed: 30 additions & 25 deletions

docs/virtual_db.md

Lines changed: 22 additions & 18 deletions
@@ -1,27 +1,31 @@
 # VirtualDB

-VirtualDB provides a unified query interface across heterogeneous datasets with
-different experimental condition structures and terminologies. Each dataset
-defines experimental conditions in its own way, with properties stored at
-different hierarchy levels (repository, dataset, or field) and using different
-naming conventions. VirtualDB uses an external YAML configuration to map these
-varying structures to a common schema, normalize factor level names (e.g.,
-"D-glucose", "dextrose", "glu" all become "glucose"), and enable cross-dataset
-queries with standardized field names and values.
+VirtualDB provides a SQL query interface across heterogeneous HuggingFace
+datasets using an in-memory DuckDB database. Each dataset defines experimental
+conditions in its own way, with properties stored at different hierarchy levels
+(repository, dataset, or field) and using different naming conventions.
+VirtualDB uses an external YAML configuration to map these varying structures
+to a common schema, normalize factor level names (e.g., "D-glucose",
+"dextrose", "glu" all become "glucose"), and enable cross-dataset queries with
+standardized field names and values.

-## API Reference
+For primary datasets, VirtualDB creates:

-::: tfbpapi.virtual_db.VirtualDB
-    options:
-      show_root_heading: true
-      show_source: true
+- **`<db_name>_meta`** -- one row per sample with derived metadata columns
+- **`<db_name>`** -- full measurement-level data joined to the metadata view

-### Helper Functions
+For comparative analysis datasets, VirtualDB creates:

-::: tfbpapi.virtual_db.get_nested_value
-    options:
-      show_root_heading: true
+- **`<db_name>_expanded`** -- the raw data with composite ID fields parsed
+  into `<link_field>_source` (aliased to configured `db_name`) and
+  `<link_field>_id` (sample_id) columns
+
+See the [configuration guide](virtual_db_configuration.md) for setup details
+and the [tutorial](tutorials/virtual_db_tutorial.ipynb) for usage examples.
+
+## API Reference

-::: tfbpapi.virtual_db.normalize_value
+::: tfbpapi.virtual_db.VirtualDB
     options:
       show_root_heading: true
+      show_source: true
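
The factor-level normalization described in the new intro ("D-glucose", "dextrose", "glu" all become "glucose") can be sketched in plain Python. This is an illustration only, not the package's implementation: `FACTOR_ALIASES`, `build_lookup`, and `normalize_level` are hypothetical names, and the alias table is written inline here, whereas VirtualDB reads its mappings from the external YAML configuration.

```python
# Hypothetical alias table: canonical level -> spellings seen in source datasets.
# In VirtualDB the equivalent mapping comes from the YAML configuration.
FACTOR_ALIASES = {
    "glucose": ["D-glucose", "dextrose", "glu"],
}


def build_lookup(aliases):
    """Invert the alias table into a case-insensitive spelling -> canonical map."""
    lookup = {}
    for canonical, spellings in aliases.items():
        # The canonical name maps to itself so it passes through unchanged.
        lookup[canonical.lower()] = canonical
        for spelling in spellings:
            lookup[spelling.lower()] = canonical
    return lookup


def normalize_level(value, lookup):
    """Return the canonical level, or the stripped input if no alias is known."""
    stripped = value.strip()
    return lookup.get(stripped.lower(), stripped)


LOOKUP = build_lookup(FACTOR_ALIASES)
normalized = [
    normalize_level(v, LOOKUP)
    for v in ["D-glucose", "dextrose", "glu", "galactose"]
]
```

Unknown levels such as "galactose" pass through unchanged in this sketch. Whether VirtualDB itself passes them through, warns, or rejects them is not stated in this commit.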

tfbpapi/virtual_db.py

Lines changed: 8 additions & 7 deletions
@@ -6,13 +6,14 @@
 https://brentlab.github.io/tfbpapi/huggingface_datacard/. Next, a developer can create
 a virtualDB configuration file that describes which huggingface repos and datasets to
 use, a set of common fields, datasets that contain comparative analytics, and more.
-VirtualDB, this code, then uses DuckDB to construct tables and views are
-which are lazily created over Parquet files which are cached locally. VirtualDB uses
-the information in the datacard to create metadata views which describe sample level
-features. Derived columns are attached to both the metadata and full data views. Any
-comparative analysis datasets are also parsed and joined to the primary datasets'
-metadata views. The expectation is that a developer will use this interface to write
-SQL queries against the views to provide an API to downstream users and applications.
+VirtualDB, this code, then uses DuckDB to construct views that are lazily created
+over Parquet files cached locally. For primary datasets, VirtualDB creates metadata
+views (one row per sample with derived columns) and full data views (measurement-level
+data joined to metadata). For comparative analysis datasets, VirtualDB creates expanded
+views that parse composite ID fields into ``_source`` (aliased to the configured
+db_name) and ``_id`` (sample_id) columns. The expectation is that a developer will
+use this interface to write SQL queries against the views to provide an API to
+downstream users and applications.

 Example Usage::

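
The three view shapes named in the updated docstring can be sketched with a toy schema. This is a stand-in under stated assumptions, not VirtualDB's code: it uses the standard-library `sqlite3` module instead of DuckDB, in-memory tables instead of cached Parquet files, and hypothetical names throughout (`demo`, `cmp`, `binding_id`, `carbon_source`, the `org/demo_repo` repository). The `source;id` composite format is also an assumption, since the commit does not show the real one.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; all table, view, and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Measurement-level rows for a hypothetical primary dataset (db_name = 'demo').
    CREATE TABLE demo_raw (sample_id TEXT, carbon_raw TEXT, target TEXT, score REAL);
    INSERT INTO demo_raw VALUES
        ('1', 'D-glucose', 'YAL001C', 0.5),
        ('1', 'D-glucose', 'YAL002W', 1.5),
        ('2', 'galactose', 'YAL001C', 2.0);

    -- <db_name>_meta: one row per sample with a derived metadata column.
    CREATE VIEW demo_meta AS
    SELECT DISTINCT
        sample_id,
        CASE carbon_raw WHEN 'D-glucose' THEN 'glucose' ELSE carbon_raw END
            AS carbon_source
    FROM demo_raw;

    -- <db_name>: measurement-level data joined to the metadata view.
    CREATE VIEW demo AS
    SELECT r.sample_id, r.target, r.score, m.carbon_source
    FROM demo_raw AS r
    JOIN demo_meta AS m ON m.sample_id = r.sample_id;

    -- Comparative dataset whose link field holds an assumed 'source;id' composite.
    CREATE TABLE cmp_raw (binding_id TEXT, rank_overlap REAL);
    INSERT INTO cmp_raw VALUES
        ('org/demo_repo;1', 0.8),
        ('org/demo_repo;2', 0.3);

    -- <db_name>_expanded: composite ID split into <link_field>_source
    -- (aliased to the db_name) and <link_field>_id (the sample_id).
    CREATE VIEW cmp_expanded AS
    SELECT
        binding_id,
        rank_overlap,
        CASE substr(binding_id, 1, instr(binding_id, ';') - 1)
            WHEN 'org/demo_repo' THEN 'demo'
        END AS binding_id_source,
        substr(binding_id, instr(binding_id, ';') + 1) AS binding_id_id
    FROM cmp_raw;
    """
)

# One row per sample in the metadata view.
meta_rows = conn.execute(
    "SELECT sample_id, carbon_source FROM demo_meta ORDER BY sample_id"
).fetchall()

# The parsed _id column lets the comparative rows join back to sample metadata.
joined_rows = conn.execute(
    """
    SELECT c.binding_id_source, m.carbon_source, c.rank_overlap
    FROM cmp_expanded AS c
    JOIN demo_meta AS m ON m.sample_id = c.binding_id_id
    ORDER BY c.rank_overlap DESC
    """
).fetchall()
```

The last query is the kind of SQL a developer might write against the views: once the composite ID is split, comparative results can be filtered by standardized sample metadata such as `carbon_source`.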
