Embeddings generated by AI models are typically Superimposed Embeddings (Multimodal/Fixed-Size). The concepts they represent overlap heavily, making meaningful interpretation or truncation impossible. Variable Resolution Embeddings (Multi-Scale/Adaptive), also known as Matryoshka Representation Learning (MRL), reorganize the embedding space so you can truncate while still retaining meaning. However, this is essentially a reduction in resolution (think of it like blurring an image). Devansh developed Fractal Embeddings which restructure the embedding space into a hierarchy. Truncation now allows moving up to a higher conceptual level while preserving full resolution and accuracy at that level. This repository explores applying Fractal Embeddings to embeddings derived from biology.
TL;DR: The fractal adapter works well, improving accuracy over the original embeddings while learning a hierarchical structure grounded in biology.
A primary focus of foundation models in biology is single-cell RNA sequencing (scRNA-seq) data. These datasets typically represent each cell as a high-dimensional vector (~20,000 dimensions), where each entry corresponds to the expression level of a gene as a proxy for protein abundance.
Genentech has released SCimilarity, which includes:
- A reference dataset of approximately 23 million single-cell profiles
- A single-cell foundation model (scFM) that produces 128-dimensional embeddings per cell
The model was trained on a labeled subset of 7.9 million cells from 56 studies, annotated with Cell Ontology (CL) terms covering 203 unique cell types. These hierarchical CL annotations (based on "is-a" relationships) provide a natural structure that can be leveraged to construct hierarchies for fractal embeddings.
The complete dataset, including all 23 million embeddings, is available in TileDB format. The labeled training subset is also provided with:
- An hnswlib kNN index for efficient similarity search
- A corresponding reference labels file
All of these resources—along with the SCimilarity model and an IVFPQ implementation—have been integrated into a fully client-side web application called CytoVerse. CytoVerse enables in-browser exploration, similarity search, and visualization of cells in embedding space.
The Cell Ontology (CL) is a Directed Acyclic Graph (DAG) with ~3,200 cell type terms connected by is_a relationships. Our 203 cell types sit at varying depths in this DAG. To create a strict 4-level tree suitable for hierarchical loss training, we:
1. **Map names to CL IDs.** Parse `cl-basic.obo` with `obonet` and match each of the 203 cell type names to their CL identifiers via canonical names and synonyms. 201/203 match automatically; "native cell" and "animal cell" are obsolete terms requiring manual overrides.
2. **Define anchor nodes** at Level 1 (System) and Level 2 (Lineage) as fixed snap-points in the ontology:
   - Level 1 (System): Immune (`CL:0000988`), Epithelial (`CL:0000066`), Neural (`CL:0002319`), Endothelial (`CL:0000115`), Stromal (`CL:0002320`), Muscle (`CL:0000187`), with Stem/Progenitor (`CL:0000034`) as a fallback
   - Level 2 (Lineage): Lymphoid, Myeloid, Innate Lymphoid, Neuron, Glial, Fibroblast, Stromal, Adipocyte, Smooth/Cardiac/Skeletal Muscle
3. **Traverse the DAG** for each cell type using a two-pass anchor matching strategy:
   - For each CL ID, find all ontology ancestors via `nx.descendants()` (obonet edges point child-to-parent)
   - Pass 1: find the nearest primary Level 1 anchor by shortest path length
   - Pass 2: only if no primary anchor is reachable, fall back to Stem/Progenitor. This avoids the "stem cell trap" where many differentiated cells (monocytes, NK cells, etc.) have short paths to `stem cell` via `progenitor cell`
   - Level 3 (Type): the immediate parent on the shortest path from the leaf to its Level 2 anchor
   - Level 4 (Subtype): the original 203 cell type names
4. **Export** all 7,913,892 embeddings from the hnswlib kNN annotation index with their 4-level hierarchy labels as a single Parquet file with dictionary-encoded categorical columns.
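The two-pass anchor matching in step 3 can be sketched with `networkx`. This is a minimal illustration, not the repo's actual code: the function name is hypothetical, the anchor IDs are the ones listed above, and in practice the graph would come from `obonet.read_obo("cl-basic.obo")`.

```python
import networkx as nx

# Level 1 (System) anchor IDs from the ontology, as listed above
PRIMARY_ANCHORS = {
    "CL:0000988": "Immune", "CL:0000066": "Epithelial", "CL:0002319": "Neural",
    "CL:0000115": "Endothelial", "CL:0002320": "Stromal", "CL:0000187": "Muscle",
}
FALLBACK_ID, FALLBACK_NAME = "CL:0000034", "Stem/Progenitor"

def level1_system(graph: nx.MultiDiGraph, cl_id: str) -> str:
    """Assign a Level 1 system via two-pass anchor matching.
    obonet edges run child -> parent, so nx.descendants() yields ancestors."""
    ancestors = nx.descendants(graph, cl_id)
    # Pass 1: nearest primary anchor by shortest path length
    hits = [(nx.shortest_path_length(graph, cl_id, anchor), name)
            for anchor, name in PRIMARY_ANCHORS.items() if anchor in ancestors]
    if hits:
        return min(hits)[1]
    # Pass 2: only now consider Stem/Progenitor -- avoids the "stem cell trap"
    return FALLBACK_NAME if FALLBACK_ID in ancestors else "Other"
```

Because the fallback is checked only when no primary anchor is reachable, a monocyte with a short path to `stem cell` via `progenitor cell` still lands in Immune.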
Initialize submodules, create a virtual env, and install Python dependencies:
git submodule update --init --recursive
uv venv
source .venv/bin/activate
uv sync
Create a ./data/ folder, then download and unpack the SCimilarity model and dataset (~30 GB) into data/models/scimilarity/model_v1.1.
Convert the SCimilarity embeddings and flat labels into a 4-level hierarchy:
uv run python ingest.py
Output: data/scimilarity_embeddings.parquet (3.80 GB)
| Column | Type | Description | Examples |
|---|---|---|---|
| `embedding` | `fixed_size_list<float32>[128]` | 128-dim SCimilarity embedding | |
| `level1_system` | `dictionary<string>` | 8 categories | Immune, Neural, Epithelial, Muscle |
| `level2_lineage` | `dictionary<string>` | 17 categories | Myeloid, Lymphoid, Neuron, Fibroblast |
| `level3_type` | `dictionary<string>` | ~120 categories | monocyte, glutamatergic neuron, cardiac muscle cell |
| `level4_subtype` | `dictionary<string>` | 203 categories | classical monocyte, CD8-positive T cell, astrocyte |
7,913,892 cells across 203 cell types, mapped to 8 Level 1 systems:
| System | Cells | % |
|---|---|---|
| Immune | 4,308,236 | 54.4% |
| Epithelial | 956,129 | 12.1% |
| Neural | 635,314 | 8.0% |
| Stromal | 544,481 | 6.9% |
| Other | 466,874 | 5.9% |
| Muscle | 396,372 | 5.0% |
| Endothelial | 385,952 | 4.9% |
| Stem/Progenitor | 220,534 | 2.8% |
Pre-stratify the parquet for balanced sequential reads (round-robin interleave by L4 subtype):
uv run python stratify.py
Output: data/stratified.parquet -- same data, reordered so that taking the first N rows gives balanced representation across all 4 hierarchy levels.
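The round-robin interleave can be sketched as follows. This is a simplified in-memory version of the idea, assuming a list of per-row labels; `stratify.py` itself may differ:

```python
from collections import defaultdict
from itertools import zip_longest

def round_robin_order(labels):
    """Return row indices interleaved round-robin by label, so any prefix
    of the reordered rows covers all labels as evenly as possible."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    # Take one index from each group per round; drop the padding Nones
    rounds = zip_longest(*groups.values())
    return [i for r in rounds for i in r if i is not None]
```

Applied to the L4 subtype column, taking the first N rows of the reordered table then samples every subtype before repeating any, which also balances the coarser levels above it.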
Train fractal adapters at different hierarchy depths. The --num-levels flag controls how many levels the model learns to separate, and --scale-dim sets the embedding dimension per level:
uv run python train.py --num-levels 2 --scale-dim 64 --max-samples 100000 --epochs 20
uv run python eval.py --model-path data/fractal_adapter_2L.pt
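To make the two flags concrete, here is a hedged sketch of what an adapter with per-level prefix heads could look like: the output is `num_levels * scale_dim` wide, and each hierarchy level is classified from its prefix. This illustrates the idea only; the class name, head design, and loss are assumptions, not the repo's actual `train.py` architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FractalAdapter(nn.Module):
    """Sketch: --num-levels=2, --scale-dim=64 gives a 128-dim output whose
    first 64 dims must separate systems and whose full 128 dims lineages."""
    def __init__(self, in_dim=128, num_levels=2, scale_dim=64, n_classes=(8, 17)):
        super().__init__()
        self.scale_dim = scale_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, num_levels * scale_dim),
        )
        # One classification head per prefix: level i sees dims [: (i+1)*scale_dim]
        self.heads = nn.ModuleList(
            nn.Linear((i + 1) * scale_dim, c) for i, c in enumerate(n_classes)
        )

    def forward(self, x, labels=None):
        z = self.net(x)
        if labels is None:
            return z
        # Sum cross-entropy over prefixes so each prefix separates its own level
        loss = sum(
            F.cross_entropy(head(z[:, : (i + 1) * self.scale_dim]), labels[:, i])
            for i, head in enumerate(self.heads)
        )
        return z, loss
```

Because each coarser level is supervised only through a shorter prefix, truncating the trained embedding recovers the coarser view rather than a blurrier version of the fine one.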
The fractal adapter works well. Across all three configurations, it consistently improves kNN accuracy over the original SCimilarity embeddings at every level of the hierarchy, and it does so while learning a genuinely hierarchical structure in the embedding space.
- **Improves on the original embeddings across the board.** All three fractal models match or beat the SCimilarity baselines at every level, with the biggest gains (+5 points) at the finer type and subtype levels.
- **Adding more levels doesn't degrade coarse accuracy.** System and lineage accuracy are virtually identical whether the model learns 2, 3, or 4 levels. The fractal structure adds resolution without sacrificing the broad strokes.
- **Steerability increases with depth.** Truncating the embedding naturally shifts similarity toward coarser categories, and this effect strengthens as more levels are added. The 4-level model shows the clearest prefix specialization.
- **32d per scale is sufficient.** The 3L model (32d/scale, 96d total) matches the 2L model (64d/scale, 128d total) on system and lineage, suggesting each hierarchy level compresses well into 32 dimensions.
All models trained on 100k stratified samples, evaluated on 5k held-out samples with 5-NN accuracy.
**2-level model:**

| Embedding | Dims | System | Lineage | Sil(L0) |
|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.538 |
| Fractal (prefix 1) | 64 | 0.944 | 0.926 | -- |
**3-level model:**

| Embedding | Dims | System | Lineage | Type | Sil(L0) |
|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.311 |
| Fractal (full) | 96 | 0.943 | 0.927 | 0.821 | 0.249 |
| Fractal (prefix 1) | 32 | 0.944 | 0.923 | 0.801 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.927 | 0.821 | -- |
**4-level model:**

| Embedding | Dims | System | Lineage | Type | Subtype | Sil(L0) |
|---|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.749 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.822 | 0.800 | 0.185 |
| Fractal (prefix 1) | 32 | 0.941 | 0.917 | 0.787 | 0.762 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.926 | 0.818 | 0.797 | -- |
| Fractal (prefix 3) | 96 | 0.944 | 0.927 | 0.821 | 0.799 | -- |
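The 5-NN evaluation, including the prefix truncation used for the "Fractal (prefix k)" rows, can be sketched in numpy. This is a minimal version that assumes cosine similarity and majority voting; the repo's `eval.py` may differ in details:

```python
import numpy as np

def knn_accuracy(train_x, train_y, test_x, test_y, k=5, prefix_dims=None):
    """k-NN majority-vote accuracy on cosine similarity, optionally using
    only a prefix of the embedding dimensions (the fractal truncation)."""
    if prefix_dims is not None:
        train_x, test_x = train_x[:, :prefix_dims], test_x[:, :prefix_dims]
    # L2-normalize so a dot product equals cosine similarity
    tn = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    qn = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    neighbors = np.argsort(-(qn @ tn.T), axis=1)[:, :k]
    correct = 0
    for row, y in zip(neighbors, test_y):
        values, counts = np.unique(train_y[row], return_counts=True)
        correct += values[np.argmax(counts)] == y
    return correct / len(test_y)
```

Passing `prefix_dims=32` (or 64, 96) reproduces the prefix rows above: the same trained embedding is simply cut off before the kNN search.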