Embeddings generated by AI models are typically Superimposed Embeddings (Multimodal/Fixed-Size). The concepts they represent overlap heavily, making meaningful interpretation or truncation impossible. Variable Resolution Embeddings (Multi-Scale/Adaptive), also known as Matryoshka Representation Learning (MRL), reorganize the embedding space so you can truncate while still retaining meaning. However, this is essentially a reduction in resolution (think of it like blurring an image). Devansh developed Fractal Embeddings which restructure the embedding space into a hierarchy. Truncation now allows moving up to a higher conceptual level while preserving full resolution and accuracy at that level. This repository explores applying Fractal Embeddings to embeddings derived from biology.
TL;DR: The fractal adapter works well, improving accuracy over the original embeddings while learning a hierarchical structure grounded in biology.
A primary focus of foundation models in biology is single-cell RNA sequencing (scRNA-seq) data. These datasets typically represent each cell as a high-dimensional vector (~20,000 dimensions), where each entry corresponds to the expression level of a gene as a proxy for protein abundance.
Genentech has released SCimilarity, which includes:
- A reference dataset of approximately 23 million single-cell profiles
- A single-cell foundation model (scFM) that produces 128-dimensional embeddings per cell
The model was trained on a labeled subset of 7.9 million cells from 56 studies, annotated with Cell Ontology (CL) terms covering 203 unique cell types. These hierarchical CL annotations (based on "is-a" relationships) provide a natural structure that can be leveraged to construct hierarchies for fractal embeddings.
The complete dataset, including all 23 million embeddings, is available in TileDB format. The labeled training subset is also provided with:
- An hnswlib kNN index for efficient similarity search
- A corresponding reference labels file
All of these resources—along with the SCimilarity model and an IVFPQ implementation—have been integrated into a fully client-side web application called CytoVerse. CytoVerse enables in-browser exploration, similarity search, and visualization of cells in embedding space.
The Cell Ontology (CL) is a Directed Acyclic Graph (DAG) with ~3,200 cell type terms connected by is_a relationships. Our 203 cell types sit at varying depths in this DAG. To create a strict 4-level tree suitable for hierarchical loss training, we:
1. **Map names to CL IDs.** Parse `cl-basic.obo` with `obonet` and match each of the 203 cell type names to their CL identifiers via canonical names and synonyms. 201/203 match automatically; "native cell" and "animal cell" are obsolete terms requiring manual overrides.
2. **Define anchor nodes** at Level 1 (System) and Level 2 (Lineage) as fixed snap-points in the ontology:
   - Level 1 (System): Immune (`CL:0000988`), Epithelial (`CL:0000066`), Neural (`CL:0002319`), Endothelial (`CL:0000115`), Stromal (`CL:0002320`), Muscle (`CL:0000187`), with Stem/Progenitor (`CL:0000034`) as a fallback
   - Level 2 (Lineage): Lymphoid, Myeloid, Innate Lymphoid, Neuron, Glial, Fibroblast, Stromal, Adipocyte, Smooth/Cardiac/Skeletal Muscle
3. **Traverse the DAG** for each cell type using a two-pass anchor matching strategy:
   - For each CL ID, find all ontology ancestors via `nx.descendants()` (obonet edges point child-to-parent)
   - Pass 1: find the nearest primary Level 1 anchor by shortest path length
   - Pass 2: only if no primary anchor is reachable, fall back to Stem/Progenitor. This avoids the "stem cell trap" where many differentiated cells (monocytes, NK cells, etc.) have short paths to `stem cell` via `progenitor cell`
   - Level 3 (Type): the immediate parent on the shortest path from the leaf to its Level 2 anchor
   - Level 4 (Subtype): the original 203 cell type names
4. **Export** all 7,913,892 embeddings from the hnswlib kNN annotation index with their 4-level hierarchy labels as a single Parquet file with dictionary-encoded categorical columns.
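The two-pass anchor matching in step 3 can be sketched with `networkx`. This is a minimal illustration, not the repo's actual code: the function name is hypothetical, the anchor IDs are the ones listed above, and in practice the graph would come from `obonet.read_obo("cl-basic.obo")`.

```python
import networkx as nx

# Level 1 (System) anchor IDs from the ontology, as listed above
PRIMARY_ANCHORS = {
    "CL:0000988": "Immune", "CL:0000066": "Epithelial", "CL:0002319": "Neural",
    "CL:0000115": "Endothelial", "CL:0002320": "Stromal", "CL:0000187": "Muscle",
}
FALLBACK_ID, FALLBACK_NAME = "CL:0000034", "Stem/Progenitor"

def level1_system(graph: nx.MultiDiGraph, cl_id: str) -> str:
    """Assign a Level 1 system via two-pass anchor matching.
    obonet edges run child -> parent, so nx.descendants() yields ancestors."""
    ancestors = nx.descendants(graph, cl_id)
    # Pass 1: nearest primary anchor by shortest path length
    hits = [(nx.shortest_path_length(graph, cl_id, anchor), name)
            for anchor, name in PRIMARY_ANCHORS.items() if anchor in ancestors]
    if hits:
        return min(hits)[1]
    # Pass 2: only now consider Stem/Progenitor -- avoids the "stem cell trap"
    return FALLBACK_NAME if FALLBACK_ID in ancestors else "Other"
```

Because the fallback is checked only when no primary anchor is reachable, a monocyte with a short path to `stem cell` via `progenitor cell` still lands in Immune.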
Initialize submodules, create a virtual env, and install Python dependencies:
git submodule update --init --recursive
uv venv
source .venv/bin/activate
uv sync
Create a ./data/ folder, then download and unpack the SCimilarity model and dataset (~30 GB) into data/models/scimilarity/model_v1.1.
Convert the SCimilarity embeddings and flat labels into a 4-level hierarchy:
uv run python ingest.py
Output: data/scimilarity_embeddings.parquet (3.80 GB)
| Column | Type | Description | Examples |
|---|---|---|---|
| `embedding` | `fixed_size_list<float32>[128]` | 128-dim SCimilarity embedding | |
| `level1_system` | `dictionary<string>` | 8 categories | Immune, Neural, Epithelial, Muscle |
| `level2_lineage` | `dictionary<string>` | 17 categories | Myeloid, Lymphoid, Neuron, Fibroblast |
| `level3_type` | `dictionary<string>` | ~120 categories | monocyte, glutamatergic neuron, cardiac muscle cell |
| `level4_subtype` | `dictionary<string>` | 203 categories | classical monocyte, CD8-positive T cell, astrocyte |
7,913,892 cells across 203 cell types, mapped to 8 Level 1 systems:
| System | Cells | % |
|---|---|---|
| Immune | 4,308,236 | 54.4% |
| Epithelial | 956,129 | 12.1% |
| Neural | 635,314 | 8.0% |
| Stromal | 544,481 | 6.9% |
| Other | 466,874 | 5.9% |
| Muscle | 396,372 | 5.0% |
| Endothelial | 385,952 | 4.9% |
| Stem/Progenitor | 220,534 | 2.8% |
Pre-stratify the parquet for balanced sequential reads (round-robin interleave by L4 subtype):
uv run python stratify.py
Output: data/stratified.parquet -- same data, reordered so that taking the first N rows gives balanced representation across all 4 hierarchy levels.
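The round-robin interleave can be sketched as follows. This is a simplified in-memory version of the idea, assuming a list of per-row labels; `stratify.py` itself may differ:

```python
from collections import defaultdict
from itertools import zip_longest

def round_robin_order(labels):
    """Return row indices interleaved round-robin by label, so any prefix
    of the reordered rows covers all labels as evenly as possible."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    # Take one index from each group per round; drop the padding Nones
    rounds = zip_longest(*groups.values())
    return [i for r in rounds for i in r if i is not None]
```

Applied to the L4 subtype column, taking the first N rows of the reordered table then samples every subtype before repeating any, which also balances the coarser levels above it.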
Train fractal adapters at different hierarchy depths. The --num-levels flag controls how many levels the model learns to separate, and --scale-dim sets the embedding dimension per level:
uv run python train.py --num-levels 2 --scale-dim 64 --max-samples 100000 --epochs 20
uv run python eval.py --model-path data/fractal_adapter_2L.pt
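To make the two flags concrete, here is a hedged sketch of what an adapter with per-level prefix heads could look like: the output is `num_levels * scale_dim` wide, and each hierarchy level is classified from its prefix. This illustrates the idea only; the class name, head design, and loss are assumptions, not the repo's actual `train.py` architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FractalAdapter(nn.Module):
    """Sketch: --num-levels=2, --scale-dim=64 gives a 128-dim output whose
    first 64 dims must separate systems and whose full 128 dims lineages."""
    def __init__(self, in_dim=128, num_levels=2, scale_dim=64, n_classes=(8, 17)):
        super().__init__()
        self.scale_dim = scale_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, num_levels * scale_dim),
        )
        # One classification head per prefix: level i sees dims [: (i+1)*scale_dim]
        self.heads = nn.ModuleList(
            nn.Linear((i + 1) * scale_dim, c) for i, c in enumerate(n_classes)
        )

    def forward(self, x, labels=None):
        z = self.net(x)
        if labels is None:
            return z
        # Sum cross-entropy over prefixes so each prefix separates its own level
        loss = sum(
            F.cross_entropy(head(z[:, : (i + 1) * self.scale_dim]), labels[:, i])
            for i, head in enumerate(self.heads)
        )
        return z, loss
```

Because each coarser level is supervised only through a shorter prefix, truncating the trained embedding recovers the coarser view rather than a blurrier version of the fine one.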
The fractal adapter works well. Across all three configurations, it consistently improves kNN accuracy over the original SCimilarity embeddings at every level of the hierarchy, and it does so while learning a genuinely hierarchical structure in the embedding space.
- **Improves on the original embeddings across the board.** All three fractal models match or beat the SCimilarity baselines at every level, with the biggest gains (+5 points) at the finer type and subtype levels.
- **Adding more levels doesn't degrade coarse accuracy.** System and lineage accuracy are virtually identical whether the model learns 2, 3, or 4 levels. The fractal structure adds resolution without sacrificing the broad strokes.
- **Steerability increases with depth.** Truncating the embedding naturally shifts similarity toward coarser categories, and this effect strengthens as more levels are added. The 4-level model shows the clearest prefix specialization.
- **32d per scale is sufficient.** The 3L model (32d/scale, 96d total) matches the 2L model (64d/scale, 128d total) on system and lineage, suggesting each hierarchy level compresses well into 32 dimensions.
All models trained on 100k stratified samples, evaluated on 5k held-out samples with 5-NN accuracy.
**2-level model:**

| Embedding | Dims | System | Lineage | Sil(L0) |
|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.538 |
| Fractal (prefix 1) | 64 | 0.944 | 0.926 | -- |
**3-level model:**

| Embedding | Dims | System | Lineage | Type | Sil(L0) |
|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.311 |
| Fractal (full) | 96 | 0.943 | 0.927 | 0.821 | 0.249 |
| Fractal (prefix 1) | 32 | 0.944 | 0.923 | 0.801 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.927 | 0.821 | -- |
**4-level model:**

| Embedding | Dims | System | Lineage | Type | Subtype | Sil(L0) |
|---|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.749 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.822 | 0.800 | 0.185 |
| Fractal (prefix 1) | 32 | 0.941 | 0.917 | 0.787 | 0.762 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.926 | 0.818 | 0.797 | -- |
| Fractal (prefix 3) | 96 | 0.944 | 0.927 | 0.821 | 0.799 | -- |
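The 5-NN evaluation, including the prefix truncation used for the "Fractal (prefix k)" rows, can be sketched in numpy. This is a minimal version that assumes cosine similarity and majority voting; the repo's `eval.py` may differ in details:

```python
import numpy as np

def knn_accuracy(train_x, train_y, test_x, test_y, k=5, prefix_dims=None):
    """k-NN majority-vote accuracy on cosine similarity, optionally using
    only a prefix of the embedding dimensions (the fractal truncation)."""
    if prefix_dims is not None:
        train_x, test_x = train_x[:, :prefix_dims], test_x[:, :prefix_dims]
    # L2-normalize so a dot product equals cosine similarity
    tn = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    qn = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    neighbors = np.argsort(-(qn @ tn.T), axis=1)[:, :k]
    correct = 0
    for row, y in zip(neighbors, test_y):
        values, counts = np.unique(train_y[row], return_counts=True)
        correct += values[np.argmax(counts)] == y
    return correct / len(test_y)
```

Passing `prefix_dims=32` (or 64, 96) reproduces the prefix rows above: the same trained embedding is simply cut off before the kNN search.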