rcurrie/frac-bio-embed

Fractal Biological Embeddings

Embeddings generated by AI models are typically Superimposed Embeddings (Multimodal/Fixed-Size): the concepts they represent overlap heavily, making meaningful interpretation or truncation impossible. Variable Resolution Embeddings (Multi-Scale/Adaptive), also known as Matryoshka Representation Learning (MRL), reorganize the embedding space so you can truncate while still retaining meaning. However, this is essentially a reduction in resolution (think of it like blurring an image). Devansh developed Fractal Embeddings, which restructure the embedding space into a hierarchy: truncation now moves the representation up to a higher conceptual level while preserving full resolution and accuracy at that level. This repository explores applying Fractal Embeddings to embeddings derived from biology.

TL;DR: The fractal adapter works well, improving accuracy over the original embeddings while learning a hierarchical structure grounded in biology.

SCimilarity Dataset

A primary focus of foundation models in biology is single-cell RNA sequencing (scRNA-seq) data. These datasets typically represent each cell as a high-dimensional vector (~20,000 dimensions), where each entry corresponds to the expression level of a gene as a proxy for protein abundance.
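To make the data representation concrete, here is a minimal sketch of a single cell as a ~20,000-dimensional expression vector with typical normalization applied before a model sees it. The count distribution, normalization constant, and preprocessing steps are illustrative assumptions, not SCimilarity's actual pipeline:

```python
import numpy as np

# Hypothetical illustration: one cell as a ~20,000-dim vector, one entry
# per gene, with counts serving as a proxy for protein abundance.
rng = np.random.default_rng(0)
n_genes = 20_000
cell = rng.poisson(lam=0.3, size=n_genes).astype(np.float32)  # sparse-ish counts

# Common scRNA-seq preprocessing (assumed here, not SCimilarity-specific):
# library-size normalization followed by log1p.
cell_norm = np.log1p(cell / max(cell.sum(), 1) * 1e4)

print(cell_norm.shape)  # (20000,)
```

A foundation model like SCimilarity's scFM maps vectors of this shape down to a 128-dimensional embedding per cell.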

Genentech has released SCimilarity, which includes:

  • A reference dataset of approximately 23 million single-cell profiles
  • A single-cell foundation model (scFM) that produces 128-dimensional embeddings per cell

The model was trained on a labeled subset of 7.9 million cells from 56 studies, annotated with Cell Ontology (CL) terms covering 203 unique cell types. These hierarchical CL annotations (based on "is-a" relationships) provide a natural structure that can be leveraged to construct hierarchies for fractal embeddings.

The complete dataset, including all 23 million embeddings, is available in TileDB format. The labeled training subset is also provided with:

  • An hnswlib kNN index for efficient similarity search
  • A corresponding reference labels file

All of these resources—along with the SCimilarity model and an IVFPQ implementation—have been integrated into a fully client-side web application called CytoVerse. CytoVerse enables in-browser exploration, similarity search, and visualization of cells in embedding space.

Ingesting and Materializing a Hierarchy

The Cell Ontology (CL) is a Directed Acyclic Graph (DAG) with ~3,200 cell type terms connected by is_a relationships. Our 203 cell types sit at varying depths in this DAG. To create a strict 4-level tree suitable for hierarchical loss training, we:

  1. Map names to CL IDs -- Parse cl-basic.obo with obonet and match each of the 203 cell type names to their CL identifiers via canonical names and synonyms. 201/203 match automatically; "native cell" and "animal cell" are obsolete terms requiring manual overrides.

  2. Define anchor nodes at Level 1 (System) and Level 2 (Lineage) as fixed snap-points in the ontology:

    • Level 1 (System): Immune (CL:0000988), Epithelial (CL:0000066), Neural (CL:0002319), Endothelial (CL:0000115), Stromal (CL:0002320), Muscle (CL:0000187), with Stem/Progenitor (CL:0000034) as a fallback
    • Level 2 (Lineage): Lymphoid, Myeloid, Innate Lymphoid, Neuron, Glial, Fibroblast, Stromal, Adipocyte, Smooth/Cardiac/Skeletal Muscle
  3. Traverse the DAG for each cell type using a two-pass anchor matching strategy:

    • For each CL ID, find all ontology ancestors via nx.descendants() (obonet edges are child-to-parent)
    • Pass 1: Find the nearest primary Level 1 anchor by shortest path length
    • Pass 2: Only if no primary anchor is reachable, fall back to Stem/Progenitor. This avoids the "stem cell trap" where many differentiated cells (monocytes, NK cells, etc.) have short paths to stem cell via progenitor cell
    • Level 3 (Type): The immediate parent on the shortest path from the leaf to its Level 2 anchor
    • Level 4 (Subtype): The original 203 cell type names
  4. Export 7,913,892 embeddings from the hnswlib kNN annotation index with their 4-level hierarchy labels as a single Parquet file with dictionary-encoded categorical columns.
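The two-pass anchor matching above can be sketched with a toy child-to-parent DAG. The real pipeline parses cl-basic.obo with obonet and traverses it via networkx; the terms, anchors, and helper names below are illustrative stand-ins using only the standard library:

```python
from collections import deque

# Toy child->parent "is_a" DAG standing in for the Cell Ontology.
parents = {
    "classical monocyte": ["monocyte"],
    "monocyte": ["myeloid leukocyte"],
    "myeloid leukocyte": ["leukocyte", "progenitor cell"],
    "leukocyte": ["immune cell"],
    "progenitor cell": ["stem cell"],
    "immune cell": [],
    "stem cell": [],
}
PRIMARY_ANCHORS = {"immune cell"}   # stands in for the Level 1 systems
FALLBACK_ANCHOR = "stem cell"       # stands in for Stem/Progenitor (CL:0000034)

def shortest_paths_up(term):
    """BFS over is_a edges; returns {ancestor: distance from term}."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def assign_system(term):
    dist = shortest_paths_up(term)
    # Pass 1: nearest primary anchor by shortest path length.
    hits = [(d, a) for a, d in dist.items() if a in PRIMARY_ANCHORS]
    if hits:
        return min(hits)[1]
    # Pass 2: fall back to Stem/Progenitor only when no primary anchor is
    # reachable, avoiding the "stem cell trap" via the progenitor-cell shortcut.
    return FALLBACK_ANCHOR if FALLBACK_ANCHOR in dist else None

print(assign_system("classical monocyte"))  # immune cell
```

Note that the monocyte reaches both "stem cell" and "immune cell" at distance 4; pass 1 still assigns it to Immune because the fallback is only consulted when no primary anchor is reachable at all.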

Running

Install

Install Python dependencies and create a virtual environment:

git submodule update --init --recursive
uv venv
source .venv/bin/activate
uv sync

Create a ./data/ folder, then download and unpack the SCimilarity model and dataset (~30 GB) into data/models/scimilarity/model_v1.1.

Ingest

Convert the SCimilarity embeddings and flat labels into a 4-level hierarchy:

uv run python ingest.py

Output: data/scimilarity_embeddings.parquet (3.80 GB)

| Column | Type | Description | Examples |
|---|---|---|---|
| embedding | fixed_size_list&lt;float32&gt;[128] | 128-dim SCimilarity embedding | -- |
| level1_system | dictionary&lt;string&gt; | 8 categories | Immune, Neural, Epithelial, Muscle |
| level2_lineage | dictionary&lt;string&gt; | 17 categories | Myeloid, Lymphoid, Neuron, Fibroblast |
| level3_type | dictionary&lt;string&gt; | ~120 categories | monocyte, glutamatergic neuron, cardiac muscle cell |
| level4_subtype | dictionary&lt;string&gt; | 203 categories | classical monocyte, CD8-positive T cell, astrocyte |

7,913,892 cells across 203 cell types, mapped to 8 Level 1 systems:

| System | Cells | % |
|---|---:|---:|
| Immune | 4,308,236 | 54.4% |
| Epithelial | 956,129 | 12.1% |
| Neural | 635,314 | 8.0% |
| Stromal | 544,481 | 6.9% |
| Other | 466,874 | 5.9% |
| Muscle | 396,372 | 5.0% |
| Endothelial | 385,952 | 4.9% |
| Stem/Progenitor | 220,534 | 2.8% |

Stratify

Pre-stratify the parquet for balanced sequential reads (round-robin interleave by L4 subtype):

uv run python stratify.py

Output: data/stratified.parquet -- same data, reordered so that taking the first N rows gives balanced representation across all 4 hierarchy levels.
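The round-robin interleave can be sketched as follows. This is a standard-library illustration of the idea; stratify.py's actual implementation may differ (the function and variable names here are made up):

```python
from collections import defaultdict, deque

def round_robin(rows, key):
    """Interleave rows one-per-group so any prefix is roughly balanced."""
    buckets = defaultdict(deque)
    for row in rows:
        buckets[key(row)].append(row)
    out = []
    queues = deque(buckets.values())
    while queues:
        q = queues.popleft()
        out.append(q.popleft())
        if q:                    # re-queue groups that still have rows
            queues.append(q)
    return out

rows = [("T cell", 1), ("T cell", 2), ("T cell", 3), ("astrocyte", 4)]
print([r[0] for r in round_robin(rows, key=lambda r: r[0])])
# ['T cell', 'astrocyte', 'T cell', 'T cell']
```

Because every subtype contributes one row per cycle until it runs out, taking the first N rows of the output samples all 203 subtypes as evenly as their counts allow, which also balances the coarser levels above them.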

Train

Train fractal adapters at different hierarchy depths. The --num-levels flag controls how many levels the model learns to separate, and --scale-dim sets the embedding dimension per level:

uv run python train.py --num-levels 2 --scale-dim 64 --max-samples 100000 --epochs 20
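The objective behind --num-levels and --scale-dim can be sketched in Matryoshka style: the first k * scale_dim dimensions of the adapter output are scored against the level-k labels, and the per-level losses are summed. This is a hedged numpy sketch of that idea, not train.py's actual loss; the head shapes, class counts, and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
scale_dim, num_levels, batch = 32, 2, 8
z = rng.normal(size=(batch, num_levels * scale_dim))   # adapter output
# One linear classification head per level (8 systems, 17 lineages).
heads = [rng.normal(size=(k * scale_dim, n_classes))
         for k, n_classes in zip(range(1, num_levels + 1), (8, 17))]
targets = [rng.integers(0, n, size=batch) for n in (8, 17)]

def cross_entropy(logits, y):
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

total = 0.0
for k, (W, y) in enumerate(zip(heads, targets), start=1):
    prefix = z[:, : k * scale_dim]        # truncated embedding for level k
    total += cross_entropy(prefix @ W, y)
print(total > 0)  # True
```

Training every prefix against its own hierarchy level is what makes later truncation meaningful: the first scale_dim dimensions must separate systems on their own, the first 2 * scale_dim must separate lineages, and so on.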

Evaluate

uv run python eval.py --model-path data/fractal_adapter_2L.pt

Results

The fractal adapter is working well. Across all three configurations, it consistently improves kNN accuracy over the original SCimilarity embeddings at every level of the hierarchy, and it does so while learning a genuinely hierarchical structure in the embedding space.

Key Findings

Improves on the original embeddings across the board. All three fractal models match or beat the SCimilarity baselines at every level, with the biggest gains (+5 points) at the finer type and subtype levels.

Adding levels doesn't degrade coarse accuracy. System and lineage accuracy are virtually identical whether the model learns 2, 3, or 4 levels. The fractal structure adds resolution without sacrificing the broad strokes.

Steerability increases with depth. Truncating the embedding naturally shifts similarity toward coarser categories, and this effect strengthens as more levels are added. The 4-level model shows the clearest prefix specialization.

32d per scale is sufficient. The 3L model (32d/scale, 96d total) matches the 2L model (64d/scale, 128d total) on system and lineage, suggesting each hierarchy level compresses well into 32 dimensions.

Summary Table

All models trained on 100k stratified samples, evaluated on 5k held-out samples with 5-NN accuracy.

2-Level Model (2 x 64d = 128d) -- System + Lineage

| Embedding | Dims | System | Lineage | Sil(L0) |
|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.538 |
| Fractal (prefix 1) | 64 | 0.944 | 0.926 | -- |

3-Level Model (3 x 32d = 96d) -- System + Lineage + Type

| Embedding | Dims | System | Lineage | Type | Sil(L0) |
|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.311 |
| Fractal (full) | 96 | 0.943 | 0.927 | 0.821 | 0.249 |
| Fractal (prefix 1) | 32 | 0.944 | 0.923 | 0.801 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.927 | 0.821 | -- |

4-Level Model (4 x 32d = 128d) -- System + Lineage + Type + Subtype

| Embedding | Dims | System | Lineage | Type | Subtype | Sil(L0) |
|---|---|---|---|---|---|---|
| Original | 128 | 0.938 | 0.916 | 0.773 | 0.749 | 0.311 |
| Fractal (full) | 128 | 0.943 | 0.927 | 0.822 | 0.800 | 0.185 |
| Fractal (prefix 1) | 32 | 0.941 | 0.917 | 0.787 | 0.762 | -- |
| Fractal (prefix 2) | 64 | 0.943 | 0.926 | 0.818 | 0.797 | -- |
| Fractal (prefix 3) | 96 | 0.944 | 0.927 | 0.821 | 0.799 | -- |

About

Explore fractal embeddings applied to scRNA-seq embeddings
