
fix(embeddings): make CUDA batch size configurable for small GPUs #368

Merged: ASuresh0524 merged 2 commits into main from fix/cuda-embedding-batch-size-344 on Jun 8, 2026

Conversation

@ASuresh0524

Collaborator

Fixes #344. The LEANN_CUDA_BATCH_SIZE, LEANN_MPS_BATCH_SIZE, and LEANN_CPU_THREADS environment variables and the --embedding-batch-size CLI flag now override the hardcoded defaults. The CUDA batch size is also capped by free VRAM after model load and halved on each OOM retry.

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)

Fixes #344. LEANN_CUDA_BATCH_SIZE / LEANN_MPS_BATCH_SIZE / LEANN_CPU_THREADS
env vars and --embedding-batch-size CLI override hardcoded defaults; cap CUDA
batch by free VRAM after model load and halve on OOM retry.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ArtifexSystems

Tested this on the exact small-GPU setup from #344 and it works — it turns the OOM into a clean build. 👍

Setup: NVIDIA RTX A1000 Laptop GPU (4 GB), WSL2, leann-core 0.3.7, torch 2.12.0+cu130, sentence-transformers 5.5.1, model BAAI/bge-base-en-v1.5. Ran the PR head (80e83da) against a realistic corpus of 384 × 512-token chunks.

Results

  • Env / CLI override — unset → 256 (prior default preserved); LEANN_CUDA_BATCH_SIZE=24 → 24; an invalid value warns and falls back; LEANN_CPU_THREADS=16 honored. ✅
  • VRAM cap — at ~3.2 GiB free, _cap_cuda_batch_by_vram(256) → 192; LEANN_CUDA_AUTO_BATCH=0 opts out → 256. ✅
  • End-to-end default path (compute_embeddings(..., is_build=True), no env set): the batch was auto-capped 256 → 153, which still OOM'd; the halving retry then dropped it to 76, which succeeded with output shape (384, 768). A hard torch.OutOfMemoryError became a clean build with zero manual tuning. ✅
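
For readers who want to see the shape of the behavior tested above, here is a minimal sketch of an environment override with a warn-and-fall-back path, plus a halving retry. It is not the code from this PR: `batch_size_from_env`, `encode_with_halving`, and the `is_oom` hook are hypothetical names, and `MemoryError` stands in for the GPU out-of-memory exception so the sketch runs without a GPU.

```python
import logging
import os

logger = logging.getLogger(__name__)


def batch_size_from_env(name, default):
    """Return a positive int read from env var `name`, else `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # unset: keep the prior default
    try:
        value = int(raw)
        if value <= 0:
            raise ValueError("batch size must be positive")
    except ValueError:
        # Invalid value: warn and fall back rather than crash.
        logger.warning("Ignoring invalid %s=%r; using %d", name, raw, default)
        return default
    return value


def encode_with_halving(encode, texts, batch_size, is_oom=None):
    """Call encode(texts, batch_size), halving batch_size after each OOM.

    Returns (result, batch_size_that_worked). `is_oom` decides whether an
    exception counts as out-of-memory; MemoryError is a stand-in here.
    """
    if is_oom is None:
        is_oom = lambda exc: isinstance(exc, MemoryError)
    while True:
        try:
            return encode(texts, batch_size), batch_size
        except Exception as exc:
            if not is_oom(exc) or batch_size <= 1:
                raise  # not an OOM, or nothing left to halve
            batch_size //= 2
```

With PyTorch, `is_oom` could instead check `isinstance(exc, torch.OutOfMemoryError)`, the exception named in the result above. Starting from 153, the first halving gives 153 // 2 = 76, which matches the batch size the run landed on.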

Two small notes from the run, in case they're useful:

  1. The VRAM heuristic is a touch optimistic for ≤4 GB. 0.35 × free ÷ ~6 MB/seq lands around 150–210 on a 4 GB card, which still OOMs here — the halving retry is what actually makes it land (at 76). The retry makes it robust regardless, but a more conservative factor (or sizing against the eager-attention batch × heads × seq² peak, which dominates the footprint) would let the common path succeed without leaning on the retry.
  2. The cap runs even when the batch size was set explicitly. _cap_cuda_batch_by_vram applies whenever device == "cuda", so a user who deliberately passes --embedding-batch-size N (which sets adaptive_optimization=False) still gets it reduced unless they also set LEANN_CUDA_AUTO_BATCH=0. Might be friendlier to skip the auto-cap when the value came from an explicit flag.
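
To make the arithmetic in the two notes concrete, here is a rough sketch under stated assumptions. The 0.35 factor and the ~6 MB/seq figure come from note 1; the `explicit` switch illustrates the suggestion in note 2. The function names are hypothetical and this is not the project's `_cap_cuda_batch_by_vram`. The attention estimate assumes 12 heads and 4-byte (fp32) scores, which is typical for a BERT-base-sized encoder but is an assumption here.

```python
def cap_batch_by_free_vram(requested, free_bytes, *, fraction=0.35,
                           bytes_per_seq=6 * 1024**2, explicit=False):
    """Cap `requested` so the batch fits in `fraction` of free VRAM.

    If `explicit` is True (the user chose the batch size on purpose),
    the requested value is returned unchanged.
    """
    if explicit:
        return requested
    budget = int(free_bytes * fraction)
    return max(1, min(requested, budget // bytes_per_seq))


def attention_scores_bytes(batch, heads, seq_len, dtype_bytes=4):
    """Size of one layer's attention score tensor: batch x heads x seq^2."""
    return batch * heads * seq_len * seq_len * dtype_bytes


# One 512-token sequence with 12 heads in fp32 needs 12 MiB for a single
# layer's attention scores, twice the ~6 MB/seq the heuristic budgets.
per_seq = attention_scores_bytes(1, heads=12, seq_len=512)
```

Under these assumptions the score tensor for one sequence already exceeds the per-sequence budget, which is consistent with the heuristic's cap still running out of memory on a 4 GB card.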

Neither blocks anything — this is a complete fix for #344: env/CLI override and VRAM autoscale + OOM retry, with no new dependency. Once it merges I'll drop our local monkey-patch in favor of these knobs. Thanks @ASuresh0524!

@ASuresh0524

Collaborator Author

Awesome, thanks @ArtifexSystems. That helps a lot! Will incorporate these two small changes.

Address PR #368 review: use a more conservative seq^2-based VRAM estimate
so small GPUs land near batch 76 without relying on OOM retry, and only
auto-cap when batch size comes from adaptive defaults (not --embedding-batch-size).

Co-authored-by: Cursor <cursoragent@cursor.com>
ASuresh0524 removed the request for review from yichuan-w on Jun 8, 2026, 19:26
ASuresh0524 merged commit 0ddc20f into main on Jun 8, 2026
31 checks passed

Development

Successfully merging this pull request may close these issues.

CUDA embedding batch size hardcoded to 256 — OOMs on small GPUs with no override

2 participants