fix(index): fail fast when FTS posting pipeline would deadlock by hfutatzhanghb · Pull Request #7350 · lance-format/lance

hfutatzhanghb · 2026-06-18T02:27:55Z

Summary

Reject FTS inverted index builds early when fewer than 2 lance-cpu blocking threads are available, instead of hanging silently after logging "writing N posting lists".
Return a descriptive invalid_input error that explains the deadlock, mentions LANCE_CPU_THREADS, and suggests setting it to at least 2.
Add a tokio::sync::Semaphore that limits concurrent write_posting_lists_pipelined calls to available_cpu_threads - 1, ensuring at least one spawn_cpu thread is always free for the consumer's nested page encoding inside FileWriter::write_batch. This prevents the concurrent-worker deadlock when num_workers equals all available CPU threads.
Move the CPU-thread guard from InvertedIndexBuilder::update_index into InnerBuilder::write_posting_lists so that updates that never write posting lists (e.g., empty new_data with only deleted-fragment metadata) are not rejected.
Add regression tests for the error message, for the concurrent-worker semaphore serialization, for semaphore permit scaling, and for the empty-update path.
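The early guard described in the summary could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `ensure_posting_pipeline_threads`, the `String` error type, and the message wording are assumptions. Only the threshold constant and the `LANCE_CPU_THREADS` variable come from the PR text.

```rust
/// Threshold named in the PR diff: one thread for the producer plus one for
/// the consumer's nested page encoding.
const MIN_CPU_THREADS_FOR_FTS_POSTING_PIPELINE: usize = 2;

/// Hypothetical guard. The name, signature, and message text are illustrative
/// only; the real check in lance-index may differ.
fn ensure_posting_pipeline_threads(available_cpu_threads: usize) -> Result<(), String> {
    if available_cpu_threads < MIN_CPU_THREADS_FOR_FTS_POSTING_PIPELINE {
        // Name the environment variable so the user knows how to fix it.
        return Err(format!(
            "FTS posting pipeline would deadlock: it needs at least {} lance-cpu \
             threads but only {} are available; set LANCE_CPU_THREADS to at least {}",
            MIN_CPU_THREADS_FOR_FTS_POSTING_PIPELINE,
            available_cpu_threads,
            MIN_CPU_THREADS_FOR_FTS_POSTING_PIPELINE
        ));
    }
    Ok(())
}
```

Calling such a guard only on the path that actually writes posting lists matches the intent of the fourth summary item: an update that writes nothing is never rejected.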

Root cause

write_posting_lists runs batch encoding on spawn_cpu while FileWriter::write_batch also submits column page encoding via spawn_cpu. With only one blocking thread, the producer blocks on the bounded channel while the consumer waits for nested encoding, causing a deadlock with no further log output.

With ≥2 threads but num_workers equal to all available threads, the producers occupy every spawn_cpu thread and the consumers deadlock in the same way. The semaphore prevents this by capping flush-phase concurrency at available_cpu_threads - 1.
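The gating idea can be illustrated with a small std-only simulation. It is a sketch under stated assumptions: the PR uses tokio::sync::Semaphore, while this stand-in builds a counting semaphore from `Mutex` and `Condvar`, and `peak_concurrent_writers` is a hypothetical helper that only measures how many simulated writers hold a permit at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (stand-in for tokio::sync::Semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Block until a permit is available, then take it.
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `num_workers` simulated writers and report the highest number that
/// held a permit at the same time.
fn peak_concurrent_writers(num_workers: usize, permits: usize) -> usize {
    let sem = Arc::new(Semaphore::new(permits));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..num_workers)
        .map(|_| {
            let (sem, active, peak) = (sem.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // pretend to write
                active.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With 2 workers and 1 permit the peak is always 1, which mirrors the LANCE_CPU_THREADS=2 case: the writers run one at a time and one thread stays free for nested encoding.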

Test plan

cargo test -p lance-index test_fts_posting_pipeline_cpu_threads_error_message
cargo test -p lance-index test_fts_posting_pipeline_write_posting_lists_deadlocks_with_one_cpu_thread
cargo test -p lance-index test_empty_update_with_one_cpu_thread_records_deleted_fragments
LANCE_CPU_THREADS=2 cargo test -p lance-index test_fts_posting_semaphore_serializes_with_two_cpu_threads
LANCE_CPU_THREADS=4 cargo test -p lance-index test_fts_posting_semaphore_permits_scale_with_threads
cargo fmt --all
cargo clippy -p lance-index --lib --tests -- -D warnings

When only one lance-cpu blocking thread is available, the pipelined FTS posting-list writer deadlocks silently after logging "writing N posting lists". Reject the build early with a descriptive error and add regression tests that reproduce the deadlock in a child process.

Co-authored-by: Cursor <cursoragent@cursor.com>

hfutatzhanghb · 2026-06-18T02:37:43Z

@Xuanwo @BubbleCal @yanghua Hi, please help review this PR when you have free time. Thanks very much!

codecov · 2026-06-18T03:25:47Z

Codecov Report

❌ Patch coverage is 93.33333% with 12 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/inverted/builder.rs	93.33%	8 Missing and 4 partials ⚠️

hfutatzhanghb · 2026-06-18T03:40:56Z

@claude review this pr.

BubbleCal

Requesting changes for two correctness issues in the CPU-thread deadlock guard.

BubbleCal · 2026-06-18T06:31:31Z

+/// also submits column page encoding via `spawn_cpu`. With only one `lance-cpu` blocking
+/// thread the producer blocks on the bounded channel while the writer waits for encoding,
+/// which deadlocks with no further log output.
+const MIN_CPU_THREADS_FOR_FTS_POSTING_PIPELINE: usize = 2;


A fixed threshold of 2 threads does not cover concurrent worker flushes. resolve_num_workers can still choose all available CPU threads, so with LANCE_CPU_THREADS=2 two workers can each occupy a spawn_cpu producer and then both writers wait for nested page-encoding spawn_cpu work with no free thread. This should reserve capacity for nested encoding or limit the number of concurrent posting writers/workers.

hfutatzhanghb · 2026-06-19

@BubbleCal Thanks for the review. I addressed the threshold concern with a tokio::sync::Semaphore:

Fix: Added fts_posting_write_semaphore() with available_cpu_threads - 1 permits (at least 1). write_posting_lists now acquires a permit before calling write_posting_lists_pipelined, so at most available - 1 producers hold spawn_cpu threads concurrently, always leaving one free for the consumer's nested page encoding.

This handles the concurrent-worker case correctly:

LANCE_CPU_THREADS=2, num_workers=2: semaphore has 1 permit → only 1 writer runs at a time → no deadlock.

LANCE_CPU_THREADS=4, num_workers=4: semaphore has 3 permits → up to 3 writers concurrently, 1 thread remains free for consumers.

Two new tests verify the semaphore behavior in child processes with LANCE_CPU_THREADS=2 and LANCE_CPU_THREADS=4.
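The permit arithmetic in the two cases above can be written down directly. This is a minimal sketch: `posting_write_permits` and `threads_left_for_consumers` are hypothetical names, and the formula is taken from the reply's description ("available_cpu_threads - 1 permits (at least 1)"), not from the PR's source.

```rust
/// Hypothetical helper: permits for concurrent posting-list writers,
/// i.e. available_cpu_threads - 1, but never fewer than 1.
fn posting_write_permits(available_cpu_threads: usize) -> usize {
    available_cpu_threads.saturating_sub(1).max(1)
}

/// Threads that remain free for the consumer's nested page encoding when
/// every permit is held by a producer.
fn threads_left_for_consumers(available_cpu_threads: usize) -> usize {
    available_cpu_threads.saturating_sub(posting_write_permits(available_cpu_threads))
}
```

For 2 threads this gives 1 permit and 1 free thread; for 4 threads, 3 permits and 1 free thread. With a single thread the floor of 1 permit leaves no free thread, which is why the semaphore alone does not cover that configuration.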

…dlock Add a tokio::sync::Semaphore that limits concurrent write_posting_lists_pipelined calls to available_cpu_threads - 1, ensuring at least one spawn_cpu thread is always free for the consumer's nested page encoding inside FileWriter::write_batch. Previously only an early guard rejected configurations with fewer than 2 lance-cpu threads, but with LANCE_CPU_THREADS=2 and num_workers=2, both workers could occupy both spawn_cpu threads as producers, deadlocking when consumers needed a thread for page encoding. The semaphore fixes the concurrent-worker deadlock case while preserving the early guard for hopeless single-thread configs.

Xuanwo

The new semaphore addresses the multi-worker flush case, but the one-CPU guard is still too broad. write_posting_lists calls check_fts_posting_pipeline_cpu_threads() before it knows whether the write can actually deadlock, so even a single-batch posting-list write is rejected under LANCE_CPU_THREADS=1.

A single-batch producer can send the only batch and release the sole spawn_cpu thread before the consumer encodes pages, so this rejects small valid FTS builds. In the review run, LANCE_CPU_THREADS=1 cargo test -p lance-index scalar::inverted::builder::tests::test_write_posting_lists_batches_multiple_rows -- --exact --nocapture failed from this guard.

Please either use a serial/non-pipelined fallback when only one CPU thread is available, or restructure the producer so it does not block inside the sole spawn_cpu thread, rather than rejecting all small FTS builds in that configuration.
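One way to read the first suggestion is as a mode selection made before any write starts. The sketch below is purely illustrative and is not what the PR implements: `PostingWriteMode` and `choose_posting_write_mode` are hypothetical names, and the pipelined limit reuses the available_cpu_threads - 1 rule from the semaphore change.

```rust
/// Hypothetical write strategies for posting lists.
#[derive(Debug, PartialEq, Eq)]
enum PostingWriteMode {
    /// Encode and write on the calling task, with no producer/consumer split,
    /// so nothing can wait on a second CPU thread.
    Serial,
    /// Pipelined write with a cap on concurrent producers.
    Pipelined { max_concurrent_writers: usize },
}

/// Pick a strategy from the number of lance-cpu blocking threads instead of
/// rejecting single-thread configurations outright.
fn choose_posting_write_mode(available_cpu_threads: usize) -> PostingWriteMode {
    if available_cpu_threads < 2 {
        PostingWriteMode::Serial
    } else {
        PostingWriteMode::Pipelined {
            // Leave one thread free for the consumer's nested page encoding.
            max_concurrent_writers: available_cpu_threads - 1,
        }
    }
}
```

Under this assumption a LANCE_CPU_THREADS=1 build would take the slower serial path and still succeed, so small valid FTS builds are not rejected.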

github-actions bot added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels on Jun 18, 2026

Merge branch 'main' into fix/index-fts-posting-pipeline-deadlock-guard

9a277db

BubbleCal requested changes Jun 18, 2026

zhanghaobo@kanzhun.com added 2 commits June 18, 2026 16:07

fix(index): scope FTS CPU guard to posting writes

0a3980d

Xuanwo requested changes Jun 19, 2026

hfutatzhanghb wants to merge 4 commits into lance-format:main from hfutatzhanghb:fix/index-fts-posting-pipeline-deadlock-guard

3 participants