[ENH] Add maxscore index to metadata segment by Sicheng-Pan · Pull Request #6880 · chroma-core/chroma

Sicheng-Pan · 2026-04-10T20:36:52Z

Description of changes

This is PR 7 in the MaxScore stack. It wires MaxScoreWriter/MaxScoreReader into the metadata segment so that collections with algorithm: "max_score" in their schema use the new index format on both the write and read paths. The query pipeline is not yet connected — that follows in PR 8.

MaxScoreReader::count_postings() (maxscore.rs): New method that counts total posting entries for a dimension by summing len() across posting blocks. O(n_blocks). Used by the IDF operator in PR 8 to compute document frequency for BM25 scoring.
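The counting described above can be sketched as follows. This is a minimal illustration, not the code in maxscore.rs: `PostingBlock`, its `doc_ids` field, and the free functions `count_postings` and `bm25_idf` are hypothetical stand-ins. The IDF formula shown is the common smoothed BM25 variant, which is an assumption; the PR does not say which variant the PR 8 operator uses.

```rust
/// Hypothetical stand-in for one posting block of a single dimension.
struct PostingBlock {
    doc_ids: Vec<u32>,
}

impl PostingBlock {
    /// Number of posting entries stored in this block.
    fn len(&self) -> usize {
        self.doc_ids.len()
    }
}

/// Counts all postings for a dimension by summing block lengths.
/// One pass over the blocks, so the cost is O(n_blocks).
fn count_postings(blocks: &[PostingBlock]) -> usize {
    blocks.iter().map(PostingBlock::len).sum()
}

/// Smoothed BM25 IDF: ln((N - df + 0.5) / (df + 0.5) + 1).
/// Assumption: the real IDF operator may use a different variant.
fn bm25_idf(total_docs: usize, doc_freq: usize) -> f64 {
    let n = total_docs as f64;
    let df = doc_freq as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}
```

Under these assumptions, the posting count of a dimension is passed as `doc_freq`, so rarer dimensions receive a larger IDF weight.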
Metadata segment writer (blockfile_metadata.rs):
- Added maxscore_index_writer: Option<MaxScoreWriter> field to MetadataSegmentWriterShard.
- Added schema: Option<&Schema> parameter to both MetadataSegmentWriter::from_segment() and MetadataSegmentWriterShard::from_segment().
- 3-way branch in writer construction:
  1. SPARSE_POSTING in file_path → fork MaxScore index (open reader + forked writer)
  2. SPARSE_MAX in file_path → fork existing WAND index (unchanged)
  3. Neither (fresh collection) → check schema.is_maxscore_enabled() to decide which writer to create
- Only one of sparse_index_writer / maxscore_index_writer is Some at a time.
- Dual dispatch in set_metadata/delete_metadata SparseVector arms — checks maxscore_index_writer first, falls back to sparse_index_writer.
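The 3-way branch above can be sketched as a pure decision function. This is a sketch under stated assumptions: the key strings, the `Schema` struct, the `SparseWriterKind` enum, and `choose_sparse_writer` are hypothetical names for illustration. The real code opens blockfile readers and writers rather than returning an enum.

```rust
use std::collections::HashMap;

// Placeholder key strings; the real SPARSE_POSTING / SPARSE_MAX constants
// are defined in the Chroma codebase and may have different values.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";

/// Hypothetical stand-in for the collection schema.
struct Schema {
    maxscore: bool,
}

impl Schema {
    fn is_maxscore_enabled(&self) -> bool {
        self.maxscore
    }
}

/// Which sparse index writer a shard ends up with (hypothetical enum).
#[derive(Debug, PartialEq)]
enum SparseWriterKind {
    ForkedMaxScore,
    ForkedWand,
    FreshMaxScore,
    FreshWand,
}

/// Existing files win over the schema: a segment that already holds one
/// format keeps it, and the schema is consulted only for fresh segments.
fn choose_sparse_writer(
    file_path: &HashMap<String, Vec<String>>,
    schema: Option<&Schema>,
) -> SparseWriterKind {
    if file_path.contains_key(SPARSE_POSTING) {
        SparseWriterKind::ForkedMaxScore
    } else if file_path.contains_key(SPARSE_MAX) {
        SparseWriterKind::ForkedWand
    } else if schema.map_or(false, |s| s.is_maxscore_enabled()) {
        SparseWriterKind::FreshMaxScore
    } else {
        // No schema (e.g. test call sites passing None) defaults to WAND.
        SparseWriterKind::FreshWand
    }
}
```

Checking the file keys before the schema is what keeps an existing WAND segment on WAND even if its collection schema later enables MaxScore.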
Metadata segment flusher (blockfile_metadata.rs):
- Changed sparse_index_flusher from SparseFlusher to Option<SparseFlusher> and added an Option<MaxScoreFlusher> alongside it.
- commit() handles both writer paths; flush() conditionally inserts SPARSE_POSTING or SPARSE_MAX+SPARSE_OFFSET_VALUE into the flushed file_path map.
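The conditional insertion in flush() might look roughly like the sketch below. The struct names, field names, and key strings are hypothetical, and the real flushers produce blockfile IDs asynchronously. Only the shape of the resulting file_path map is illustrated.

```rust
use std::collections::HashMap;

// Placeholder key strings; the real constants may have different values.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";
const SPARSE_OFFSET_VALUE: &str = "sparse_offset_value";

/// Hypothetical result of flushing the WAND index (two blockfiles).
struct WandFlushed {
    max_id: String,
    offset_value_id: String,
}

/// Hypothetical result of flushing the MaxScore index (one blockfile).
struct MaxScoreFlushed {
    posting_id: String,
}

/// Builds the sparse part of the flushed file_path map. Since at most one
/// writer exists per shard, at most one of the two arguments is Some.
fn flushed_sparse_paths(
    wand: Option<WandFlushed>,
    maxscore: Option<MaxScoreFlushed>,
) -> HashMap<String, Vec<String>> {
    let mut file_path = HashMap::new();
    if let Some(m) = maxscore {
        file_path.insert(SPARSE_POSTING.to_string(), vec![m.posting_id]);
    }
    if let Some(w) = wand {
        file_path.insert(SPARSE_MAX.to_string(), vec![w.max_id]);
        file_path.insert(SPARSE_OFFSET_VALUE.to_string(), vec![w.offset_value_id]);
    }
    file_path
}
```

Because the keys written here are the same ones the writer and reader inspect on the next open, the flushed map is what carries the format choice forward between commits.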
Metadata segment reader (blockfile_metadata.rs):
- Added maxscore_index_reader: Option<MaxScoreReader> field to MetadataSegmentReaderShard.
- SPARSE_POSTING blockfile loaded concurrently in the existing tokio::join!. If present, maxscore_index_reader is populated and the old WAND reader is skipped.
Call site updates (~27 sites):
- 2 production orchestrators (log_fetch_orchestrator.rs, attached_function_orchestrator.rs) pass collection.schema.as_ref().
- create_new_shard extracts schema from &Collection.
- ~24 test sites pass None (backward-compatible — default WAND path).

Test plan

All existing metadata segment tests pass unchanged (they pass None for schema, so the WAND writer is created as before).
Compilation verified for chroma-segment and worker crates.
Integration tests for the MaxScore write-then-read path will be added in PR 8 alongside the operator and orchestrator routing.

Migration plan

No migration needed. Existing collections with SPARSE_MAX+SPARSE_OFFSET_VALUE in their file_path continue to use the WAND reader/writer. New collections only get the MaxScore index if the schema has algorithm: "max_score" (set by the frontend gating in PR 6). The segment reader auto-detects which format is present based on file_path keys.

Observability plan

No new metrics or spans. The 3-way branch in from_segment() is logged implicitly through existing tracing on blockfile open/create operations.

Documentation Changes

None.

github-actions · 2026-04-10T20:37:01Z

Sicheng-Pan · 2026-04-10T20:37:12Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

[ENH] Wire maxscore reader in search #6899
[ENH] Add maxscore index to metadata segment #6880 👈 (View in Graphite)
[ENH] Add maxscore option in schema #6878
[ENH] Benchmark maxscore #6866
[ENH] Add SIMD for maxscore #6865
[ENH] Add maxscore lazy cursor #6829
[ENH] Add basic maxscore writer/reader #6825
[ENH] Add SparsePostingBlock #6823
main

This stack of pull requests is managed by Graphite.

propel-code-bot · 2026-04-10T20:38:00Z

MaxScore Sparse Index Wiring into Metadata Segments with Dual-Format Read/Write Support

This PR introduces end-to-end metadata-segment support for the new sparse MaxScore index format while preserving backward compatibility with the existing WAND sparse format. The core change is a new sparse index abstraction in rust/segment/src/blockfile_metadata.rs that routes writer/reader/flusher behavior based on existing segment file_path keys (SPARSE_POSTING vs SPARSE_MAX/SPARSE_OFFSET_VALUE) or schema settings for fresh collections via schema.is_maxscore_enabled().

It also updates execution call paths and operators so current query operators explicitly use WAND readers only, avoiding accidental use of MaxScore before full query-pipeline integration. Additional coverage includes a substantial multi-commit consistency test for MaxScore in metadata segments, plus a new MaxScoreReader posting-count method used by upcoming IDF/BM25 work.

This summary was automatically generated by @propel-code-bot

propel-code-bot

Review found no issues; changes appear consistent, backward-compatible, and safely gated for current query paths.

Status: No Issues Found | Risk: Low

Review Details

📁 9 files reviewed | 💬 0 comments

Sicheng-Pan mentioned this pull request Apr 10, 2026

[ENH] Add maxscore option in schema #6878 (Open)

This was referenced Apr 10, 2026

[ENH] Add SparsePostingBlock #6823 (Open)
[ENH] Add basic maxscore writer/reader #6825 (Open)
[ENH] Add maxscore lazy cursor #6829 (Open)
[ENH] Add SIMD for maxscore #6865 (Open)
[ENH] Benchmark maxscore #6866 (Open)

Sicheng-Pan changed the title from "Wire MaxScore index into metadata segment writer, reader, and flusher" to "[ENH] Add maxscore index to metadata segment" Apr 10, 2026

Sicheng-Pan marked this pull request as ready for review April 10, 2026 20:37

propel-code-bot bot reviewed Apr 10, 2026

Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch 3 times, most recently from 3f7caf7 to f440c26 Compare April 13, 2026 17:56

Sicheng-Pan force-pushed the hammad/maxscore_schema_gating branch 2 times, most recently from 537161f to f75087d Compare April 13, 2026 18:19

Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch from f440c26 to 0d36e48 Compare April 13, 2026 18:19

Sicheng-Pan mentioned this pull request Apr 13, 2026

[ENH] Wire maxscore reader in search #6899 (Open)