[ENH] Add maxscore index to metadata segment #6880

Open
Sicheng-Pan wants to merge 3 commits into hammad/maxscore_schema_gating from hammad/maxscore_segment_wiring

@Sicheng-Pan Sicheng-Pan commented Apr 10, 2026

Description of changes

This is PR 7 in the MaxScore stack. It wires MaxScoreWriter/MaxScoreReader into the metadata segment so that collections with algorithm: "max_score" in their schema use the new index format on both the write and read paths. The query pipeline is not yet connected — that follows in PR 8.

  • MaxScoreReader::count_postings() (maxscore.rs): New method that counts total posting entries for a dimension by summing len() across posting blocks. O(n_blocks). Used by the IDF operator in PR 8 to compute document frequency for BM25 scoring.
  • Metadata segment writer (blockfile_metadata.rs):
    • Added maxscore_index_writer: Option<MaxScoreWriter> field to MetadataSegmentWriterShard.
    • Added schema: Option<&Schema> parameter to both MetadataSegmentWriter::from_segment() and MetadataSegmentWriterShard::from_segment().
    • 3-way branch in writer construction:
      1. SPARSE_POSTING in file_path → fork MaxScore index (open reader + forked writer)
      2. SPARSE_MAX in file_path → fork existing WAND index (unchanged)
      3. Neither (fresh collection) → check schema.is_maxscore_enabled() to decide which writer to create
    • Only one of sparse_index_writer / maxscore_index_writer is Some at a time.
    • Dual dispatch in set_metadata/delete_metadata SparseVector arms — checks maxscore_index_writer first, falls back to sparse_index_writer.
  • Metadata segment flusher (blockfile_metadata.rs):
    • Replaced the single sparse_index_flusher: SparseFlusher field with an Option<SparseFlusher> plus an Option<MaxScoreFlusher>, so the flusher can carry either format.
    • commit() handles both writer paths; flush() conditionally inserts SPARSE_POSTING or SPARSE_MAX+SPARSE_OFFSET_VALUE into the flushed file_path map.
  • Metadata segment reader (blockfile_metadata.rs):
    • Added maxscore_index_reader: Option<MaxScoreReader> field to MetadataSegmentReaderShard.
    • SPARSE_POSTING blockfile loaded concurrently in the existing tokio::join!. If present, maxscore_index_reader is populated and the old WAND reader is skipped.
  • Call site updates (~27 sites):
    • 2 production orchestrators (log_fetch_orchestrator.rs, attached_function_orchestrator.rs) pass collection.schema.as_ref().
    • create_new_shard extracts schema from &Collection.
    • ~24 test sites pass None (backward-compatible — default WAND path).
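
The 3-way branch in writer construction can be sketched roughly as below. This is a minimal illustration, not the code in blockfile_metadata.rs: the `WriterChoice` enum, the `choose_writer` function, the stand-in `Schema` struct, and the key strings are all hypothetical, and only the branch order and `is_maxscore_enabled()` come from the description above.

```rust
use std::collections::HashMap;

// Placeholder key strings (assumption); the real constants may differ.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";

/// Hypothetical summary of which sparse writer a shard ends up with.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum WriterChoice {
    ForkMaxScore, // existing MaxScore index: open reader + forked writer
    ForkWand,     // existing WAND index: unchanged path
    NewMaxScore,  // fresh collection, schema opts in
    NewWand,      // fresh collection, default
}

/// Hypothetical stand-in for the collection schema.
struct Schema {
    maxscore: bool,
}

impl Schema {
    fn is_maxscore_enabled(&self) -> bool {
        self.maxscore
    }
}

/// Existing file_path keys take priority; the schema is only consulted
/// when neither sparse format is present (a fresh collection).
fn choose_writer(
    file_path: &HashMap<String, Vec<String>>,
    schema: Option<&Schema>,
) -> WriterChoice {
    if file_path.contains_key(SPARSE_POSTING) {
        WriterChoice::ForkMaxScore
    } else if file_path.contains_key(SPARSE_MAX) {
        WriterChoice::ForkWand
    } else if schema.map_or(false, Schema::is_maxscore_enabled) {
        WriterChoice::NewMaxScore
    } else {
        // A missing schema (None) also lands here, which is why the
        // test call sites that pass None keep the WAND path.
        WriterChoice::NewWand
    }
}
```

Because the function returns exactly one variant, the "only one writer is Some at a time" invariant holds by construction in this sketch; the real code expresses it with two Option fields instead.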

Test plan

  • All existing metadata segment tests pass unchanged (they pass None for schema, so the WAND writer is created as before).
  • Compilation verified for chroma-segment and worker crates.
  • Integration tests for the MaxScore write-then-read path will be added in PR 8 alongside the operator and orchestrator routing.

Migration plan

No migration needed. Existing collections with SPARSE_MAX+SPARSE_OFFSET_VALUE in their file_path continue to use the WAND reader/writer. New collections only get the MaxScore index if the schema has algorithm: "max_score" (set by the frontend gating in PR 6). The segment reader auto-detects which format is present based on file_path keys.
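
As a rough sketch of why no migration is needed: the keys written at flush time are what the reader later uses to pick a format. The names `SparseFormat`, `flushed_keys`, and `detect_format`, the key strings, and the requirement that both WAND keys be present are assumptions made for illustration and are not taken from the PR's code.

```rust
use std::collections::HashMap;

// Placeholder key strings (assumption); the real constants may differ.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";
const SPARSE_OFFSET_VALUE: &str = "sparse_offset_value";

/// Hypothetical tag for the on-disk sparse index format.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SparseFormat {
    MaxScore,
    Wand,
}

/// Keys a flush would add to the file_path map for each format.
fn flushed_keys(format: SparseFormat) -> Vec<&'static str> {
    match format {
        SparseFormat::MaxScore => vec![SPARSE_POSTING],
        SparseFormat::Wand => vec![SPARSE_MAX, SPARSE_OFFSET_VALUE],
    }
}

/// Reader-side detection: MaxScore wins if its key is present,
/// otherwise fall back to WAND, otherwise there is no sparse index.
fn detect_format(file_path: &HashMap<String, Vec<String>>) -> Option<SparseFormat> {
    if file_path.contains_key(SPARSE_POSTING) {
        Some(SparseFormat::MaxScore)
    } else if file_path.contains_key(SPARSE_MAX)
        && file_path.contains_key(SPARSE_OFFSET_VALUE)
    {
        Some(SparseFormat::Wand)
    } else {
        None
    }
}
```

Under these assumptions, a segment flushed before this PR only ever carries the WAND keys, so it round-trips to the WAND reader without any rewrite.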

Observability plan

No new metrics or spans. The 3-way branch in from_segment() is logged implicitly through existing tracing on blockfile open/create operations.

Documentation Changes

None.

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving.

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Sicheng-Pan commented Apr 10, 2026

@Sicheng-Pan Sicheng-Pan changed the title from "Wire MaxScore index into metadata segment writer, reader, and flusher" to "[ENH] Add maxscore index to metadata segment" Apr 10, 2026
@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review April 10, 2026 20:37

propel-code-bot bot commented Apr 10, 2026

MaxScore Sparse Index Wiring into Metadata Segments with Dual-Format Read/Write Support

This PR introduces end-to-end metadata-segment support for the new sparse MaxScore index format while preserving backward compatibility with the existing WAND sparse format. The core change is a new sparse index abstraction in rust/segment/src/blockfile_metadata.rs that routes writer/reader/flusher behavior based on existing segment file_path keys (SPARSE_POSTING vs SPARSE_MAX/SPARSE_OFFSET_VALUE) or schema settings for fresh collections via schema.is_maxscore_enabled().

It also updates execution call paths and operators so current query operators explicitly use WAND readers only, avoiding accidental use of MaxScore before full query-pipeline integration. Additional coverage includes a substantial multi-commit consistency test for MaxScore in metadata segments, plus a new MaxScoreReader posting-count method used by upcoming IDF/BM25 work.
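
For reference, the posting-count method can be pictured as a sum of block lengths. The `PostingBlock` layout and the free function below are hypothetical and only mirror the "sum len() across posting blocks" behaviour stated in the PR description; the real MaxScoreReader reads its blocks from a blockfile.

```rust
/// Hypothetical posting block: parallel arrays of document offsets and weights.
struct PostingBlock {
    offsets: Vec<u32>,
    weights: Vec<f32>,
}

impl PostingBlock {
    /// Number of posting entries stored in this block.
    fn len(&self) -> usize {
        self.offsets.len()
    }
}

/// Total posting entries for one dimension: one len() call per block,
/// so the cost grows with the number of blocks, not the number of entries.
fn count_postings(blocks: &[PostingBlock]) -> usize {
    blocks.iter().map(PostingBlock::len).sum()
}
```

The resulting count is the document frequency of the dimension, which is the quantity an IDF computation needs.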

This summary was automatically generated by @propel-code-bot

@propel-code-bot propel-code-bot bot left a comment

Review found no issues; changes appear consistent, backward-compatible, and safely gated for current query paths.

Status: No Issues Found | Risk: Low

Review Details

📁 9 files reviewed | 💬 0 comments

@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch 3 times, most recently from 3f7caf7 to f440c26 Compare April 13, 2026 17:56
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_schema_gating branch 2 times, most recently from 537161f to f75087d Compare April 13, 2026 18:19
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch from f440c26 to 0d36e48 Compare April 13, 2026 18:19
Comment thread rust/index/src/sparse/maxscore.rs Outdated
Comment thread rust/segment/src/blockfile_metadata.rs Outdated
Comment thread rust/segment/src/blockfile_metadata.rs Outdated
Comment thread rust/segment/src/blockfile_metadata.rs Outdated
Comment thread rust/segment/src/blockfile_metadata.rs Outdated
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_segment_wiring branch from 0d36e48 to 644f6bb Compare April 14, 2026 22:04
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_schema_gating branch from f75087d to 0c10ab4 Compare April 14, 2026 22:04