[ENH] Add maxscore index to metadata segment #6880
Sicheng-Pan wants to merge 3 commits into hammad/maxscore_schema_gating
This PR introduces end-to-end metadata-segment support for the new sparse It also updates execution call paths and operators so current query operators explicitly use This summary was automatically generated by @propel-code-bot

Description of changes
This is PR 7 in the MaxScore stack. It wires `MaxScoreWriter`/`MaxScoreReader` into the metadata segment so that collections with `algorithm: "max_score"` in their schema use the new index format on both the write and read paths. The query pipeline is not yet connected; that follows in PR 8.

- `MaxScoreReader::count_postings()` (`maxscore.rs`): New method that counts the total posting entries for a dimension by summing `len()` across posting blocks. O(n_blocks). Used by the IDF operator in PR 8 to compute document frequency for BM25 scoring.
- Writer (`blockfile_metadata.rs`):
  - Adds a `maxscore_index_writer: Option<MaxScoreWriter>` field to `MetadataSegmentWriterShard`.
  - Adds a `schema: Option<&Schema>` parameter to both `MetadataSegmentWriter::from_segment()` and `MetadataSegmentWriterShard::from_segment()`.
  - `from_segment()` picks the writer with a 3-way branch:
    - `SPARSE_POSTING` in file_path → fork MaxScore index (open reader + forked writer)
    - `SPARSE_MAX` in file_path → fork existing WAND index (unchanged)
    - Neither key present → check `schema.is_maxscore_enabled()` to decide which writer to create
  - Only one of `sparse_index_writer`/`maxscore_index_writer` is `Some` at a time.
  - `set_metadata`/`delete_metadata` `SparseVector` arms: check `maxscore_index_writer` first, fall back to `sparse_index_writer`.
- Flusher (`blockfile_metadata.rs`):
  - Changes `sparse_index_flusher: SparseFlusher` to `Option<SparseFlusher>` + `Option<MaxScoreFlusher>`.
  - `commit()` handles both writer paths; `flush()` conditionally inserts `SPARSE_POSTING` or `SPARSE_MAX` + `SPARSE_OFFSET_VALUE` into the flushed file_path map.
- Reader (`blockfile_metadata.rs`):
  - Adds a `maxscore_index_reader: Option<MaxScoreReader>` field to `MetadataSegmentReaderShard`.
  - The `SPARSE_POSTING` blockfile is loaded concurrently in the existing `tokio::join!`. If present, `maxscore_index_reader` is populated and the old WAND reader is skipped.
- Callers (`log_fetch_orchestrator.rs`, `attached_function_orchestrator.rs`) pass `collection.schema.as_ref()`.
  - `create_new_shard` extracts the schema from `&Collection`.
  - Passing `None` keeps the default WAND path (backward-compatible).

Test plan
- Existing tests pass `None` for the schema, so the WAND writer is created as before.
- `chroma-segment` and `worker` crates.

Migration plan
No migration needed. Existing collections with `SPARSE_MAX` + `SPARSE_OFFSET_VALUE` in their file_path continue to use the WAND reader/writer. New collections only get the MaxScore index if the schema has `algorithm: "max_score"` (set by the frontend gating in PR 6). The segment reader auto-detects which format is present based on file_path keys.

Observability plan
No new metrics or spans. The 3-way branch in `from_segment()` is logged implicitly through existing tracing on blockfile open/create operations.

Documentation Changes
None.
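
For reviewers, the writer selection described under "Description of changes" can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: the `WriterChoice` enum, the `choose_writer` function, the one-field `Schema` struct, and the string values of the two key constants are all hypothetical stand-ins. Only the key names `SPARSE_POSTING` and `SPARSE_MAX`, the `is_maxscore_enabled()` check, and the order of the checks are taken from the description.

```rust
use std::collections::HashMap;

// Hypothetical values: the description names these keys but not their contents.
const SPARSE_POSTING: &str = "sparse_posting";
const SPARSE_MAX: &str = "sparse_max";

/// Hypothetical enum naming the possible outcomes of the branch.
#[derive(Debug, PartialEq)]
enum WriterChoice {
    ForkMaxScore, // existing MaxScore index found in file_path
    ForkWand,     // existing WAND index found in file_path
    NewMaxScore,  // fresh segment, schema opts into max_score
    NewWand,      // fresh segment, default path
}

/// Hypothetical stand-in for the real schema type; only the flag the
/// branch consults is modelled here.
struct Schema {
    maxscore: bool,
}

impl Schema {
    fn is_maxscore_enabled(&self) -> bool {
        self.maxscore
    }
}

/// Existing files decide first; the schema is consulted only when the
/// segment has no sparse index files yet. A missing schema falls back to WAND.
fn choose_writer(
    file_path: &HashMap<String, Vec<String>>,
    schema: Option<&Schema>,
) -> WriterChoice {
    if file_path.contains_key(SPARSE_POSTING) {
        WriterChoice::ForkMaxScore
    } else if file_path.contains_key(SPARSE_MAX) {
        WriterChoice::ForkWand
    } else if schema.map_or(false, |s| s.is_maxscore_enabled()) {
        WriterChoice::NewMaxScore
    } else {
        WriterChoice::NewWand
    }
}
```

Because the file_path keys are checked before the schema, a segment that already holds a WAND index keeps using WAND even if the schema later enables `max_score`, which is consistent with the "no migration needed" claim above.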