[ENH]: Enable rebuilds for sharded collections #6916

Merged
tanujnay112 merged 5 commits into main from rebuild_sharding on Apr 17, 2026

Conversation

@tanujnay112 (Contributor) commented Apr 15, 2026

Description of changes

This change adds rebuild capabilities for sharded collections. A rebuild command can now specify the shard index it wants to rebuild; an unspecified index defaults to 0 (the first shard). That shard's file paths are set to empty during rebuild compaction, analogous to all of the segment's file paths being set to empty before this change. A new struct packaging the rebuild command parameters, called RebuildInfo, is now passed to CompactionContext; it contains the desired shard index to be rebuilt.
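The shard-defaulting behavior described above can be sketched roughly like this (field and method names here are illustrative, not necessarily the actual ones in the codebase):

```rust
// Hypothetical sketch of the RebuildInfo parameter object described above.
// Field and method names are illustrative, not the actual compaction-crate API.
#[derive(Debug, Clone)]
pub struct RebuildInfo {
    /// Segment scopes to rebuild; empty means all segments.
    pub segment_scopes: Vec<String>,
    /// Shard index requested by the caller, if any.
    pub shard_index: Option<u32>,
}

impl RebuildInfo {
    /// An unspecified shard index defaults to 0 (the first shard).
    pub fn effective_shard_index(&self) -> u32 {
        self.shard_index.unwrap_or(0)
    }
}
```

Keeping the field an `Option` preserves, at the API boundary, the distinction between "caller asked for shard 0" and "caller did not specify a shard".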

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

Tests have been added in compact.rs. They cover rebuilding various shards in both the full and partial rebuild scenarios.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?

tanujnay112 marked this pull request as ready for review on April 15, 2026 20:02
@propel-code-bot bot left a comment (Contributor)


Review found no issues; shard-aware rebuild enhancements appear well-implemented and sufficiently covered.

Status: No Issues Found | Risk: Low

Review Details

📁 12 files reviewed | 💬 0 comments

@propel-code-bot bot commented Apr 15, 2026

Enable shard-targeted rebuilds for sharded collections across compaction orchestration

This PR adds shard-aware rebuild functionality end-to-end in the Rust compaction pipeline. Rebuild requests can now optionally include a shard index via RebuildRequest.shard_index and CLI --shard, with default behavior targeting shard 0 when unspecified. Rebuild intent is refactored into a new RebuildInfo object that flows through CompactionContext, LogFetchOrchestrator, and CompactionManager instead of separate is_rebuild and apply_segment_scopes fields.

The rebuild execution path now operates on specific shards by selecting SegmentShard using the requested index, materializing only that shard while emitting empty placeholders for other shard slots, and clearing file paths only for the targeted shard using Segment.clear_shard_file_paths(). The PR also updates segment/shard helpers, prefetch behavior, and extensive integration-style tests in rust/worker/src/execution/orchestration/compact.rs to validate vector-only and all-segment rebuilds across multi-shard scenarios.
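A minimal sketch of the one-shard materialization shape described above, with hypothetical names (the real code operates on materialized log records and SegmentShard types, not strings):

```rust
// Illustrative sketch (not the actual worker code) of the rebuild
// materialization strategy: only the shard at the requested index is
// populated, while every other slot receives an empty placeholder so the
// output still has one entry per shard.
fn materialize_rebuild_shards(shard_count: usize, target: usize) -> Vec<Vec<String>> {
    (0..shard_count)
        .map(|i| {
            if i == target {
                // In the real pipeline this would be the materialized
                // records sourced from the shard being rebuilt.
                vec!["materialized-record".to_string()]
            } else {
                Vec::new() // empty placeholder for shards left untouched
            }
        })
        .collect()
}
```

Emitting placeholders keeps downstream consumers shard-count-aligned even though only one shard was actually rebuilt.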

This summary was automatically generated by @propel-code-bot

tanujnay112 (author) commented Apr 15, 2026

This stack of pull requests is managed by Graphite.

tanujnay112 requested a review from sanketkedia on April 15, 2026 20:09
tanujnay112 force-pushed the rebuild_sharding branch 3 times, most recently from d0bdf23 to e98b237 on April 16, 2026 07:29

segment_scopes: Vec<String>,
/// Specify which shard to rebuild (defaults to 0)
#[arg(long)]
shard: Option<u32>,
Contributor:

It would be nice to have the flexibility of specifying different shards for different collections in the batch, so Vec<Option<u32>>?

// Segment scopes to rebuild. If empty, rebuilds all segments (metadata + vector).
repeated SegmentScope segment_scopes = 2;
// Optional shard index to rebuild. If not specified, defaults to shard 0.
optional uint32 shard_index = 3;
Contributor:

since this is a batch api, why not have this a repeated field?

)
}
let log_task = if self.context.is_rebuild() {
let shard_count = output
Contributor:

nit: there should be a num_shards method in Segment type that you can use here

)
.await;

if record_segment_reader.is_none() {
Contributor:

How was this even correct? E.g. if a collection is compacted for the first time, won't this skip fetching the logs and return early?

Contributor Author:

from_segment from above always returned a reader so we never hit this

Contributor:

what you linked is of RecordSegmentWriterShard and not of RecordSegmentReader

Contributor:

But I found the relevant code that returns it. It returns a Vec<Option<RecordSegmentReaderShard>>, so it will return [None] for one shard with an uninitialized reader.

Contributor:

But you should still keep it, e.g. in case ok_or_terminate returns None because the reader creation errored out.

// Prefetch segments
- let prefetch_segments = match self.context.is_rebuild {
+ let prefetch_segments = match self.context.is_rebuild() {
true => vec![output.record_segment],
Contributor:

should prefetch the shard that is getting rebuilt here instead of the active shard

}

#[async_trait]
impl Handler<TaskResult<SourceRecordSegmentOutput, SourceRecordSegmentError>>
Contributor:

Add a TODO to clean up V1.


// Create a vector with empty shards for all positions except shard_index
let mut shards = vec![];
for i in 0..self.shard_count {
Contributor:

This and the block of code below are exactly the same, so they can be extracted into a helper.
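A hedged sketch of the kind of helper being suggested, with hypothetical names (the real element type would be a shard of materialized records, not a plain Vec):

```rust
// Hypothetical helper for the duplicated pattern above: build a Vec with
// default (empty) entries everywhere except the target shard index.
fn placeholder_shards<T: Default + Clone>(
    shard_count: usize,
    shard_index: usize,
    filled: T,
) -> Vec<T> {
    // Start with one default-constructed placeholder per shard slot...
    let mut shards = vec![T::default(); shard_count];
    // ...then fill in only the shard being rebuilt.
    if shard_index < shard_count {
        shards[shard_index] = filled;
    }
    shards
}
```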

@@ -935,4 +925,4 @@
collection_info.collection.total_records_post_compaction = output.total_records as u64;
Contributor:

This would be wrong, since total_records is only for that shard.

Contributor:

hmm, this might be a bit hard to compute now

Contributor Author:

Should we just not compute it?

collection_info.collection.total_records_post_compaction = output.total_records as u64;

// If no records, terminate early
if output.partitions.is_empty() {
Contributor:

This code needs to change now.

ids: id.iter().map(ToString::to_string).collect(),
}),
segment_scopes: proto_scopes,
shard_index: *shard,
@sanketkedia (Contributor) commented Apr 17, 2026:

Why deref here? Seems unnecessary.

ids: id.iter().map(ToString::to_string).collect(),
}),
segment_scopes: proto_scopes,
shard_index: *shard,
Contributor:

Also, would it be simpler for the server to just have a u32 instead of an Option, with the client setting it to 0 if None?

Contributor Author:

We effectively do this at the server level. I think it makes sense to distinguish at the client-server layer whether or not a shard was specified, say, in case we want a multi-shard rebuild API in the future.

// TODO(tanujnay112): This is awful, we need to find a better way to pass
// the active collection info around.
self.collection_info = OnceCell::new();
let rebuild_info = if is_getting_compacted_logs {
Contributor:

This is a bit weird. Clarifying my understanding: is it to avoid an abstraction violation, so that callers of the attached function do not have to construct a RebuildInfo struct? Because you are using Some(RebuildInfo) to implicitly mean "fetch logs from record segment" in LogFetchOrchestrator.

Contributor Author:

Yeah, for abstraction boundaries. In the attached-function backfill path I need to signal that backfilling is happening this way, without breaking the abstraction boundaries you mentioned.

blockfile_provider: &BlockfileProvider,
) {
- // Get offset IDs from vector segment
+ // Get offset IDs from vector segment using distributed reader (handles all shards)
Contributor:

nit: the old comment can be removed

}

// Also create RecordSegmentReaderShard for source operators
let record_segment_shard = match self
Contributor:

The construction of this segment shard and its reader should be done in the conditional below, in the if self.context.is_rebuild() path. No need to pay this cost in the general non-rebuild case.

vec![output.record_segment]
} else {
let mut segments = vec![output.metadata_segment, output.record_segment];
if vector_segment.r#type == chroma_types::SegmentType::Spann {
Contributor:

Also prefetch QuantizedSpann here.

Comment thread: rust/types/src/segment.rs (Outdated)
let vec = vec![self.new_shard()];
return Ok(vec);
}
// If there are no file paths, return a single empty shard
@sanketkedia (Contributor) commented Apr 17, 2026:

The Segment type also has a num_shards() method; you should use that here. It already has the validation logic for the shard count.
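A minimal sketch of what such shard-count validation might look like, assuming file paths are stored as a map from key to per-shard path lists (the actual Segment API and field layout may differ):

```rust
use std::collections::HashMap;

// Hypothetical sketch of num_shards()-style validation: every file-path key
// must report the same number of per-shard entries, otherwise the segment's
// shard count is ambiguous.
fn num_shards(file_path: &HashMap<String, Vec<String>>) -> Result<usize, String> {
    let mut counts = file_path.values().map(|paths| paths.len());
    match counts.next() {
        // No file paths yet: treat as a single, empty shard.
        None => Ok(1),
        Some(first) => {
            if counts.all(|c| c == first) {
                Ok(first)
            } else {
                Err("inconsistent shard counts across file path keys".into())
            }
        }
    }
}
```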

})
.collect();
MaterializeLogOutput {
result: PartitionedMaterializeLogsResult { shards },
Contributor:

As a cleanup we should rename this to Sharded*, since Partition is an existing concept in the codebase that is different from sharding.

Contributor Author:

Will do in a follow-up change.

Contributor:

Sounds good, added to the burndown doc.

tanujnay112 requested a review from sanketkedia on April 17, 2026 18:52
Comment thread: rust/types/src/segment.rs (Outdated)
pub fn get_shards(&self) -> Result<Vec<SegmentShard>, SegmentShardError> {
// If there are no file paths, return empty vector
let num_shards = self.num_shards()?;
if self.file_path.is_empty() {
Contributor:

This case should already be handled in SegmentShard::try_from. Can you please double-check?

vec![output.record_segment]
} else {
let mut segments = vec![output.metadata_segment, output.record_segment];
if vector_segment.r#type != chroma_types::SegmentType::HnswDistributed {
Contributor:

This won't work, because it will try to prefetch the local HNSW or SQLite segment types.

@sanketkedia (Contributor):

These changes need so much care, great job at that

tanujnay112 merged commit fdcd216 into main on Apr 17, 2026
122 of 124 checks passed
tanujnay112 added a commit that referenced this pull request Apr 19, 2026
This PR cherry-picks the commit fdcd216
onto release/2026-04-03. If there are unresolved conflicts, please
resolve them manually.

Co-authored-by: tanujnay112 <tanujnay112@live.com>