Skip to content

fix index status endpoint to reflect actual scan progress#404

Open
nyh wants to merge 1 commit into
scylladb:masterfrom
nyh:fix-595
Open

fix index status endpoint to reflect actual scan progress#404
nyh wants to merge 1 commit into
scylladb:masterfrom
nyh:fix-595

Conversation

@nyh
Copy link
Copy Markdown
Contributor

@nyh nyh commented Mar 29, 2026

fix index status endpoint to reflect actual scan progress

The GET /api/v1/indexes/{keyspace}/{index}/status endpoint was reporting
SERVING based solely on the node state machine, which transitions to
Serving when FullScanFinished is emitted. However, the ANN query endpoint
independently checks db_index.full_scan_progress() and returns 503 if the
scan is still InProgress.

These two checks could disagree. We noticed in Alternator Vector Search
when indexing a table with numeric keys (instead of string), that a
separate bug in the vector store caused the indexing to fail, after
which the "status" API reported the index is SERVING - while the "ann"
API refused to proceed because it knew the indexing never finished.

These two APIs need to be consistent: If an "ann" will never succeed
because the indexing is permenently stuck in the middle (because of
a bug...), then the "status" API should also permanently report
BOOTSTRAPPING, not SERVING.

Fix by making get_index_status use the same full_scan_progress() check as
the ANN endpoint: report BOOTSTRAPPING whenever the scan is still InProgress,
regardless of the state machine.

Fixes VECTOR-595

@github-actions github-actions Bot added the P2 label Mar 29, 2026
@nyh nyh force-pushed the fix-595 branch 2 times, most recently from 6933f28 to 02dff74 Compare March 29, 2026 17:51
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented Mar 29, 2026

I can't add reviewers (!)
So @QuerthDP @ewienik @m-szymon can one of you please review this patch?
It's not critical - fixing the actual bug with numeric keys is more important - but it's helpful to prevent one bug from cascading into a more bizarre bug.

By the way, is there a fix for the numeric key bug already?

QuerthDP
QuerthDP previously approved these changes Mar 30, 2026
Copy link
Copy Markdown
Member

@QuerthDP QuerthDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so as far as I understand, this problem could occur only when the initial_scan function finishes, which leads to node_state.send_event(Event::FullScanFinished, but it did finish because of failure and the full_scan_progress() does not reflect that it was done succesfully.

It looks fine for me. The only thing I don't quite like is the comment - it does not describe this situation well enough for me.
By the way will we exit the BOOTSTRAPPING status ever if the full scan failed?
I think not. Maybe we could consider adding some Error state instead.
@ewienik wdyt?

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented Mar 30, 2026

Ok, so as far as I understand, this problem could occur only when the initial_scan function finishes, which leads to node_state.send_event(Event::FullScanFinished, but it did finish because of failure and the full_scan_progress() does not reflect that it was done succesfully.

It looks fine for me. The only thing I don't quite like is the comment - it does not describe this situation well enough for me. By the way will we exit the BOOTSTRAPPING status ever if the full scan failed? I think not. Maybe we could consider adding some Error state instead. @ewienik wdyt?

I've doubt about this solution. We should trust NodeState as it it used also for calculating state of the full node. We should fix db_index to send correctly FullScanFinished instead of providing workaround. I'm doubt now that we should have full_scan_progress in db_index actor at all - we should have this information in NodeState to avoid confusion. As this is different issue I'll prepare a ticket for removing full_scan_progress from db_index.

Comment on lines +319 to +321
// Use the same full-scan-progress check as the ANN endpoint: if the scan
// is still in progress, report BOOTSTRAPPING even if the state machine
// already transitioned to Serving.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should trust NodeState. If it has wrong value we should fix db_index or node_state to correct this. As db_index knows the fullscan progress it should correctly send event to the node_state.
We should also have a validator test to check this behavior.

@swasik
Copy link
Copy Markdown
Collaborator

swasik commented Apr 20, 2026

@nyh would you like to proceed with this?

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 10, 2026

@nyh would you like to proceed with this?

I forgot about it. I'll rebase and address reviews now.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 10, 2026

I've doubt about this solution. We should trust NodeState as it it used also for calculating state of the full node. We should fix db_index to send correctly FullScanFinished instead of providing workaround. I'm doubt now that we should have full_scan_progress in db_index actor at all - we should have this information in NodeState to avoid confusion. As this is different issue I'll prepare a ticket for removing full_scan_progress from db_index.

I'm cleaning up this patch so it could be merged, but if you think this is the wrong approach, then please do open a ticket (I don't know if you did, JIRA is crap, so please explicitly point a link here if you did) and I consider fixing that way.

In any case, there is a bug here that needs to be fixed - the vector store right now has two ways to check if indexing finished, and they return different things in case the indexing failed completely because of a bug.

By the way, it's important that if the indexing failed because of a bug, we must not report that it finished and allow making queries (which means the fix can't be in the opposite direction than what I did!). @QuerthDP wondered if maybe we should switched from a "BOOTSTRAPPING" state to some permanent "FAILED" state. Maybe, but maybe not: I would have liked (I don't know if it's based on reality, but one can hope) that if the vector store is restarted after patching it, maybe it will restart the failed indexing, and this time be able to finish the bootstrapping. So keeping it in BOOTSTRAPPING until it finally finishes is sort of what users expect, I think.

The GET /api/v1/indexes/{keyspace}/{index}/status endpoint was reporting
SERVING based solely on the node state machine, which transitions to
Serving when FullScanFinished is emitted. However, the ANN query endpoint
independently checks db_index.full_scan_progress() and returns 503 if the
scan is still InProgress.

These two checks could disagree. We noticed in Alternator Vector Search
when indexing a table with numeric keys (instead of string), that a
separate bug in the vector store caused the indexing to fail, after
which the "status" API reported the index is SERVING - while the "ann"
API refused to proceed because it knew the indexing never finished.

These two APIs need to be consistent: If an "ann" will never succeed
because the indexing is permenently stuck in the middle (because of
a bug...), then the "status" API should also permanently report
BOOTSTRAPPING, not SERVING.

Fix by making get_index_status use the same full_scan_progress() check as
the ANN endpoint: report BOOTSTRAPPING whenever the scan is still InProgress,
regardless of the state machine.

Fixes VECTOR-595

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

x
@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 10, 2026

Sent a new version with silly fixes to please "clippy".
Please note that clippy still has complaints about other files NOT related to this patch. I didn't fix those, and the CI is pleased nontheless. What the heck?

@swasik and others, please consider your repository's configuration which currently removes reviews (marks them "stale") after someone pushes a new version, even if it's just a style fix. This is frustrating - @QuerthDP already "approved" this PR and @ewienik "disapproved" it, but after I pushed the clippy-pleasing change, the approval was deleted but disapproval remains. This makes very little sense.

According to AI, the relevant github configuration is "Dismiss stale pull request approvals when new commits are pushed." which you apparently set to on.

@QuerthDP
Copy link
Copy Markdown
Member

According to AI, the relevant github configuration is "Dismiss stale pull request approvals when new commits are pushed." which you apparently set to on.

Does not it make sense that the old approval is gone when new changes are introduced?
I fully agree it's frustrating to dismiss the approvals by just introducing some silly style fixes, but I think preventing the stale reviews from being effective when the PR changes significantly is better, even in case the reviewer has to re-approve after little changes.

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

According to AI, the relevant github configuration is "Dismiss stale pull request approvals when new commits are pushed." which you apparently set to on.

Does not it make sense that the old approval is gone when new changes are introduced? I fully agree it's frustrating to dismiss the approvals by just introducing some silly style fixes, but I think preventing the stale reviews from being effective when the PR changes significantly is better, even in case the reviewer has to re-approve after little changes.

I've setup this configuration manually. IMHO after a small change someone needs to review it before merging. So the best person can be the original reviewer. I think this is even more important in AI-assisted code.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 11, 2026

According to AI, the relevant github configuration is "Dismiss stale pull request approvals when new commits are pushed." which you apparently set to on.

Does not it make sense that the old approval is gone when new changes are introduced?

In my opinion, no, I think it does not make sense:

The review process sometimes begins with big overhauls in the beginning, which may need everyone to re-review (but even that, I'm not certain). But the review process almost always ends with a string of nitpicks, silly clippy fixes, and so on. As your setup currently dictates, those silly changes at the end forces everyone that had already approved to approve again. Do you expect these reviewers to really do a full review at the end (probably not, or we'll end up in an endless loop)? Or do you expect them to just rubber-stamp their old review and mark "approved" again without reading? Isn't it better to just leave the old review (also making it clear it's an old review) then to force the reviewer to pretend (that that what it is...) that they reviewed again?

In an ideal world, reviewers would have the energy to review, and re-review, and re-re-review after every tiny change. This actually works with copilot's reviews. It does NOT work with human reviewers. It works even worse when several reviewers are involved, and you "stale" on reviewer's decision because the developer addressed some other reviewer's request.

Finally, as I noted, it's kind of strange that negative reviews stick even after changes, but positive reviews does not. Imagine that a certain PR has 10 approvals and one negative review asking for a tiny change. The developer fixes this tiny request, and now... You have a PR with zero approvals and one negative review. From something that had 10 approvals, now fixing one tiny nitpick the PR suddenly looks like a disaster that nobody approved and only had a negative review.

I fully agree it's frustrating to dismiss the approvals by just introducing some silly style fixes, but I think preventing the stale reviews from being effective when the PR changes significantly is better, even in case the reviewer has to re-approve after little changes.

Do the reviewers agree with you and actually come to re-approve after every little change? I, as a human reviewer, will not agree to that. I will not re-review a PR that I reviewed earlier just because the developer fixed a few spelling errors and now he expects me to come in again, remember what this PR was all about, read it again and check the only thing that happened was fixing spelling errors.

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

The review process sometimes begins with big overhauls in the beginning, which may need everyone to re-review (but even that, I'm not certain). But the review process almost always ends with a string of nitpicks, silly clippy fixes, and so on. As your setup currently dictates, those silly changes at the end forces everyone that had already approved to approve again. Do you expect these reviewers to really do a full review at the end (probably not, or we'll end up in an endless loop)? Or do you expect them to just rubber-stamp their old review and mark "approved" again without reading? Isn't it better to just leave the old review (also making it clear it's an old review) then to force the reviewer to pretend (that that what it is...) that they reviewed again?

Checking if there are small formatting changes seems really quick - git range-diff works really well in this situation. As a maintainer if I see that commits are force pushed I need to check them always before pushing Merge button. And I prefer that PR status reflects this situation and forces me to re-review. In case there were several approvals in history - we don't need all of the reviewers to re-review - one will be enough to re-review and push to merge.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 11, 2026

Checking if there are small formatting changes seems really quick - git range-diff works really well in this situation.

I review in github's UI. Does github's UI have this option?

As a maintainer if I see that commits are force pushed I need to check them always before pushing Merge button. And I prefer that PR status reflects this situation and forces me to re-review. In case there were several approvals in history - we don't need all of the reviewers to re-review - one will be enough to re-review and push to merge.

It seems you're assuming that usually, there will be just one reviewer, basically the person who merges the code - either there are really no other reviewers, or if there are, their approvals will always be forgotten before the final merge and not taken into account in the final decision. I personally like to see when I merge some code that other people looked at it too and approved it. Even if I know the other reviewers' approval happened a few days ago before the last-moment changes that happened an hour ago, I assume that the old reviewers did look at the bulk of the changes. As a maintainer I also see if the developer completely rewrote his code and can ask for re-reviews (there is a "ask for re-review" button).

@QuerthDP
Copy link
Copy Markdown
Member

Checking if there are small formatting changes seems really quick - git range-diff works really well in this situation.

I review in github's UI. Does github's UI have this option?

The "Compare" button next to the force-push works fine for me in most cases. The problem begins when one rebases the PR, then the nitpicks and all the commits from master gets mixed and requires re-review of the whole patch.

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

I've doubt about this solution. We should trust NodeState as it it used also for calculating state of the full node. We should fix db_index to send correctly FullScanFinished instead of providing workaround. I'm doubt now that we should have full_scan_progress in db_index actor at all - we should have this information in NodeState to avoid confusion. As this is different issue I'll prepare a ticket for removing full_scan_progress from db_index.

I'm cleaning up this patch so it could be merged, but if you think this is the wrong approach, then please do open a ticket (I don't know if you did, JIRA is crap, so please explicitly point a link here if you did) and I consider fixing that way.

In any case, there is a bug here that needs to be fixed - the vector store right now has two ways to check if indexing finished, and they return different things in case the indexing failed completely because of a bug.

Currently we have one way to check index status - using NodeState. DbIndex sends events about the fullscan process state - it is one of the source of information about a current index status, NodeState compute index status using all events. Progress of fullscan is used for informative reasons in errors returned by HTTP API. If there is a mismatch between current fullscan state and index status then it is a bug. We need to provide test which repeat this behavior and fix the root cause.

By the way, it's important that if the indexing failed because of a bug, we must not report that it finished and allow making queries (which means the fix can't be in the opposite direction than what I did!). @QuerthDP wondered if maybe we should switched from a "BOOTSTRAPPING" state to some permanent "FAILED" state. Maybe, but maybe not: I would have liked (I don't know if it's based on reality, but one can hope) that if the vector store is restarted after patching it, maybe it will restart the failed indexing, and this time be able to finish the bootstrapping. So keeping it in BOOTSTRAPPING until it finally finishes is sort of what users expect, I think.

I don't understand the root cause about this bug. I understand that there is mismatch in index status, but I don't understand when it is visible. Can we reproduce this? Can we prepare a test which will fail?

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

Checking if there are small formatting changes seems really quick - git range-diff works really well in this situation.

I review in github's UI. Does github's UI have this option?

The "Compare" button next to the force-push works fine for me in most cases. The problem begins when one rebases the PR, then the nitpicks and all the commits from master gets mixed and requires re-review of the whole patch.

For this reason I prepare range-diff myself while preparing push to the PR - to help reviewers with that. I don't use GitHub for range-diff for other developer's patches - I'm doing that on my laptop. GitHub has its own limitation in review :-(

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

It seems you're assuming that usually, there will be just one reviewer, basically the person who merges the code - either there are really no other reviewers, or if there are, their approvals will always be forgotten before the final merge and not taken into account in the final decision. I personally like to see when I merge some code that other people looked at it too and approved it. Even if I know the other reviewers' approval happened a few days ago before the last-moment changes that happened an hour ago, I assume that the old reviewers did look at the bulk of the changes. As a maintainer I also see if the developer completely rewrote his code and can ask for re-reviews (there is a "ask for re-review" button).

If there are more reviewers I see their approvals in the timeline. It seems like a trade-off - see approvals in status or see the current status and be forced to do short re-review.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 11, 2026

I don't understand the root cause about this bug. I understand that there is mismatch in index status, but I don't understand when it is visible. Can we reproduce this? Can we prepare a test which will fail?

As I explain in the commit message, this bug became visible because of another bug - the vector store had a bug reading numeric keys and failed the indexing. When this failed indexing happened, we had this bug: the "status" API began to return "SERVING", telling the Alternator user that index is ready, and yet when a ANN query was attempted, it failed telling the index wasn't ready.

This isn't a "serious" bug - it won't happen if the vector store doesn't already have another bug that causes indexing to fail, and also it doesn't break any working request (the ANN had to fail anyway). But it is just confusing and misleading - one API tells you the index is ready, a different one tells you the index isn't ready.

I don't know much about vector store internals, but it's clear to me that the source-of-truth here is whatever the ANN consults: If ANN thinks the index isn't available, then it isn't, and "status" should also say that. This is why I fixed "status" in my patch.

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 11, 2026

I don't understand the root cause about this bug. I understand that there is mismatch in index status, but I don't understand when it is visible. Can we reproduce this? Can we prepare a test which will fail?

As I explain in the commit message, this bug became visible because of another bug - the vector store had a bug reading numeric keys and failed the indexing. When this failed indexing happened, we had this bug: the "status" API began to return "SERVING", telling the Alternator user that index is ready, and yet when a ANN query was attempted, it failed telling the index wasn't ready.

This isn't a "serious" bug - it won't happen if the vector store doesn't already have another bug that causes indexing to fail, and also it doesn't break any working request (the ANN had to fail anyway). But it is just confusing and misleading - one API tells you the index is ready, a different one tells you the index isn't ready.

I don't know much about vector store internals, but it's clear to me that the source-of-truth here is whatever the ANN consults: If ANN thinks the index isn't available, then it isn't, and "status" should also say that. This is why I fixed "status" in my patch.

The source-of-truth is NodeState. It seems that the problem is withing post_index_ann and not in get_index_status. Handler for HTTP index ann is wrong - it should depend on NodeState and not full scan progress from db_index. Handler for HTTP index status does it correctly - it checks NodeState. With the introduction of index routing it seems the issue is fixed - fullscan progress is retrieved only for information when index is not serving.

The behavior could likely be tested with test/integration tests by simulating not finished fullscan and sending event that fullscan was finished.

@nyh
Copy link
Copy Markdown
Contributor Author

nyh commented May 11, 2026

The source-of-truth is NodeState. It seems that the problem is withing post_index_ann and not in get_index_status. Handler for HTTP index ann is wrong - it should depend on NodeState and not full scan progress from db_index. Handler for HTTP index status does it correctly - it checks NodeState

It sounds like you're telling me that the right behavior is that the ann function should allow queries to proceed if the indexing stopped in a failure.
I think this would be a mistake, and a step backward - not a step forward.
I think that if we know that the indexing failed (and again, this can only happen if vector store has another bug - it can't happen in regular operation), we must not allow any query to proceed. We can't just ignore the fact that the indexing failed.

@ewienik
Copy link
Copy Markdown
Collaborator

ewienik commented May 12, 2026

The source-of-truth is NodeState. It seems that the problem is withing post_index_ann and not in get_index_status. Handler for HTTP index ann is wrong - it should depend on NodeState and not full scan progress from db_index. Handler for HTTP index status does it correctly - it checks NodeState

It sounds like you're telling me that the right behavior is that the ann function should allow queries to proceed if the indexing stopped in a failure. I think this would be a mistake, and a step backward - not a step forward. I think that if we know that the indexing failed (and again, this can only happen if vector store has another bug - it can't happen in regular operation), we must not allow any query to proceed. We can't just ignore the fact that the indexing failed.

This is an issue, IMHO it is different one. Maybe we should extend IndexStatus with another state - Failed, or something else, to show that something bad happened to the index. But halting searches on such index is the real question. Building index could takes hours - if there is an error with a single vector - should we disable working on such index? If an error happened during CDC, not fullscan (it is possible use case - the customer wants to create an empty index, then add vectors using cdc - to be able to search during index building) - should we disallow searches on this index? If there is no memory on the instance and we are not able to add new indexes - it is an error of index building - should we halt searching on this index? There are trade-offs.
From the logic of vector-store state the actual index state should be separate from fullscan and it is now in NodeState, so all HTTP routes should check NodeState instead of full scan progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants