fix index status endpoint to reflect actual scan progress#404
Conversation
6933f28 to
02dff74
Compare
|
I can't add reviewers (!) By the way, is there a fix for the numeric key bug already? |
QuerthDP
left a comment
There was a problem hiding this comment.
Ok, so as far as I understand, this problem could occur only when the initial_scan function finishes, which leads to node_state.send_event(Event::FullScanFinished, but it did finish because of failure and the full_scan_progress() does not reflect that it was done succesfully.
It looks fine for me. The only thing I don't quite like is the comment - it does not describe this situation well enough for me.
By the way will we exit the BOOTSTRAPPING status ever if the full scan failed?
I think not. Maybe we could consider adding some Error state instead.
@ewienik wdyt?
I've doubt about this solution. We should trust NodeState as it it used also for calculating state of the full node. We should fix db_index to send correctly FullScanFinished instead of providing workaround. I'm doubt now that we should have full_scan_progress in db_index actor at all - we should have this information in NodeState to avoid confusion. As this is different issue I'll prepare a ticket for removing full_scan_progress from db_index. |
| // Use the same full-scan-progress check as the ANN endpoint: if the scan | ||
| // is still in progress, report BOOTSTRAPPING even if the state machine | ||
| // already transitioned to Serving. |
There was a problem hiding this comment.
I think we should trust NodeState. If it has wrong value we should fix db_index or node_state to correct this. As db_index knows the fullscan progress it should correctly send event to the node_state.
We should also have a validator test to check this behavior.
|
@nyh would you like to proceed with this? |
I forgot about it. I'll rebase and address reviews now. |
I'm cleaning up this patch so it could be merged, but if you think this is the wrong approach, then please do open a ticket (I don't know if you did, JIRA is crap, so please explicitly point a link here if you did) and I consider fixing that way. In any case, there is a bug here that needs to be fixed - the vector store right now has two ways to check if indexing finished, and they return different things in case the indexing failed completely because of a bug. By the way, it's important that if the indexing failed because of a bug, we must not report that it finished and allow making queries (which means the fix can't be in the opposite direction than what I did!). @QuerthDP wondered if maybe we should switched from a "BOOTSTRAPPING" state to some permanent "FAILED" state. Maybe, but maybe not: I would have liked (I don't know if it's based on reality, but one can hope) that if the vector store is restarted after patching it, maybe it will restart the failed indexing, and this time be able to finish the bootstrapping. So keeping it in BOOTSTRAPPING until it finally finishes is sort of what users expect, I think. |
The GET /api/v1/indexes/{keyspace}/{index}/status endpoint was reporting
SERVING based solely on the node state machine, which transitions to
Serving when FullScanFinished is emitted. However, the ANN query endpoint
independently checks db_index.full_scan_progress() and returns 503 if the
scan is still InProgress.
These two checks could disagree. We noticed in Alternator Vector Search
when indexing a table with numeric keys (instead of string), that a
separate bug in the vector store caused the indexing to fail, after
which the "status" API reported the index is SERVING - while the "ann"
API refused to proceed because it knew the indexing never finished.
These two APIs need to be consistent: If an "ann" will never succeed
because the indexing is permenently stuck in the middle (because of
a bug...), then the "status" API should also permanently report
BOOTSTRAPPING, not SERVING.
Fix by making get_index_status use the same full_scan_progress() check as
the ANN endpoint: report BOOTSTRAPPING whenever the scan is still InProgress,
regardless of the state machine.
Fixes VECTOR-595
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
x
|
Sent a new version with silly fixes to please "clippy". @swasik and others, please consider your repository's configuration which currently removes reviews (marks them "stale") after someone pushes a new version, even if it's just a style fix. This is frustrating - @QuerthDP already "approved" this PR and @ewienik "disapproved" it, but after I pushed the clippy-pleasing change, the approval was deleted but disapproval remains. This makes very little sense. According to AI, the relevant github configuration is "Dismiss stale pull request approvals when new commits are pushed." which you apparently set to on. |
Does not it make sense that the old approval is gone when new changes are introduced? |
I've setup this configuration manually. IMHO after a small change someone needs to review it before merging. So the best person can be the original reviewer. I think this is even more important in AI-assisted code. |
In my opinion, no, I think it does not make sense: The review process sometimes begins with big overhauls in the beginning, which may need everyone to re-review (but even that, I'm not certain). But the review process almost always ends with a string of nitpicks, silly clippy fixes, and so on. As your setup currently dictates, those silly changes at the end forces everyone that had already approved to approve again. Do you expect these reviewers to really do a full review at the end (probably not, or we'll end up in an endless loop)? Or do you expect them to just rubber-stamp their old review and mark "approved" again without reading? Isn't it better to just leave the old review (also making it clear it's an old review) then to force the reviewer to pretend (that that what it is...) that they reviewed again? In an ideal world, reviewers would have the energy to review, and re-review, and re-re-review after every tiny change. This actually works with copilot's reviews. It does NOT work with human reviewers. It works even worse when several reviewers are involved, and you "stale" on reviewer's decision because the developer addressed some other reviewer's request. Finally, as I noted, it's kind of strange that negative reviews stick even after changes, but positive reviews does not. Imagine that a certain PR has 10 approvals and one negative review asking for a tiny change. The developer fixes this tiny request, and now... You have a PR with zero approvals and one negative review. From something that had 10 approvals, now fixing one tiny nitpick the PR suddenly looks like a disaster that nobody approved and only had a negative review.
Do the reviewers agree with you and actually come to re-approve after every little change? I, as a human reviewer, will not agree to that. I will not re-review a PR that I reviewed earlier just because the developer fixed a few spelling errors and now he expects me to come in again, remember what this PR was all about, read it again and check the only thing that happened was fixing spelling errors. |
Checking if there are small formatting changes seems really quick - |
I review in github's UI. Does github's UI have this option?
It seems you're assuming that usually, there will be just one reviewer, basically the person who merges the code - either there are really no other reviewers, or if there are, their approvals will always be forgotten before the final merge and not taken into account in the final decision. I personally like to see when I merge some code that other people looked at it too and approved it. Even if I know the other reviewers' approval happened a few days ago before the last-moment changes that happened an hour ago, I assume that the old reviewers did look at the bulk of the changes. As a maintainer I also see if the developer completely rewrote his code and can ask for re-reviews (there is a "ask for re-review" button). |
The "Compare" button next to the force-push works fine for me in most cases. The problem begins when one rebases the PR, then the nitpicks and all the commits from master gets mixed and requires re-review of the whole patch. |
Currently we have one way to check index status - using NodeState. DbIndex sends events about the fullscan process state - it is one of the source of information about a current index status, NodeState compute index status using all events. Progress of fullscan is used for informative reasons in errors returned by HTTP API. If there is a mismatch between current fullscan state and index status then it is a bug. We need to provide test which repeat this behavior and fix the root cause.
I don't understand the root cause about this bug. I understand that there is mismatch in index status, but I don't understand when it is visible. Can we reproduce this? Can we prepare a test which will fail? |
For this reason I prepare |
If there are more reviewers I see their approvals in the timeline. It seems like a trade-off - see approvals in status or see the current status and be forced to do short re-review. |
As I explain in the commit message, this bug became visible because of another bug - the vector store had a bug reading numeric keys and failed the indexing. When this failed indexing happened, we had this bug: the "status" API began to return "SERVING", telling the Alternator user that index is ready, and yet when a ANN query was attempted, it failed telling the index wasn't ready. This isn't a "serious" bug - it won't happen if the vector store doesn't already have another bug that causes indexing to fail, and also it doesn't break any working request (the ANN had to fail anyway). But it is just confusing and misleading - one API tells you the index is ready, a different one tells you the index isn't ready. I don't know much about vector store internals, but it's clear to me that the source-of-truth here is whatever the ANN consults: If ANN thinks the index isn't available, then it isn't, and "status" should also say that. This is why I fixed "status" in my patch. |
The source-of-truth is NodeState. It seems that the problem is withing post_index_ann and not in get_index_status. Handler for HTTP index ann is wrong - it should depend on NodeState and not full scan progress from db_index. Handler for HTTP index status does it correctly - it checks NodeState. With the introduction of index routing it seems the issue is fixed - fullscan progress is retrieved only for information when index is not serving. The behavior could likely be tested with test/integration tests by simulating not finished fullscan and sending event that fullscan was finished. |
It sounds like you're telling me that the right behavior is that the ann function should allow queries to proceed if the indexing stopped in a failure. |
This is an issue, IMHO it is different one. Maybe we should extend IndexStatus with another state - Failed, or something else, to show that something bad happened to the index. But halting searches on such index is the real question. Building index could takes hours - if there is an error with a single vector - should we disable working on such index? If an error happened during CDC, not fullscan (it is possible use case - the customer wants to create an empty index, then add vectors using cdc - to be able to search during index building) - should we disallow searches on this index? If there is no memory on the instance and we are not able to add new indexes - it is an error of index building - should we halt searching on this index? There are trade-offs. |
fix index status endpoint to reflect actual scan progress
The GET /api/v1/indexes/{keyspace}/{index}/status endpoint was reporting
SERVING based solely on the node state machine, which transitions to
Serving when FullScanFinished is emitted. However, the ANN query endpoint
independently checks db_index.full_scan_progress() and returns 503 if the
scan is still InProgress.
These two checks could disagree. We noticed in Alternator Vector Search
when indexing a table with numeric keys (instead of string), that a
separate bug in the vector store caused the indexing to fail, after
which the "status" API reported the index is SERVING - while the "ann"
API refused to proceed because it knew the indexing never finished.
These two APIs need to be consistent: If an "ann" will never succeed
because the indexing is permenently stuck in the middle (because of
a bug...), then the "status" API should also permanently report
BOOTSTRAPPING, not SERVING.
Fix by making get_index_status use the same full_scan_progress() check as
the ANN endpoint: report BOOTSTRAPPING whenever the scan is still InProgress,
regardless of the state machine.
Fixes VECTOR-595