fix: surface eval failures instead of silently terminating or crashing#398
Open
IsmailMehdi wants to merge 5 commits into
Open
fix: surface eval failures instead of silently terminating or crashing#398IsmailMehdi wants to merge 5 commits into
IsmailMehdi wants to merge 5 commits into
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Three issues found in prod logs (last 24h):
- SimulatedUser silently returned "TERMINATE" on every turn when
simulated_user_model_config was missing or failed to load, killing
~115 multi-turn scenarios with no visible cause. Now raises in
__init__ so misconfiguration is caught at scenario start.
- _process_results used `assert not results_df.empty`, which bubbled
out of the Eval RPC as an unstructured INTERNAL error and triggered
a client retry storm (~10s cadence) on a config-level problem. Now
raises EmptyEvalResultError, translated to FAILED_PRECONDITION with
a message pointing at dataset/dialect/database mismatch.
- DB-queue acquire failures logged the bare exception, which is empty
for queue.Empty timeouts ("...': "). Now falls back to the exception
class name so the operator can see what failed.
5b8982f to
4a38025
Compare
…ecommit The bird-interact-lite directory is populated at runtime by datasets/bird-interact/download_dataset.sh, which clones https://huggingface.co/datasets/birdsql/bird-interact-lite into it. Without this ignore, a subsequent 'git add' would re-record the embedded .git/ as a gitlink with no .gitmodules entry — which is exactly the broken-submodule state this PR is removing in the first place.
Collaborator
Author
|
/gcbrun |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three issues surfaced in the last 24h of prod logs on
evalbench-directpath-cluster:SimulatedUsersilently terminated every multi-turn scenario (~115 occurrences) whensimulated_user_model_configwas missing or failed to load. The constructor warned but leftself.model_generator = None;get_next_responsethen returned"TERMINATE"on every call, so scenarios appeared to "complete" on turn 1 with no visible cause. Now__init__raises so the misconfiguration surfaces at scenario start._process_resultsused a bareassert(~13 occurrences) when no eval rows were produced. TheAssertionErrorbubbled out of theEvalRPC as an unstructuredINTERNALand a client was retrying at ~10s cadence on what is actually a config-level problem. Replaced with a typedEmptyEvalResultErrorcaught in the RPC handler and translated toFAILED_PRECONDITION, with a message pointing at dataset/dialect/database mismatch.DB-queue acquire failures logged the bare exception (~4 occurrences), which is empty for
queue.Emptytimeouts — the operator saw... 'foo':with nothing after the colon. Falls back totype(e).__name__whenstr(e)is empty.Test plan
simulated_user_model_configset — should behave as beforesimulated_user_model_configomitted — should raise at scenario start with a clear message instead of silently terminatingFAILED_PRECONDITIONwith the new message instead ofINTERNAL... 'db': Emptyrather than... 'db':