fix: surface eval failures instead of silently terminating or crashing #398

Open
IsmailMehdi wants to merge 5 commits into main from fix/prod-log-triage-eval-failures

IsmailMehdi (Collaborator) commented May 20, 2026

Summary

Three issues surfaced in the last 24h of prod logs on evalbench-directpath-cluster:

  • SimulatedUser silently terminated every multi-turn scenario (~115 occurrences) when simulated_user_model_config was missing or failed to load. The constructor warned but left self.model_generator = None; get_next_response then returned "TERMINATE" on every call, so scenarios appeared to "complete" on turn 1 with no visible cause. Now __init__ raises so the misconfiguration surfaces at scenario start.

  • _process_results used a bare assert (~13 occurrences) when no eval rows were produced. The AssertionError bubbled out of the Eval RPC as an unstructured INTERNAL error, and a client kept retrying at a ~10s cadence on what is actually a config-level problem. The assert is replaced with a typed EmptyEvalResultError, which the RPC handler catches and translates to FAILED_PRECONDITION with a message pointing at a dataset/dialect/database mismatch.

  • DB-queue acquire failures logged the bare exception (~4 occurrences), whose string form is empty for queue.Empty timeouts, so the operator saw ... 'foo': with nothing after the colon. The log line now falls back to type(e).__name__ when str(e) is empty.
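
For reviewers who want the shape of the first fix without opening the diff, here is a minimal sketch of a fail-fast constructor. It is not the code in this PR: SimulatedUserConfigError and the load_generator hook are hypothetical names, and the PR text does not say which exception type __init__ actually raises.

```python
from typing import Callable


class SimulatedUserConfigError(ValueError):
    """Hypothetical error type; the real exception class may differ."""


class SimulatedUser:
    def __init__(self, config: dict, load_generator: Callable[[dict], object]):
        # `load_generator` is a hypothetical hook standing in for whatever
        # builds the model generator from simulated_user_model_config.
        model_config = config.get("simulated_user_model_config")
        if not model_config:
            # Fail at scenario start instead of leaving model_generator = None.
            raise SimulatedUserConfigError(
                "simulated_user_model_config is required for multi-turn scenarios"
            )
        try:
            self.model_generator = load_generator(model_config)
        except Exception as e:
            # Keep the original cause attached and never emit an empty reason.
            reason = str(e) or type(e).__name__
            raise SimulatedUserConfigError(
                f"failed to load simulated_user_model_config: {reason}"
            ) from e
```

Because the constructor either yields a usable model_generator or raises, get_next_response no longer needs a None branch that returns "TERMINATE".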

Test plan

  • Run an existing multi-turn eval with simulated_user_model_config set — should behave as before
  • Run a multi-turn eval with simulated_user_model_config omitted — should raise at scenario start with a clear message instead of silently terminating
  • Trigger the empty-results path (e.g. dialect filter that matches nothing) — RPC should return FAILED_PRECONDITION with the new message instead of INTERNAL
  • Force a DB connection acquire timeout — log line should read ... 'db': Empty rather than ... 'db':
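
The last check relies on queue.Empty having an empty string form. A small sketch of the fallback follows; describe_exception, try_acquire, the logger name, and the message wording are illustrative assumptions, not the names used in the repository.

```python
import logging
import queue

logger = logging.getLogger("db_queue_sketch")  # hypothetical logger name


def describe_exception(e: BaseException) -> str:
    """Return str(e), or the exception class name when str(e) is empty."""
    return str(e) or type(e).__name__


def try_acquire(pool: queue.Queue, db_name: str, timeout: float = 0.01):
    """Return a connection from the pool, or None after logging a timeout."""
    try:
        return pool.get(timeout=timeout)
    except queue.Empty as e:
        # str(queue.Empty()) is "", so the fallback supplies the class name.
        logger.error(
            "Failed to acquire connection for '%s': %s",
            db_name,
            describe_exception(e),
        )
        return None
```

With an empty pool, the logged text ends in 'db': Empty instead of stopping at the colon.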

google-cla Bot commented May 20, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Three issues found in prod logs (last 24h):

- SimulatedUser silently returned "TERMINATE" on every turn when
  simulated_user_model_config was missing or failed to load, killing
  ~115 multi-turn scenarios with no visible cause. Now raises in
  __init__ so misconfiguration is caught at scenario start.

- _process_results used `assert not results_df.empty`, which bubbled
  out of the Eval RPC as an unstructured INTERNAL error and triggered
  a client retry storm (~10s cadence) on a config-level problem. Now
  raises EmptyEvalResultError, translated to FAILED_PRECONDITION with
  a message pointing at dataset/dialect/database mismatch.

- DB-queue acquire failures logged the bare exception, which is empty
  for queue.Empty timeouts ("...': "). Now falls back to the exception
  class name so the operator can see what failed.
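
The second item above (typed error, translated at the RPC boundary) could look roughly like the sketch below. It makes several assumptions: StatusCode is a local stand-in for grpc.StatusCode so the example runs without grpcio, a plain list replaces the results DataFrame, and eval_rpc is a hypothetical handler that returns a (status, details) pair rather than a real response.

```python
import enum


class StatusCode(enum.Enum):
    # Local stand-in for grpc.StatusCode so the sketch runs without grpcio.
    OK = "OK"
    FAILED_PRECONDITION = "FAILED_PRECONDITION"
    INTERNAL = "INTERNAL"


class EmptyEvalResultError(Exception):
    """Raised when an eval run produces no result rows."""


def process_results(rows: list) -> list:
    # The PR checks a DataFrame (results_df.empty); a list stands in here.
    if not rows:
        raise EmptyEvalResultError(
            "Eval produced no result rows; check that the dataset, dialect, "
            "and database settings match."
        )
    return rows


def eval_rpc(rows: list) -> tuple:
    """Hypothetical handler returning (status, details)."""
    try:
        processed = process_results(rows)
    except EmptyEvalResultError as e:
        # In a real gRPC servicer this would be roughly:
        #   context.abort(grpc.StatusCode.FAILED_PRECONDITION, str(e))
        return StatusCode.FAILED_PRECONDITION, str(e)
    return StatusCode.OK, f"{len(processed)} rows"
```

Unlike a bare assert, the typed exception survives python -O and gives the handler something specific to catch. FAILED_PRECONDITION also tells a well-behaved client that retrying will not help until the configuration changes.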
IsmailMehdi force-pushed the fix/prod-log-triage-eval-failures branch from 5b8982f to 4a38025 on May 20, 2026 00:17
prernakakkar-google and others added 2 commits May 19, 2026 17:27
…ecommit

The bird-interact-lite directory is populated at runtime by
datasets/bird-interact/download_dataset.sh, which clones
https://huggingface.co/datasets/birdsql/bird-interact-lite into it.

Without this ignore, a subsequent 'git add' would re-record the embedded
.git/ as a gitlink with no .gitmodules entry — which is exactly the
broken-submodule state this PR is removing in the first place.
IsmailMehdi (Collaborator, Author) commented

/gcbrun
