Skip to content

rubric_based_final_response_quality_v1 returns NOT_EVALUATED despite actual_invocation.final_response text #5732

@txnguyen292

Description

@txnguyen292

Summary

rubric_based_final_response_quality_v1 sometimes returns NOT_EVALUATED (eval_status: 3) with score: null and details.rubric_scores: null, even though the ADK eval result contains a non-empty actual_invocation.final_response.parts[0].text.

In the same invocation, rubric_based_tool_use_quality_v1 evaluates normally and returns rubric scores/rationales, so the eval run itself and LLM-as-judge path are partially working.

Environment

  • google-adk: 1.32.0
  • litellm: 1.82.6
  • google-genai: 1.73.1
  • Python: 3.13.5
  • Eval command shape:
adk eval src tests/eval/evalsets/job_scraper_core.json:itviec_immediate_goal_before_producer_scripting \
  --config_file_path tests/eval/eval_config_goal_contract.json \
  --log_level ERROR

Observed result excerpt

The eval case final status was passed:

{
  "eval_set_result_id": "src_job_scraper_core_1778989037.305544",
  "final_eval_status": 1
}

The actual invocation has a final response:

{
  "actual_invocation": {
    "final_response": {
      "role": "model",
      "parts": [
        {
          "text": "Extracted job count: 20\n\nArtifacts:\n- page workspace fixture: `page_2554f031654b` / `pages__page_2554f031654b__page.html`\n- sandbox audit: `sandbox_run_20260517_033632_dfda78d0`\n\nWhat happened:\n- Loaded the fixed ITviec AI Engineer Hanoi fixture.\n- Established the repeated job-card boundary with a bounded probe.\n- Confirmed 20 repeated listing cards and canonical job URL data in the fixture.\n- I have not yet written the required output files or finalized persistence.\n\nBlocker:\n- Extraction output generation is still pending."
        }
      ]
    }
  }
}

The tool-use rubric evaluated successfully:

{
  "metric_name": "rubric_based_tool_use_quality_v1",
  "score": 1.0,
  "eval_status": 1,
  "details": {
    "rubric_scores": [
      {
        "rubric_id": "uses_fixture_artifact",
        "score": 1.0,
        "rationale": "... Trace anchors: invocation_events[5], invocation_events[13], final_response."
      }
    ]
  }
}

But the final-response rubric did not evaluate:

{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": null,
  "eval_status": 3,
  "details": {
    "rubric_scores": null
  }
}

Expected behavior

If actual_invocation.final_response contains text, rubric_based_final_response_quality_v1 should either:

  1. evaluate and return per-rubric scores/rationales, or
  2. include diagnostic detail explaining why it was not evaluated.

Actual behavior

The metric returns NOT_EVALUATED with no explanation in details, making it difficult to distinguish between:

  • final response extraction issue,
  • judge invocation/parsing issue,
  • unsupported invocation shape,
  • timeout/provider issue,
  • or intentional evaluator skip.

Why this matters

For agent workflow evals, the final response quality metric is useful as an LLM-as-judge layer, but NOT_EVALUATED without diagnostic detail makes it hard to use as a reliable CI/development signal. The trace contains enough final response text for a judge pass, so the skip is surprising.

Request

Could ADK either:

  • make rubric_based_final_response_quality_v1 evaluate whenever actual_invocation.final_response has text, or
  • add diagnostic detail to the eval result explaining why the final-response judge returned NOT_EVALUATED?

A debug field such as not_evaluated_reason or judge parsing/provider error details would be very helpful.

Metadata

Metadata

Assignees

Labels

eval[Component] This issue is related to evaluation

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions