feat(llm): expose cached_tokens from Vertex AI context caching #538
Draft
breakanalysis wants to merge 1 commit into
…Usage and OTel traces

- Add `cached_tokens` field to `LLMUsage` for prompt tokens served from Vertex AI's implicit context cache (`cached_content_token_count`).
- Update `VertexAILLM._parse_content_response` to populate `cached_tokens` and, when non-zero, set the `llm.token_count.prompt_details.cache_read` span attribute on the currently active OpenTelemetry span.

The OTel attribute is set inside `_parse_content_response`, which is called while the openinference-instrumentation-vertexai span is still active, so the attribute appears on the correct LLM span in Cloud Trace.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Add `cached_tokens: Optional[int] = None` to `LLMUsage` for prompt tokens served from Vertex AI's implicit context cache (`cached_content_token_count` in `UsageMetadata`).
- Update `VertexAILLM._parse_content_response` to populate `cached_tokens` and set the `llm.token_count.prompt_details.cache_read` span attribute on the active OpenTelemetry span when cached tokens are present.

Context

Vertex AI's implicit caching transparently caches repeated context (system prompt, schema) and reports cached token counts in `UsageMetadata.cached_content_token_count`. The `openinference-instrumentation-vertexai` library does not currently expose this count. Setting the attribute in `_parse_content_response` works because that method runs while the openinference instrumentation span is still active.

Test plan

- Run `pytest tests/unit/llm/test_vertexai_llm.py` to verify existing tests pass.
- Verify `cached_tokens` appears in `LLMResponse.usage` for a real VertexAI call with context caching enabled.

🤖 Generated with Claude Code
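
For reviewers, the mechanism described in the summary can be sketched roughly as follows. This is not the PR's diff: `parse_usage`, the `span` parameter, and every `LLMUsage` field other than `cached_tokens` are hypothetical names chosen for illustration, and the optional import is only there so the sketch runs without `opentelemetry-api` installed. Whether the real code stores `0` or `None` when nothing was cached is also an assumption here.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional

try:  # opentelemetry-api is treated as optional in this sketch
    from opentelemetry import trace
except ImportError:
    trace = None

# Attribute name taken from the PR description.
CACHE_READ_ATTR = "llm.token_count.prompt_details.cache_read"


@dataclass
class LLMUsage:
    # Hypothetical shape: only `cached_tokens` is confirmed by the PR text.
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cached_tokens: Optional[int] = None


def parse_usage(usage_metadata: Any, span: Any = None) -> LLMUsage:
    """Hypothetical helper: copy token counts and tag the active span."""
    cached = getattr(usage_metadata, "cached_content_token_count", None)
    usage = LLMUsage(
        prompt_tokens=getattr(usage_metadata, "prompt_token_count", 0) or 0,
        completion_tokens=getattr(usage_metadata, "candidates_token_count", 0) or 0,
        cached_tokens=cached,  # assumption: the raw value is stored as-is
    )
    if cached:  # only tag the span when the count is non-zero
        if span is None and trace is not None:
            # Returns a non-recording span when no span is active,
            # in which case set_attribute is a no-op.
            span = trace.get_current_span()
        if span is not None:
            span.set_attribute(CACHE_READ_ATTR, cached)
    return usage


class FakeSpan:
    """Test double that records attributes instead of exporting them."""

    def __init__(self) -> None:
        self.attrs: dict = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attrs[key] = value


if __name__ == "__main__":
    span = FakeSpan()
    metadata = SimpleNamespace(
        prompt_token_count=100,
        candidates_token_count=20,
        cached_content_token_count=64,
    )
    print(parse_usage(metadata, span=span), span.attrs)
```

Passing the span explicitly keeps the sketch testable without a tracer; per the description, the PR itself sets the attribute on the currently active span from inside `_parse_content_response`.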