
feat: Instrument pre-completion chat availability#173

Open
randy-concepcion wants to merge 8 commits into main from chat-availability-sli

Conversation

@randy-concepcion


What's New

Adds pre-completion coverage to mlpa_chat_availability_total, so auth rejections and user-provisioning failures show up in the chat availability signal.

A chat request terminates at one of four points and records exactly one outcome:

  1. Auth (authorize_chat_request) - all excluded:
    • invalid_service_type_for_model
    • auth_rejected
    • invalid_auth_request
  2. User provisioning (get_or_create_user_for_completion):
    • signup_cap_exceeded (excluded)
    • provisioning_failure (failure, 5xx only)
  3. Blocked check: blocked (excluded)
  4. Model call (get_completion / stream_completion): already instrumented
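
As a rough illustration of the "exactly one outcome per request" rule above, the sketch below walks the stages in order and records at the first one that terminates the request. The `record` helper, the `handle_chat` function, and the in-memory `Counter` are stand-ins invented for this example; the real code records into the `mlpa_chat_availability_total` Prometheus counter.

```python
from collections import Counter
from typing import Callable, Optional

# In-memory stand-in for mlpa_chat_availability_total (assumption).
availability: Counter = Counter()

# A stage returns None to let the request continue,
# or an (outcome, reason) pair when the request terminates there.
StageResult = Optional[tuple[str, str]]


def record(outcome: str, reason: str) -> None:
    """Increment the counter for one terminated request."""
    availability[(outcome, reason)] += 1


def handle_chat(stages: list[Callable[[], StageResult]]) -> tuple[str, str]:
    """Run stages in order and record exactly once, at the terminating stage."""
    for stage in stages:
        result = stage()
        if result is not None:
            record(*result)
            return result
    # The final (model call) stage is expected to always produce an outcome.
    raise RuntimeError("no stage produced an outcome")
```

Because the function returns as soon as one stage reports an outcome, later stages never get a chance to record a second one for the same request.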

Notes

  • The auth wrapper catches only HTTPException and records only 401 and 400. Everything else, including App Attest's explicit 500, is re-raised without recording.
  • Shared-call 400s use invalid_auth_request rather than invalid_purpose because malformed App Attest base64 reaches the same path as purpose validation.
  • auth_system_failure is defined but not emitted. See the limitation below.
  • Reference: Availability RFC and SLI Inventory.
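
The first note can be pictured with a small wrapper. This is only a sketch under assumptions: `HTTPException` here is a minimal stand-in for the framework's class, `call_with_availability` is a hypothetical name, and the status-to-reason mapping (401 to `auth_rejected`, 400 to `invalid_auth_request`) is inferred from the description rather than copied from the diff.

```python
from collections import Counter


class HTTPException(Exception):
    """Minimal stand-in for the web framework's HTTPException."""

    def __init__(self, status_code: int, detail: str = "") -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


# Assumed mapping; only these two statuses are recorded.
EXCLUDED_REASON_BY_STATUS = {401: "auth_rejected", 400: "invalid_auth_request"}

recorded: Counter = Counter()  # stand-in for the Prometheus counter


def call_with_availability(authorize):
    """Run an auth callable, record 401/400 as excluded, and always re-raise."""
    try:
        return authorize()
    except HTTPException as exc:
        reason = EXCLUDED_REASON_BY_STATUS.get(exc.status_code)
        if reason is not None:
            recorded[("excluded", reason)] += 1
        raise  # a 500 or any other status propagates without being recorded
```

Exceptions that are not `HTTPException` are never caught, so they also pass through unrecorded.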

Known Limitation

This is an interim measurement. For FxA and Play Integrity, the same 401 response is returned whether the request was correctly rejected (such as an expired token) or the authentication system itself had a problem (such as an outage). Since they can't be distinguished, all of these are counted as expected rejections and left out of the ratio for now.

This means that real authentication-system failures are not yet counted as failures. This is a known gap, already noted in the Availability RFC and the SLI Inventory, and it will be addressed once those failures can be identified separately.

The availability ratio is success / (success + failure); excluded and abort outcomes are held out of both the numerator and the denominator.
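
For a concrete feel of the ratio, here is a small sketch that computes it from per-outcome counts. The function name and the choice to return `None` for an empty window are assumptions made for the example, not part of the PR.

```python
from typing import Mapping, Optional


def availability_ratio(counts: Mapping[str, int]) -> Optional[float]:
    """Return success / (success + failure), ignoring excluded and abort."""
    success = counts.get("success", 0)
    failure = counts.get("failure", 0)
    total = success + failure
    if total == 0:
        return None  # no countable requests in the window
    return success / total
```

With 99 successes, 1 failure, 50 excluded, and 3 aborts, only the first two counts enter the calculation.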

@randy-concepcion

Author

/build-pr

@randy-concepcion

Author

Manual verification done in dev, in addition to the unit tests.

Verified manually

  • The three auth-stage excluded reasons, each triggered with a crafted request and confirmed to record with the right labels and outcome="excluded"
    • auth_rejected (valid service-type, invalid auth)
    • invalid_service_type_for_model (valid model with a service-type it does not allow)
    • invalid_auth_request (invalid purpose)
  • Real dev traffic exercised success and upstream_error, so the availability ratio is live, currently around 99% with dips that line up with upstream_error clusters
  • Confirmed the counter is scraped into Prometheus and queryable

Not verified manually
provisioning_failure, signup_cap_exceeded, and blocked need specific user or system state, so they were not triggered in dev. They are covered by the unit tests. auth_system_failure is defined but not emitted, which is the known auth limitation in the description.

Dashboard
Test dashboard built in Yardstick to validate the metric and compare against the existing success-rate panels.
[TEST] MLPA Availability

image

@randy-concepcion randy-concepcion force-pushed the chat-availability-sli branch 2 times, most recently from 024ccc6 to 2d6922c June 13, 2026 05:10
@randy-concepcion randy-concepcion requested review from a team, noahpodgurski and subpath and removed request for a team June 13, 2026 05:12
@randy-concepcion randy-concepcion marked this pull request as ready for review June 13, 2026 14:25
@randy-concepcion randy-concepcion requested a review from a team as a code owner June 13, 2026 14:25
@randy-concepcion randy-concepcion removed the request for review from a team June 13, 2026 14:26
@randy-concepcion

Author

I've also added the same panels to our main Grafana dashboard so once these changes are deployed, we can start seeing them in Prod.

MLPA Availability Panels (Prod) - No data available until deployed.
image

@noahpodgurski noahpodgurski left a comment

Collaborator

Looks really good 🙌 just a few comments :)

Comment thread src/mlpa/core/prometheus_metrics.py Outdated
Comment thread src/mlpa/core/prometheus_metrics.py
Comment thread src/mlpa/core/completions.py Outdated
Comment thread src/mlpa/core/completions.py
Comment thread src/mlpa/core/completions.py Outdated
Comment thread src/mlpa/core/auth/authorize.py Outdated
AvailabilityReason.INVALID_SERVICE_TYPE_FOR_MODEL,
model=chat_request.model,
service_type=service_type.value,
purpose="",

Collaborator

question: Why not pass the chat_request.purpose here?

Author

Good question. I found that the chat_request object doesn't carry the purpose field since it's a ChatRequest, but I saw it on the AuthorizedChatRequest object (a subclass of ChatRequest).

I figured we should leave purpose set to "" as a placeholder until the request has been validated. I can add a comment here to make this more clear.

Collaborator

That function has a purpose header parameter we could use. Or are you saying that since it hasn't been officially resolved with _resolve_purpose, we shouldn't pass it because the value could be anything?

I guess I'm fine with either :)

Author

Ahh, yeah, you're right about the purpose parameter that was passed in, but your second point was exactly what I was thinking: "the value could be anything". So if we used the unresolved purpose, we'd be labeling the metric with a value that hasn't been validated yet, so it could be something arbitrary.

Collaborator

Hmm, actually, now that I think about it more, I think it makes more sense to always pass it. This data could be helpful in making correlations later on. Also, in this particular case, we're already passing the invalid service type, so if we're already passing invalid values, then I think purpose should also be included. We can always add whitelist filtering down the road. WDYT?

Author

Gotcha. Yeah, those are fair points, the correlation value is worth having. Let me update it to pass purpose in. 👍
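
The filtering floated earlier in this thread could be as small as mapping unrecognized values to a fixed bucket, which keeps the label's cardinality bounded even though the raw header is unvalidated. This is a sketch only: the set of known purposes, the `"other"` bucket, and the `purpose_label` name are all invented for illustration and do not come from the PR.

```python
from typing import Optional

# Hypothetical purpose values; the real set would come from the service config.
KNOWN_PURPOSES = frozenset({"chat", "title-generation"})


def purpose_label(raw: Optional[str]) -> str:
    """Map an unvalidated purpose header to a bounded metric label value."""
    if raw is None or not raw.strip():
        return ""  # same empty placeholder used when no purpose is present
    candidate = raw.strip().lower()
    if candidate in KNOWN_PURPOSES:
        return candidate
    return "other"  # arbitrary client input collapses into one bucket
```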

Comment thread src/mlpa/core/auth/authorize.py
Comment thread src/mlpa/core/errors.py
PrometheusRejectionReason, AvailabilityReason
] = {
PrometheusRejectionReason.BUDGET_EXCEEDED: AvailabilityReason.BUDGET_EXCEEDED,
PrometheusRejectionReason.PAYLOAD_TOO_LARGE: AvailabilityReason.PAYLOAD_TOO_LARGE,

Collaborator

suggestion: We should also add this tracking to our middleware functions. For instance, if request_size.check_request_size_middleware returns a 413, record_chat_availability should also fire with PAYLOAD_TOO_LARGE.

Author

Ooo, I did not think of this. Great catch! I'll capture the PAYLOAD_TOO_LARGE for that check. One thing I noticed was that the middleware runs before the body is parsed and before auth, which means the model, service_type, and purpose aren't available yet. I'll set the "" placeholders for those and pass the reason and outcome through.

Collaborator

Actually all of those values are available under request.headers and request.body. I think it'd be a good idea to pass those in - or were you thinking to explicitly exclude them since the request hasn't yet been verified?

Author

You're right! I think one thing we may want to avoid is reading in the whole body to get the model though since we're rejecting with the content-length check. Having service_type and purpose could still be useful info to have, so we can still pass those through.
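
Roughly, the approach agreed on here (reject on the declared length, label from headers, never read the body) might look like the sketch below. The header names `service-type` and `purpose`, the size limit, the label value `payload_too_large`, and the function name are assumptions for illustration; the headers mapping is assumed to use lowercase keys.

```python
from typing import Mapping, Optional

MAX_BODY_BYTES = 1_000_000  # hypothetical limit


def oversized_request_labels(
    headers: Mapping[str, str], max_bytes: int = MAX_BODY_BYTES
) -> Optional[dict[str, str]]:
    """Return availability labels if Content-Length exceeds the limit, else None."""
    try:
        length = int(headers.get("content-length", "0"))
    except ValueError:
        return None  # malformed header: left to the real middleware to handle
    if length <= max_bytes:
        return None
    return {
        "reason": "payload_too_large",
        "model": "",  # would require reading the body, so it stays empty
        "service_type": headers.get("service-type", ""),
        "purpose": headers.get("purpose", ""),
    }
```

Only headers are consulted, so an oversized request is rejected and labeled without buffering its body.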

@randy-concepcion randy-concepcion left a comment

Author

Thanks for the review @noahpodgurski! I replied to each comment inline. I'll make some of these changes now and adjust according to your responses.

@randy-concepcion

Author

@noahpodgurski Thanks again for the feedback! I've pushed changes, and this is ready for another round of review when you have a chance! 🙏

@randy-concepcion

Author

/build-pr

@randy-concepcion

randy-concepcion commented Jun 17, 2026

Author

Manually verified on dev. Grafana link for 2026-06-17 time window.

Prometheus: Metrics Explorer
image

MLPA Availability in Grafana (includes failures)
image
