feat: Instrument pre-completion chat availability by randy-concepcion · Pull Request #173 · Firefox-AI/MLPA

randy-concepcion · 2026-06-12T06:12:16Z

What's New

Adds pre-completion coverage to mlpa_chat_availability_total, so auth rejections and user-provisioning failures show up in the chat availability signal.

A chat request terminates at one of four points and records exactly one outcome:

1. Auth (authorize_chat_request) - all excluded:
   - invalid_service_type_for_model
   - auth_rejected
   - invalid_auth_request
2. User provisioning (get_or_create_user_for_completion):
   - signup_cap_exceeded (excluded)
   - provisioning_failure (failure, 5xx only)
3. Blocked check: blocked (excluded)
4. Model call (get_completion / stream_completion): already instrumented
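
The stage list above amounts to a lookup from reason to outcome. A minimal sketch of that lookup, assuming the reason strings are exactly as written in this description (a later commit in this PR renames some reasons). The names `PRE_COMPLETION_OUTCOMES` and `outcome_for` are hypothetical and are not taken from the MLPA code:

```python
# Hypothetical lookup: the reason strings come from the PR description,
# while PRE_COMPLETION_OUTCOMES and outcome_for are illustrative names only.
PRE_COMPLETION_OUTCOMES = {
    # Auth stage (authorize_chat_request): all excluded
    "invalid_service_type_for_model": "excluded",
    "auth_rejected": "excluded",
    "invalid_auth_request": "excluded",
    # User provisioning stage (get_or_create_user_for_completion)
    "signup_cap_exceeded": "excluded",
    "provisioning_failure": "failure",  # described as recorded for 5xx only
    # Blocked check
    "blocked": "excluded",
}


def outcome_for(reason: str) -> str:
    """Return the outcome label for a pre-completion reason."""
    return PRE_COMPLETION_OUTCOMES[reason]
```

Only `provisioning_failure` lands in the ratio's denominator; every other pre-completion reason is held out.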

Notes

The auth wrapper catches only HTTPException and records only 401 and 400. Everything else, including App Attest's explicit 500, is re-raised without recording.
Shared-call 400s use invalid_auth_request rather than invalid_purpose because malformed App Attest base64 reaches the same path as purpose validation.
auth_system_failure is defined but not emitted. See the limitation below.
Reference: Availability RFC and SLI Inventory.
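
The wrapper behavior in the first note can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `HTTPException` is a local stand-in for the framework class, `authorize_with_availability` and its `record` callback are hypothetical names, and the 401 to `auth_rejected` pairing is inferred from the description rather than quoted from the diff.

```python
from typing import Callable, Optional


class HTTPException(Exception):
    """Local stand-in for the web framework's HTTPException (assumption)."""

    def __init__(self, status_code: int, detail: str = "") -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


# Only these two statuses are recorded, and both count as "excluded".
STATUS_TO_REASON = {
    401: "auth_rejected",
    400: "invalid_auth_request",
}


def authorize_with_availability(
    authorize: Callable[[], object],
    record: Callable[[str, str], None],
) -> object:
    """Run authorize(); record 401/400 rejections and always re-raise."""
    try:
        return authorize()
    except HTTPException as exc:
        reason: Optional[str] = STATUS_TO_REASON.get(exc.status_code)
        if reason is not None:
            record(reason, "excluded")
        # Any other status (for example an explicit 500) is not recorded.
        raise
```

Exceptions that are not `HTTPException` are never caught here, so they propagate without touching the metric.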

Known Limitation

This is an interim measurement. For FxA and Play Integrity, the same 401 response is returned whether the request was correctly rejected (such as an expired token) or the authentication system itself had a problem (such as an outage). Since they can't be distinguished, all of these are counted as expected rejections and left out of the ratio for now.

This means that real authentication-system failures are not yet counted as failures. This is a known gap, already noted in the Availability RFC and the SLI Inventory, and it will be addressed once those failures can be identified separately.

The availability ratio is success / (success + failure), and excluded and abort are held out.
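
As a worked illustration of that formula, here is a small sketch that computes the ratio from per-outcome counts. The function name `availability_ratio` and the choice to return `None` when there are no eligible requests are assumptions made for this example:

```python
from typing import Mapping, Optional


def availability_ratio(counts: Mapping[str, int]) -> Optional[float]:
    """Compute success / (success + failure); other outcomes are ignored."""
    success = counts.get("success", 0)
    failure = counts.get("failure", 0)
    denominator = success + failure
    if denominator == 0:
        return None  # no eligible requests, so the ratio is undefined
    return success / denominator
```

For example, `availability_ratio({"success": 99, "failure": 1, "excluded": 40, "abort": 3})` uses only the 100 eligible requests, so the 43 excluded and aborted requests do not move the result.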

randy-concepcion · 2026-06-12T16:55:02Z

/build-pr

randy-concepcion · 2026-06-13T04:01:07Z

Manual verification done in dev, in addition to the unit tests.

Verified manually

The three auth-stage excluded reasons were triggered with crafted requests and confirmed to record with the right labels and outcome="excluded":
- auth_rejected (valid service-type, invalid auth)
- invalid_service_type_for_model (valid model with a service-type it does not allow)
- invalid_auth_request (invalid purpose)
Real dev traffic exercised success and upstream_error, so the availability ratio is live, currently around 99% with dips that line up with upstream_error clusters
Confirmed the counter is scraped into Prometheus and queryable

Not verified manually
provisioning_failure, signup_cap_exceeded, and blocked need specific user or system state, so they were not triggered in dev. They are covered by the unit tests. auth_system_failure is defined but not emitted, which is the known auth limitation in the description.

Dashboard
Test dashboard built in Yardstick to validate the metric and compare against the existing success-rate panels.
[TEST] MLPA Availability

randy-concepcion · 2026-06-15T03:45:06Z

I've also added the same panels to our main Grafana dashboard so once these changes are deployed, we can start seeing them in Prod.

MLPA Availability Panels (Prod) - No data available until deployed.

noahpodgurski

Looks really good 🙌 just a few comments :)

noahpodgurski · 2026-06-15T16:58:58Z

+            AvailabilityReason.INVALID_SERVICE_TYPE_FOR_MODEL,
+            model=chat_request.model,
+            service_type=service_type.value,
+            purpose="",


question: Why not pass the chat_request.purpose here?

Good question. I found the chat_request object didn't carry the purpose field since it's a ChatRequest, but saw it in the AuthorizedChatRequest object (sub-class of ChatRequest).

I figured we should leave purpose set to "" as a placeholder until the request has been validated. I can add a comment here to make this more clear.

That function has a purpose header parameter we could use. Or are you saying that, since it hasn't been officially resolved with _resolve_purpose, we shouldn't pass it because the value could be anything?

I guess I'm fine with either :)

Ahh, yeah, you're right about the purpose parameter that was passed in, but your second point was exactly what I was thinking: "the value could be anything". So if we used the unresolved purpose, we'd be labeling the metric with a value that hasn't been validated yet, so it could be something arbitrary.

Hmm, actually, now that I think about it more, I think it makes more sense to always pass it. This data could be helpful in making correlations later on. Also, in this particular case, we're already passing the invalid service type, so if we're already passing invalid values then I think purpose should also be included. We can always add allowlist filtering down the road. WDYT?

Gotcha. Yeah, those are fair points, the correlation value is worth having. Let me update it to pass purpose in. 👍

noahpodgurski · 2026-06-15T17:18:55Z

+    PrometheusRejectionReason, AvailabilityReason
+] = {
+    PrometheusRejectionReason.BUDGET_EXCEEDED: AvailabilityReason.BUDGET_EXCEEDED,
+    PrometheusRejectionReason.PAYLOAD_TOO_LARGE: AvailabilityReason.PAYLOAD_TOO_LARGE,


suggestion: We should also add this tracking to our middleware functions. For instance, if request_size.check_request_size_middleware returns 413, record_chat_availability with PAYLOAD_TOO_LARGE should also fire.

Ooo, I did not think of this. Great catch! I'll capture the PAYLOAD_TOO_LARGE for that check. One thing I noticed was that the middleware runs before the body is parsed and before auth, which means the model, service_type, and purpose aren't available yet. I'll set the "" placeholders for those and pass the reason and outcome through.

Actually all of those values are available under request.headers and request.body. I think it'd be a good idea to pass those in - or were you thinking to explicitly exclude them since the request hasn't yet been verified?

You're right! I think one thing we may want to avoid is reading in the whole body to get the model though since we're rejecting with the content-length check. Having service_type and purpose could still be useful info to have, so we can still pass those through.
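
The outcome of this thread (reject on the declared size, label with header values, never read the body) can be sketched as follows. The header names, the size limit, and the function name are all hypothetical placeholders, and the real middleware may differ:

```python
from typing import Mapping, Optional, Tuple

# Hypothetical header names and size limit, for illustration only.
SERVICE_TYPE_HEADER = "service-type"
PURPOSE_HEADER = "purpose"
MAX_BODY_BYTES = 1_000_000


def oversized_request_labels(
    headers: Mapping[str, str],
) -> Optional[Tuple[str, str, str]]:
    """Return (model, service_type, purpose) labels for an oversized request.

    Only headers are inspected, so the body is never read and model stays "".
    Returns None when the declared size is within the limit.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        declared = int(lowered.get("content-length", "0"))
    except ValueError:
        declared = 0  # malformed header: leave it to later validation
    if declared <= MAX_BODY_BYTES:
        return None
    service_type = lowered.get(SERVICE_TYPE_HEADER, "")
    purpose = lowered.get(PURPOSE_HEADER, "")
    return ("", service_type, purpose)
```

A caller would record the PAYLOAD_TOO_LARGE reason with these labels before returning the 413. The header values are unvalidated at this point, which is the trade-off accepted in the purpose discussion above.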

randy-concepcion

Thanks for the review @noahpodgurski! I replied to each comment inline. I'll make some of these changes now and adjust according to your responses.

randy-concepcion · 2026-06-17T19:55:57Z

@noahpodgurski Thanks again for the feedback! I've pushed changes, and the PR is ready for another round of review when you have a chance! 🙏

randy-concepcion · 2026-06-17T20:09:22Z

/build-pr

randy-concepcion · 2026-06-17T22:00:52Z

Manually verified on dev. Grafana link for 2026-06-17 time window.

Prometheus: Metrics Explorer

MLPA Availability in Grafana (includes failures)

github-actions Bot temporarily deployed to build June 12, 2026 16:55 Inactive

randy-concepcion force-pushed the chat-availability-sli branch 2 times, most recently from 024ccc6 to 2d6922c Compare June 13, 2026 05:10

randy-concepcion requested review from a team, noahpodgurski and subpath and removed request for a team June 13, 2026 05:12

randy-concepcion marked this pull request as ready for review June 13, 2026 14:25

randy-concepcion requested a review from a team as a code owner June 13, 2026 14:25

randy-concepcion removed the request for review from a team June 13, 2026 14:26

noahpodgurski requested changes Jun 15, 2026

randy-concepcion commented Jun 16, 2026

randy-concepcion force-pushed the chat-availability-sli branch from 9f1be88 to ad67fd1 Compare June 16, 2026 21:43

randy-concepcion requested a review from noahpodgurski June 16, 2026 22:52

randy-concepcion added 7 commits June 17, 2026 09:59

add chat completion availability counter

9ffd350

instrument pre-completion chat availability

b48aeed

clean up comments and remove unused import

ab9d374

docs: clarify purpose placeholder and 5xx failure wording

75f3fe0

record PAYLOAD_TOO_LARGE availability in request-size middleware

b8d9271

order availability reasons by request stage

3e64fbf

rename availability reasons for clarity (valid_response, rate_limited…

2717537

…_platform)

randy-concepcion force-pushed the chat-availability-sli branch from ad67fd1 to 2717537 Compare June 17, 2026 13:59

populate service_type and purpose on pre-completion availability records

6318591

github-actions Bot deployed to build June 17, 2026 20:09 Active
