feat: Instrument pre-completion chat availability#173
Conversation
|
/build-pr |
|
Manual verification done in Verified manually
Not verified manually Dashboard
|
024ccc6 to
2d6922c
Compare
|
I've also added the same panels to our main Grafana dashboard so once these changes are deployed, we can start seeing them in Prod. MLPA Availability Panels (Prod) - No data available until deployed. |
noahpodgurski
left a comment
There was a problem hiding this comment.
Looks really good 🙌 just a few comments :)
| AvailabilityReason.INVALID_SERVICE_TYPE_FOR_MODEL, | ||
| model=chat_request.model, | ||
| service_type=service_type.value, | ||
| purpose="", |
There was a problem hiding this comment.
question: Why not pass the chat_request.purpose here?
There was a problem hiding this comment.
Good question. I found the chat_request object didn't carry the purpose field since it's a ChatRequest, but saw it in the AuthorizedChatRequest object (sub-class of ChatRequest).
I figured we should leave purpose set to "" as a placeholder until the request has been validated. I can add a comment here to make this more clear.
There was a problem hiding this comment.
That function has a purpose header parameter we could use - Or are you saying since it hasn't been officially resolved with _resolve_purpose we shouldn't pass it, since the value could be anything.
I guess I'm fine with either :)
There was a problem hiding this comment.
Ahh, yeah, you're right about the purpose parameter that was passed in, but your second point was exactly what I was thinking: "the value could be anything". So if we used the unresolved purpose, we'd be labeling the metric with a value that hasn't been validated yet, so it could be something arbitrary.
There was a problem hiding this comment.
Hmm actually now that I think about it more, I think it makes more sense to always pass it. This data could be helpful in making correlations later on. Also, in this particular case, we're already passing the invalid service type, so if we're already passing invalid values then I think purpose should also be included. We can always add whitelisting filtering down the road. WDYT?
There was a problem hiding this comment.
Gotcha. Yeah, those are fair points, the correlation value is worth having. Let me update it to pass purpose in. 👍
| PrometheusRejectionReason, AvailabilityReason | ||
| ] = { | ||
| PrometheusRejectionReason.BUDGET_EXCEEDED: AvailabilityReason.BUDGET_EXCEEDED, | ||
| PrometheusRejectionReason.PAYLOAD_TOO_LARGE: AvailabilityReason.PAYLOAD_TOO_LARGE, |
There was a problem hiding this comment.
suggestion: We should also add tracking this to our middleware functions. For instance if request_size.check_request_size_middleware returns 413 record_chat_availability with PAYLOAD_TOO_LARGE should also fire
There was a problem hiding this comment.
Ooo, I did not think of this. Great catch! I'll capture the PAYLOAD_TOO_LARGE for that check. One thing I noticed was that the middleware runs before the body is parsed and before auth, which means the model, service_type, and purpose aren't available yet. I'll set the "" placeholders for those and pass the reason and outcome through.
There was a problem hiding this comment.
Actually all of those values are available under request.headers and request.body. I think it'd be a good idea to pass those in - or were you thinking to explicitly exclude them since the request hasn't yet been verified?
There was a problem hiding this comment.
You're right! I think one thing we may want to avoid is reading in the whole body to get the model though since we're rejecting with the content-length check. Having service_type and purpose could still be useful info to have, so we can still pass those through.
randy-concepcion
left a comment
There was a problem hiding this comment.
Thanks for the review @noahpodgurski! I replied to each comment inline. I'll make some of these changes now and adjust according to your responses.
| AvailabilityReason.INVALID_SERVICE_TYPE_FOR_MODEL, | ||
| model=chat_request.model, | ||
| service_type=service_type.value, | ||
| purpose="", |
There was a problem hiding this comment.
Good question. I found the chat_request object didn't carry the purpose field since it's a ChatRequest, but saw it in the AuthorizedChatRequest object (sub-class of ChatRequest).
I figured we should leave purpose set to "" as a placeholder until the request has been validated. I can add a comment here to make this more clear.
| PrometheusRejectionReason, AvailabilityReason | ||
| ] = { | ||
| PrometheusRejectionReason.BUDGET_EXCEEDED: AvailabilityReason.BUDGET_EXCEEDED, | ||
| PrometheusRejectionReason.PAYLOAD_TOO_LARGE: AvailabilityReason.PAYLOAD_TOO_LARGE, |
There was a problem hiding this comment.
Ooo, I did not think of this. Great catch! I'll capture the PAYLOAD_TOO_LARGE for that check. One thing I noticed was that the middleware runs before the body is parsed and before auth, which means the model, service_type, and purpose aren't available yet. I'll set the "" placeholders for those and pass the reason and outcome through.
9f1be88 to
ad67fd1
Compare
ad67fd1 to
2717537
Compare
|
@noahpodgurski Thanks again for the feedback! I've pushed changes and is ready for another round of review when you have a chance! 🙏 |
|
/build-pr |
|
Manually verified on |




What's New
Adds pre-completion coverage to
mlpa_chat_availability_total, so auth rejections and user-provisioning failures show up in the chat availability signal.A chat request terminates at one of four points and records exactly one outcome:
authorize_chat_request) - all excluded:invalid_service_type_for_modelauth_rejectedinvalid_auth_requestget_or_create_user_for_completion):signup_cap_exceeded(excluded)provisioning_failure(failure, 5xx only)blocked(excluded)get_completion/stream_completion): already instrumentedNotes
HTTPExceptionand records only 401 and 400. Everything else, including App Attest's explicit 500, is re-raised without recording.invalid_auth_requestrather thaninvalid_purposebecause malformed App Attest base64 reaches the same path as purpose validation.auth_system_failureis defined but not emitted. See the limitation below.Known Limitation
This is an interim measurement. For FxA and Play Integrity, the same
401response is returned whether the request was correctly rejected (such as an expired token) or the authentication system itself had a problem (such as an outage). Since they can't be distinguished, all of these are counted as expected rejections and left out of the ratio for now.This means that real authentication-system failures are not yet counted as failures. This is a known gap, already noted in the Availability RFC and the SLI Inventory, and it will be addressed once those failures can be identified separately.
The availability ratio is
success / (success + failure), andexcludedandabortare held out.