bug: Cognitive Services OpenAI User role not assigned to managed identity — 401 on all chat completions #23

Description

@vrajakishore

Summary

After a successful devclaw up / devclaw deploy, the container starts and the gateway reports ready — but every chat message fails with a 401. The container app's managed identity is never granted the Cognitive Services OpenAI User role on the Azure OpenAI resource.

Error seen in container logs

[agent/embedded] embedded run agent end: isError=true model=gpt-5.4-mini
error=LLM request failed. rawError=401 The principal `<principalId>` lacks
the required data action `Microsoft.CognitiveServices/accounts/OpenAI/
deployments/chat/completions/action` to perform
`POST /openai/v1/chat/completions` operation.

Steps to reproduce

  1. Run devclaw up to completion (it reports success)
  2. Open the WebChat UI — gateway shows Online, model shows gpt-5.4-mini
  3. Send any message
  4. Assistant returns [assistant turn failed before producing content]
  5. Container logs show 401 on every /openai/v1/chat/completions call

Root cause

Same pattern as the AcrPull gap. Bicep configures the container app to use AZURE_OPENAI_AUTH=managed-identity but does not create the Cognitive Services OpenAI User role assignment on the Azure OpenAI resource for the container app's principalId. The gateway has no API key to fall back to (by design), so every model call fails.
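
The gap can be confirmed from the CLI before changing anything. The following is a minimal sketch, assuming bash and the Azure CLI. has_role and list_roles are hypothetical helper names, not part of devclaw, and PRINCIPAL_ID and OPENAI_ID are assumed to be obtained the same way as in the manual workaround.

```shell
#!/usr/bin/env bash
# has_role: hypothetical helper. Reads role names (one per line) on stdin
# and succeeds only if the given role name is among them (exact match).
has_role() {
  grep -Fxq -- "$1"
}

# list_roles: hypothetical helper. Prints the names of the roles assigned
# to a principal directly at the given scope.
# Arguments: <principal-object-id> <scope-resource-id>
# Add --include-inherited to also count assignments made at a parent scope.
list_roles() {
  az role assignment list \
    --assignee "$1" \
    --scope "$2" \
    --query '[].roleDefinitionName' -o tsv
}

# Example (not executed here):
# list_roles "$PRINCIPAL_ID" "$OPENAI_ID" \
#   | has_role "Cognitive Services OpenAI User" \
#   || echo "role missing on the Azure OpenAI resource"
```

If the role is reported missing, that is consistent with the 401 in the container logs.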

Expected behaviour

Bicep should create a Cognitive Services OpenAI User role assignment on the Azure OpenAI resource for the container app's system-assigned managed identity, with correct dependsOn so it is in place before the first request.

Manual workaround

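# Object ID of the container app's system-assigned managed identity
PRINCIPAL_ID=$(az containerapp show -n <app> -g <rg> --query 'identity.principalId' -o tsv)
# Resource ID of the Azure OpenAI account (assumes it is the first Cognitive Services account in the resource group)
OPENAI_ID=$(az cognitiveservices account list -g <rg> --query '[0].id' -o tsv)
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Cognitive Services OpenAI User" \
  --scope "$OPENAI_ID"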

Note: RBAC propagation takes roughly 90 seconds after assignment, and it can take longer. The gateway does not need to be restarted.
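
Rather than waiting a fixed interval, the assignment can be polled. The following is a minimal sketch, assuming bash and the Azure CLI. wait_for and role_assigned are hypothetical helper names, and the 10-second interval and 18 attempts are arbitrary choices. The check only confirms that the assignment is listed by Azure RBAC; the data plane may still return 401 for a short time afterwards.

```shell
#!/usr/bin/env bash
# wait_for: hypothetical helper. Re-runs a command until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_for <max_attempts> <sleep_seconds> <command...>
wait_for() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while ! "$@"; do
    if (( attempt >= max_attempts )); then
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# role_assigned: hypothetical helper. Succeeds once the role assignment is
# listed at the Azure OpenAI scope. Relies on PRINCIPAL_ID and OPENAI_ID
# being set as in the workaround above.
role_assigned() {
  local count
  count=$(az role assignment list \
    --assignee "$PRINCIPAL_ID" \
    --scope "$OPENAI_ID" \
    --role "Cognitive Services OpenAI User" \
    --query 'length(@)' -o tsv) || return 1
  [ "$count" -gt 0 ]
}

# Example (not executed here): poll every 10 s, up to 18 attempts.
# wait_for 18 10 role_assigned && echo "role assignment visible"
```

wait_for returns non-zero if the command never succeeds, so a deployment script can fail loudly instead of leaving the gateway in a state where every chat completion returns 401.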

Related

Same root cause as #22 (AcrPull not assigned). Both point to Bicep RBAC sequencing gaps that surface when azd up exits before full propagation.

Environment

  • Region: southindia
  • Restricted corporate subscription with Azure Policy assignments active
