[Feature] Close the Terraform-vs-Bicep parity gap on items called out in the Troubleshooting guide #19

@achandmsft

What problem would this solve?

Several rows in the README Troubleshooting table describe friction that is Terraform-specific — the same azd up run against the Bicep variant either does not hit the issue at all, or surfaces a much clearer error. Because we ship both variants side-by-side, the disparity is visible to anyone who tries the Terraform path.

Most of these are rooted in upstream provider gaps (and are tracked there — see hashicorp/terraform-provider-azurerm#31140 for the biggest one), so we cannot fully close them in this repo. But for each, there is a meaningful mitigation we can ship at this layer — either a precondition, a CI guard, a default-value alignment, or a clearer in-product error — that would bring the Terraform experience much closer to the Bicep one.

This issue collects all of them in one place so they can be triaged and chipped away at together rather than rediscovered piecemeal.

Concrete items

Each item below cites the troubleshooting row that motivates it, the Bicep behavior (baseline), the current Terraform behavior, and a proposed mitigation that lives in this repo (not upstream).

1. Opaque 400 715-123420 quota error from azapi_resource

  • Troubleshooting row: "Opaque 400 715-123420 ... on the Terraform deployment step" + "Quota looks full but you have no live deployments".
  • Bicep: ARM preflight translates the same condition into InsufficientQuota: This operation require N new capacity in quota Tokens Per Minute (thousands) - Claude <model>, which is bigger than the current available capacity X.
  • Terraform: azapi_resource bypasses ARM preflight and the Cognitive Services RP returns the generic 715-123420 "An error occurred. Please reach out to support for additional assistance." with no hint that quota is the cause.
  • Already mitigated in this repo: the preprovision hook (scripts/preflight-claude.ps1) runs a quota check and exits 6 with a clear message before azd up ever calls the RP.
  • Gap: the preflight only runs under azd up. A user who runs terraform apply directly gets the raw opaque error.
  • Proposed:
    • Add a terraform_data resource with a precondition block that reads quota via the azapi_resource_action data source (or a local-exec shelling out to az cognitiveservices usage list) and fails the plan if currentValue + requestedCapacity > limit. The plan-time error message would name the variable to lower.
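    • Alternatively, add a postcondition on the deployment resource that pattern-matches 715-123420 in the error string and re-raises it with the same remediation text we already have in README. Caveat: Terraform evaluates postconditions only after a resource applies successfully, so catching the failed call itself would likely need a wrapper around terraform apply rather than a postcondition alone.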
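The comparison in the first proposal can also be prototyped outside Terraform. Below is a minimal Python sketch, assuming the usage entries have the shape that `az cognitiveservices usage list -o json` typically returns (a `name.value`, a `currentValue`, and a `limit` per entry). The function name, the message wording, and the `variable_name` parameter are hypothetical, not something the repo already defines.

```python
from typing import Iterable, Mapping, Optional


def quota_shortfall_message(
    usages: Iterable[Mapping],
    quota_name: str,
    requested_capacity: float,
    variable_name: str,
) -> Optional[str]:
    """Return a remediation message if the request would exceed quota, else None.

    Assumes each usage entry looks like
    {"name": {"value": ...}, "currentValue": ..., "limit": ...}.
    """
    for usage in usages:
        if usage.get("name", {}).get("value") != quota_name:
            continue
        current = float(usage["currentValue"])
        limit = float(usage["limit"])
        available = limit - current
        # Same condition as the proposed precondition:
        # currentValue + requestedCapacity > limit
        if current + requested_capacity > limit:
            return (
                f"Insufficient quota for {quota_name}: requested "
                f"{requested_capacity:g}, available {available:g}. "
                f"Lower {variable_name} or free quota first."
            )
        return None
    # Quota entry not found: report it rather than guessing.
    return f"No usage entry named {quota_name} was found in this region."
```

A caller would treat a non-None return value as a failure and print it, which keeps the check usable both from the preprovision hook and from a local-exec step.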

2. Soft-deleted Cognitive accounts hold quota for 48 h, manifest as 715-123420

  • Troubleshooting row: "Quota looks full but you have no live deployments".
  • Bicep: same root cause but surfaces as InsufficientQuota (see item 1), at least pointing the user at quota.
  • Terraform: indistinguishable from any other 715-123420.
  • Already mitigated: preflight + README walkthrough.
  • Proposed: preflight should list soft-deleted accounts in the target region and warn (not fail) when found, so a fresh user does not have to learn this only after the deployment fails. Pseudo-output:
    Warning: 3 soft-deleted Cognitive accounts in eastus2 are holding ~50 TPM of
    claude-sonnet-4-6 quota. They will auto-purge in <date>. To free immediately:
       az cognitiveservices account purge ...
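The warning step could be built from the soft-deleted account list. The following is a minimal Python sketch, assuming the input is the parsed JSON from `az cognitiveservices account list-deleted` and that each entry carries at least `name` and `location` keys. The function name and message wording are hypothetical, and the sketch does not attempt to compute how much quota each account holds.

```python
from typing import Iterable, Mapping, Optional


def soft_deleted_warning(
    deleted_accounts: Iterable[Mapping], region: str
) -> Optional[str]:
    """Build a non-fatal warning for soft-deleted accounts in one region.

    Assumes each entry has at least "name" and "location" keys.
    """
    wanted = region.replace(" ", "").lower()
    names = sorted(
        account["name"]
        for account in deleted_accounts
        # Normalize so "East US 2" and "eastus2" compare equal.
        if account.get("location", "").replace(" ", "").lower() == wanted
    )
    if not names:
        return None
    return (
        f"Warning: {len(names)} soft-deleted Cognitive accounts in {region} "
        f"may still be holding quota: {', '.join(names)}. "
        "Purge them to free the quota immediately."
    )
```

Returning a string instead of raising keeps the behavior at "warn, not fail", as the proposal asks.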
    

3. azurerm_cognitive_account / azurerm_cognitive_deployment cannot set allowProjectManagement + modelProviderData

  • Troubleshooting row: "AnthropicOrganizationCreationException / AnthropicOrganizationCreationFailed" + the "Why modelProviderData matters" details block.
  • Bicep: native resources at API version 2025-10-01-preview support both fields.
  • Terraform: we have to use azapi_resource with schema_validation_enabled = false. This is tracked upstream in hashicorp/terraform-provider-azurerm#31140.
  • Already mitigated: the Terraform variant uses azapi_resource everywhere and we set all three modelProviderData fields.
  • Proposed:
    • Add a CI guard that fails the Terraform validate job if anyone changes infra-terraform/infra/main.tf to drop schema_validation_enabled = false or the three required modelProviderData keys (organizationName, countryCode, industry). The constraint is invisible at terraform validate time and easy to break by accident.
    • Add a watcher (link / Dependabot-style note) so when the upstream PR lands and azurerm_cognitive_deployment supports modelProviderData natively, we can migrate from azapi_resource and shorten main.tf.
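The CI guard from the first proposal could start as a plain text check. Here is a minimal Python sketch, assuming the guard receives the contents of main.tf as a string. The function name is hypothetical, and the check is text-level only: it does not verify that the keys sit inside the right resource block.

```python
import re

REQUIRED_KEYS = ("organizationName", "countryCode", "industry")


def missing_constraints(main_tf_text: str) -> list[str]:
    """List the required settings that are absent from the Terraform source."""
    # Drop whole-line comments so a commented-out setting does not count.
    code = "\n".join(
        line
        for line in main_tf_text.splitlines()
        if not line.lstrip().startswith(("#", "//"))
    )
    missing = []
    if not re.search(r"\bschema_validation_enabled\s*=\s*false\b", code):
        missing.append("schema_validation_enabled = false")
    for key in REQUIRED_KEYS:
        if not re.search(rf"\b{key}\b", code):
            missing.append(key)
    return missing
```

A workflow step would fail the job whenever the returned list is non-empty and print the missing entries.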

4. Built-in role assignments break silently when Azure renames roles

  • Troubleshooting row: not currently a row, but documented in PR #2. Azure renamed two built-in roles mid-flight: Azure AI User (now Foundry User) and Azure AI Project Manager (now Foundry Project Manager).
  • Bicep: referenced the roles by GUID (53ca6127-..., eadc314b-...) → kept working through the rename.
  • Terraform (before PR #2, "Claude quota inspector, preflight, workspace-scoped az login, Foundry role rename"): referenced them by literal name → broke with Role "Azure AI User" doesn't exist until we converted to GUIDs.
  • Already mitigated: Terraform now uses GUIDs.
  • Proposed: add a one-line CI guard (grep) that fails if any literal Azure built-in role name appears in infra-terraform/infra/*.tf or infra-bicep/infra/*.bicep outside of comments/(formerly ...) parentheticals. Keeps us from regressing.
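If a bare grep proves too blunt for the comment and "(formerly ...)" exclusions, the guard could be a few lines of Python. This is a minimal sketch under stated assumptions: the role list contains only the names mentioned in this issue, comments are recognized naively by `#` or `//` (so a `#` or `//` inside a string literal would truncate the line), and the function name is hypothetical.

```python
import re

# Role names taken from this issue; extend the tuple as needed.
ROLE_NAMES = (
    "Azure AI User",
    "Azure AI Project Manager",
    "Foundry User",
    "Foundry Project Manager",
)
_ROLE_RE = re.compile("|".join(re.escape(name) for name in ROLE_NAMES))
_FORMERLY_RE = re.compile(r"\(formerly[^)]*\)")


def literal_role_hits(source: str) -> list[tuple[int, str]]:
    """Return (line_number, role_name) for role names used outside comments."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Strip trailing # or // comments (naive: ignores quoting).
        code = re.split(r"#|//", line, maxsplit=1)[0]
        # Ignore "(formerly ...)" parentheticals that survive in code.
        code = _FORMERLY_RE.sub("", code)
        for match in _ROLE_RE.finditer(code):
            hits.append((number, match.group(0)))
    return hits
```

Running it over each .tf and .bicep file and failing on any hit would give the regression protection described above.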

5. ASSIGN_RBAC defaults differ between variants

  • Troubleshooting row: "403 Forbidden" + the granting data-plane roles after azd up one-liner.
  • Bicep (main.parameters.json): ASSIGN_RBAC=${ASSIGN_RBAC=false}.
  • Terraform (main.tfvars.json): ASSIGN_RBAC=${ASSIGN_RBAC=false}.
  • The defaults are equal today after our recent alignment. The Terraform variant still has a documented quirk, though: data-plane role propagation lag manifests differently because the role assignments and the deployment are sequenced differently. The timing notes in the README walkthrough say that the TF deployment takes long enough for RBAC to settle within the same azd up run, whereas the Bicep deployment is faster, so a fresh python src/hello_claude.py may hit the intermittent 401 more often on Bicep.
  • Proposed: add a brief variant-parity note in infra-terraform/azure.yaml and the README walkthrough that calls out why the perceived 401 frequency differs across variants, instead of leaving users to discover it.

6. variables.tf defaults silently overridden by main.tfvars.json ${VAR=N} literal

  • Troubleshooting row: not a row, but burned us in PR #3, "Lower capacity defaults to 25 and auto-pin workspace Claude model" (capacity defaults lowered 50→25). Equivalent main.parameters.json issue on the Bicep side.
  • Bicep: parameter default lives in main.bicep; main.parameters.json only injects env vars — no second source of truth.
  • Terraform: variables.tf default is shadowed by the ${VAR=N} literal in main.tfvars.json when the env var is unset. Lowering variables.tf to 25 while leaving main.tfvars.json at 50 produces a silent regression.
  • Already mitigated: PR #3 aligned all four (haiku/sonnet/opus/legacy) on both sides.
  • Proposed: CI guard that diffs the =N defaults in *.tfvars.json / *.parameters.json against their corresponding variables.tf / main.bicep defaults and fails on mismatch.
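The Terraform half of this guard could look roughly like the Python sketch below. It assumes main.tfvars.json is a flat JSON object whose keys are Terraform variable names and whose values use the `${ENV_NAME=default}` form, and that variables.tf writes each default on its own `default = ...` line. All function names are hypothetical, and the variables.tf parser is deliberately naive (no nested-block awareness).

```python
import json
import re

# Matches azd-style "${ENV_NAME=default}" substitutions.
_SUBST_RE = re.compile(r"^\$\{(\w+)=([^}]*)\}$")


def tfvars_defaults(tfvars_json: str) -> dict[str, str]:
    """Map each variable name to the fallback literal in its ${VAR=N} value."""
    result = {}
    for key, value in json.loads(tfvars_json).items():
        match = _SUBST_RE.match(value) if isinstance(value, str) else None
        if match:
            result[key] = match.group(2)
    return result


def variables_tf_defaults(variables_tf: str) -> dict[str, str]:
    """Map each variable name to its `default = ...` literal (naive line scan)."""
    defaults, current = {}, None
    for line in variables_tf.splitlines():
        stripped = line.split("#", 1)[0].strip()
        header = re.match(r'variable\s+"(\w+)"', stripped)
        if header:
            current = header.group(1)
            continue
        value = re.match(r"default\s*=\s*(.+)$", stripped)
        if value and current:
            defaults[current] = value.group(1).strip().strip('"')
    return defaults


def default_mismatches(tfvars_json: str, variables_tf: str) -> dict[str, tuple[str, str]]:
    """Return {variable: (tfvars_fallback, variables_tf_default)} where they differ."""
    declared = variables_tf_defaults(variables_tf)
    return {
        name: (fallback, declared[name])
        for name, fallback in tfvars_defaults(tfvars_json).items()
        if name in declared and declared[name] != fallback
    }
```

Values are compared as strings, so `25` and `25.0` would count as a mismatch; that strictness is a design choice of this sketch and could be relaxed.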

7. hashicorp/setup-terraform@v3 + TF ≤ 1.9 = expired provider PGP key

  • Troubleshooting row: not user-facing, but part of the CI section in repo memory and a maintenance hazard.
  • Bicep: no equivalent — bicep build is self-contained.
  • Terraform CI: older terraform_version (e.g. 1.6.0) ships with stale embedded PGP keys for the rotated HashiCorp signing key, so terraform init fails with openpgp: key expired on hashicorp/azurerm and hashicorp/random. We currently pin 1.10.0 to dodge this.
  • Proposed: add a .terraform-version file at repo root (or infra-terraform/.terraform-version) that tfenv honors locally and that the CI workflow reads to set the terraform_version input of setup-terraform, so the same TF version is enforced locally and in CI. This avoids "works in CI, fails locally" (and vice versa) on the terraform fmt -check drift the README walkthrough already documents.
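One way to keep the version file and the CI pin from drifting apart is a small consistency check. The Python sketch below assumes the workflow pins Terraform through a `terraform_version:` key (the input name used by hashicorp/setup-terraform) and that the version file holds a single version string. The function name is hypothetical.

```python
import re

# Captures the value of every `terraform_version:` key, quoted or not.
_PIN_RE = re.compile(
    r"""^\s*terraform_version:\s*["']?([^"'\s#]+)""", re.MULTILINE
)


def mismatched_pins(version_file_text: str, workflow_text: str) -> list[str]:
    """Return workflow terraform_version values that differ from the version file."""
    expected = version_file_text.strip()
    return [pin for pin in _PIN_RE.findall(workflow_text) if pin != expected]
```

A non-empty result means at least one workflow job pins a different Terraform version than the one developers get locally.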

8. terraform fmt output drifts across TF versions

  • Troubleshooting row: not a row; an internal CI gotcha from our session memory.
  • Bicep: N/A.
  • Terraform: Local TF 1.15.4 and CI 1.10.0 format local blocks differently. Either fails the other's fmt -check.
  • Proposed: covered by item 7 (pin a single TF version).

Out of scope (filed upstream, not solvable here)

  • The root cause of 715-123420 itself (the RP returning a generic code). Filed informally via the threads on hashicorp/terraform-provider-azurerm#31140 and the upstream MS service-team backlog item that @promisinganuj references.
  • Native modelProviderData support in azurerm_cognitive_deployment. Tracked upstream.
  • The undocumented modelProviderData REST property in the azure-rest-api-specs Cognitive Services swagger. Documentation gap is acknowledged by the service PM (per the same upstream thread).

Acceptance criteria for closing this issue

  • Items 1, 2, 3, 4, 6 have either a PR landed or an explicit "wontfix" rationale in a follow-up comment.
  • Items 5, 7, 8 are addressed by a docs / version-pin PR.
  • Troubleshooting table updated to reflect any rows that are no longer needed because the underlying friction was eliminated.

Where does this change land?

  • Terraform IaC (infra-terraform/) — most items.
  • Scripts / hooks (scripts/preflight-claude.*) — items 1, 2.
  • README / docs — item 5, troubleshooting cleanup.
  • CI / GitHub config (.github/workflows/validate.yml) — items 3, 4, 6, 7, 8.

Metadata

    Labels

    documentation (Improvements or additions to documentation), enhancement (New feature or request)
