[Feature] Close the Terraform-vs-Bicep parity gap on items called out in the Troubleshooting guide

## What problem would this solve?

Several rows in the [README Troubleshooting table](../blob/main/README.md#troubleshooting) describe friction that is **Terraform-specific** — the same `azd up` run against the Bicep variant either does not hit the issue at all, or surfaces a much clearer error. Because we ship both variants side-by-side, the disparity is visible to anyone who tries the Terraform path.

Most of these are rooted in upstream provider gaps (and are tracked there — see [hashicorp/terraform-provider-azurerm#31140](https://github.com/hashicorp/terraform-provider-azurerm/issues/31140) for the biggest one), so we cannot fully close them in this repo. But for each, there is a meaningful **mitigation we can ship at this layer** — either a `precondition`, a CI guard, a default-value alignment, or a clearer in-product error — that would bring the Terraform experience much closer to the Bicep one.

This issue collects all of them in one place so they can be triaged and chipped away at together rather than rediscovered piecemeal.

## Concrete items

Each item below cites the troubleshooting row that motivates it, the **Bicep behavior** (baseline), the **current Terraform behavior**, and a **proposed mitigation that lives in this repo** (not upstream).

### 1. Opaque `400 715-123420` quota error from `azapi_resource`

- **Troubleshooting row:** *"Opaque `400 715-123420 ...` on the Terraform deployment step"* + *"Quota looks full but you have no live deployments"*.
- **Bicep:** ARM preflight translates the same condition into `InsufficientQuota: This operation require N new capacity in quota Tokens Per Minute (thousands) - Claude <model>, which is bigger than the current available capacity X.`
- **Terraform:** `azapi_resource` bypasses ARM preflight and the Cognitive Services RP returns the generic `715-123420 "An error occurred. Please reach out to support for additional assistance."` with no hint that quota is the cause.
- **Already mitigated in this repo:** the `preprovision` hook ([`scripts/preflight-claude.ps1`](../blob/main/scripts/preflight-claude.ps1)) runs a quota check and exits 6 with a clear message before `azd up` ever calls the RP.
- **Gap:** the preflight only runs under `azd up`. A user who runs `terraform apply` directly gets the raw opaque error.
- **Proposed:**
  - Add a `terraform_data` resource with a `precondition` block that reads quota via the `azapi_resource_action` data source (or a `local-exec` shelling out to `az cognitiveservices usage list`) and fails the plan if `currentValue + requestedCapacity > limit`. The plan-time error message would name the variable to lower.
  - Alternatively, add a `postcondition` on the deployment resource that pattern-matches `715-123420` in the error string and re-raises it with the same remediation text we already have in README.

### 2. Soft-deleted Cognitive accounts hold quota for 48 h, manifest as `715-123420`

- **Troubleshooting row:** *"Quota looks full but you have no live deployments"*.
- **Bicep:** same root cause but surfaces as `InsufficientQuota` (see #1), at least pointing the user at quota.
- **Terraform:** indistinguishable from any other 715-123420.
- **Already mitigated:** preflight + README walkthrough.
- **Proposed:** preflight should list soft-deleted accounts in the target region and warn (not fail) when found, so a fresh user does not have to learn this only after the deployment fails. Pseudo-output:
  ```
  Warning: 3 soft-deleted Cognitive accounts in eastus2 are holding ~50 TPM of
  claude-sonnet-4-6 quota. They will auto-purge in <date>. To free immediately:
     az cognitiveservices account purge ...
  ```

### 3. `azurerm_cognitive_account` / `azurerm_cognitive_deployment` cannot set `allowProjectManagement` + `modelProviderData`

- **Troubleshooting row:** *"AnthropicOrganizationCreationException / AnthropicOrganizationCreationFailed"* + the *"Why `modelProviderData` matters"* details block.
- **Bicep:** native resources at API version `2025-10-01-preview` support both fields.
- **Terraform:** we have to use `azapi_resource` with `schema_validation_enabled = false`. This is tracked upstream in [hashicorp/terraform-provider-azurerm#31140](https://github.com/hashicorp/terraform-provider-azurerm/issues/31140).
- **Already mitigated:** the Terraform variant uses `azapi_resource` everywhere and we set all three `modelProviderData` fields.
- **Proposed:**
  - Add a CI guard that fails the Terraform validate job if anyone changes `infra-terraform/infra/main.tf` to drop `schema_validation_enabled = false` or the three required `modelProviderData` keys (`organizationName`, `countryCode`, `industry`). The constraint is invisible at `terraform validate` time and easy to break by accident.
  - Add a watcher (link / Dependabot-style note) so when the upstream PR lands and `azurerm_cognitive_deployment` supports `modelProviderData` natively, we can migrate from `azapi_resource` and shorten `main.tf`.

### 4. Built-in role assignments break silently when Azure renames roles

- **Troubleshooting row:** not currently a row, but documented in [PR #2](../pull/2). Roles `Azure AI User` → `Foundry User` and `Azure AI Project Manager` → `Foundry Project Manager` got renamed by Azure mid-flight.
- **Bicep:** referenced the roles by GUID (`53ca6127-...`, `eadc314b-...`) → kept working through the rename.
- **Terraform (before PR #2):** referenced them by literal name → broke with `Role "Azure AI User" doesn't exist` until we converted to GUIDs.
- **Already mitigated:** Terraform now uses GUIDs.
- **Proposed:** add a one-line CI guard (grep) that fails if any literal Azure built-in role *name* appears in `infra-terraform/infra/*.tf` or `infra-bicep/infra/*.bicep` outside of comments/`(formerly ...)` parentheticals. Keeps us from regressing.

### 5. `ASSIGN_RBAC` defaults differ between variants

- **Troubleshooting row:** *"`403 Forbidden`"* + the [granting data-plane roles after `azd up`](../blob/main/README.md#granting-data-plane-roles-after-azd-up) one-liner.
- **Bicep (`main.parameters.json`):** `ASSIGN_RBAC=${ASSIGN_RBAC=false}`.
- **Terraform (`main.tfvars.json`):** `ASSIGN_RBAC=${ASSIGN_RBAC=false}`.
- Actually equal today after our recent alignment — **but** the Terraform variant has a documented quirk where data-plane role propagation lag manifests differently because the role assignments and the deployment are sequenced differently. The README walkthrough timing notes that the TF deployment is long enough for RBAC to settle in the same `azd up` run, but on Bicep the deployment is faster so a fresh `python src/hello_claude.py` may hit the intermittent 401 more often.
- **Proposed:** add a brief variant-parity note in `infra-terraform/azure.yaml` and the README walkthrough that calls out *why* the perceived 401 frequency differs across variants, instead of leaving users to discover it.

### 6. `variables.tf` defaults silently overridden by `main.tfvars.json` `${VAR=N}` literal

- **Troubleshooting row:** not a row, but burned us in PR #3 (capacity defaults lowered 50→25). Equivalent `main.parameters.json` issue on the Bicep side.
- **Bicep:** parameter default lives in `main.bicep`; `main.parameters.json` only injects env vars — no second source of truth.
- **Terraform:** `variables.tf` default is **shadowed** by the `${VAR=N}` literal in `main.tfvars.json` when the env var is unset. Lowering `variables.tf` to 25 while leaving `main.tfvars.json` at 50 produces a silent regression.
- **Already mitigated:** PR #3 aligned all four (haiku/sonnet/opus/legacy) on both sides.
- **Proposed:** CI guard that diffs the `=N` defaults in `*.tfvars.json` / `*.parameters.json` against their corresponding `variables.tf` / `main.bicep` defaults and fails on mismatch.

### 7. `hashicorp/setup-terraform@v3` + TF ≤ 1.9 = expired provider PGP key

- **Troubleshooting row:** not user-facing, but part of the [CI section in repo memory](#) and a maintenance hazard.
- **Bicep:** no equivalent — `bicep build` is self-contained.
- **Terraform CI:** older `terraform_version` (e.g. 1.6.0) ships with stale embedded PGP keys for the rotated HashiCorp signing key, so `terraform init` fails with `openpgp: key expired` on `hashicorp/azurerm` and `hashicorp/random`. We currently pin `1.10.0` to dodge this.
- **Proposed:** add a `.terraform-version` file at repo root (or `infra-terraform/.terraform-version`) honored by `tfenv` / `setup-terraform`, so the same TF version is enforced locally and in CI. Avoids "works in CI, fails locally" / vice versa on the `terraform fmt -check` drift the README walkthrough already documents.

### 8. `terraform fmt` output drifts across TF versions

- **Troubleshooting row:** not a row; an internal CI gotcha from our session memory.
- **Bicep:** N/A.
- **Terraform:** Local TF 1.15.4 and CI 1.10.0 format `local` blocks differently. Either fails the other's `fmt -check`.
- **Proposed:** covered by #7 — pin a single TF version.

---

## Out of scope (filed upstream, not solvable here)

- The root cause of `715-123420` itself (the RP returning a generic code). Filed informally via the threads on [hashicorp/terraform-provider-azurerm#31140](https://github.com/hashicorp/terraform-provider-azurerm/issues/31140) and the upstream MS service-team backlog item that `@promisinganuj` references.
- Native `modelProviderData` support in `azurerm_cognitive_deployment`. Tracked upstream.
- The undocumented `modelProviderData` REST property in the [`azure-rest-api-specs` Cognitive Services swagger](https://github.com/Azure/azure-rest-api-specs/blob/main/specification/cognitiveservices/resource-manager/Microsoft.CognitiveServices/stable/2025-09-01/cognitiveservices.json). Documentation gap is acknowledged by the service PM (per the same upstream thread).

## Acceptance criteria for closing this issue

- [ ] Items 1, 2, 3, 4, 6 have either a PR landed or an explicit "wontfix" rationale in a follow-up comment.
- [ ] Items 5, 7, 8 are addressed by a docs / version-pin PR.
- [ ] Troubleshooting table updated to reflect any rows that are no longer needed because the underlying friction was eliminated.

## Where does this change land?

- Terraform IaC (`infra-terraform/`) — most items.
- Scripts / hooks (`scripts/preflight-claude.*`) — items 1, 2.
- README / docs — items 5, troubleshooting cleanup.
- CI / GitHub config (`.github/workflows/validate.yml`) — items 3, 4, 6, 7, 8.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Close the Terraform-vs-Bicep parity gap on items called out in the Troubleshooting guide #19

What problem would this solve?

Concrete items

1. Opaque `400 715-123420` quota error from `azapi_resource`

2. Soft-deleted Cognitive accounts hold quota for 48 h, manifest as `715-123420`

3. `azurerm_cognitive_account` / `azurerm_cognitive_deployment` cannot set `allowProjectManagement` + `modelProviderData`

4. Built-in role assignments break silently when Azure renames roles

5. `ASSIGN_RBAC` defaults differ between variants

6. `variables.tf` defaults silently overridden by `main.tfvars.json` `${VAR=N}` literal

7. `hashicorp/setup-terraform@v3` + TF ≤ 1.9 = expired provider PGP key

8. `terraform fmt` output drifts across TF versions

Out of scope (filed upstream, not solvable here)

Acceptance criteria for closing this issue

Where does this change land?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Feature] Close the Terraform-vs-Bicep parity gap on items called out in the Troubleshooting guide #19

Description

What problem would this solve?

Concrete items

1. Opaque 400 715-123420 quota error from azapi_resource

2. Soft-deleted Cognitive accounts hold quota for 48 h, manifest as 715-123420

3. azurerm_cognitive_account / azurerm_cognitive_deployment cannot set allowProjectManagement + modelProviderData

4. Built-in role assignments break silently when Azure renames roles

5. ASSIGN_RBAC defaults differ between variants

6. variables.tf defaults silently overridden by main.tfvars.json ${VAR=N} literal

7. hashicorp/setup-terraform@v3 + TF ≤ 1.9 = expired provider PGP key

8. terraform fmt output drifts across TF versions

Out of scope (filed upstream, not solvable here)

Acceptance criteria for closing this issue

Where does this change land?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Opaque `400 715-123420` quota error from `azapi_resource`

2. Soft-deleted Cognitive accounts hold quota for 48 h, manifest as `715-123420`

3. `azurerm_cognitive_account` / `azurerm_cognitive_deployment` cannot set `allowProjectManagement` + `modelProviderData`

5. `ASSIGN_RBAC` defaults differ between variants

6. `variables.tf` defaults silently overridden by `main.tfvars.json` `${VAR=N}` literal

7. `hashicorp/setup-terraform@v3` + TF ≤ 1.9 = expired provider PGP key

8. `terraform fmt` output drifts across TF versions