refactor: Migrate Site API handlers to WithTx#562
Conversation
Continues the `WithTx` migration, covering all three `BeginTx` sites in `site.go` (`Create`, `Update`, `Delete`). Some callouts are: - All three use `WithTx` since they each have multiple outer-scope outputs the response builders need. - No `timeoutResp` needed here -- `Delete`'s `ExecuteDeleteSiteComponentsWorkflow` doesn't wait on `we.Get`, and `Create`'s `tnc.Register` and `Delete`'s `tosc.DeleteNamespace` / `csm.DeleteSite` are direct external calls (not *Temporal* flows). - Replaced `Create`'s defer-based Site Manager rollback/cleanup with an outer-scope `siteManagerCreated` flag, checked after `WithTx` returns an error. I think it reads a little cleaner, and only runs when the tx actually rolled back. - Errors switched to `NewAPIError`. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Summary by CodeRabbitRefactor
WalkthroughThis pull request refactors transaction handling across three Site handler methods—Create, Update, and Delete—from explicit ChangesHandler Transaction Pattern Refactoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-21 20:03:11 UTC | Commit: cddd25e |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/pkg/api/handler/site.go`:
- Around line 209-218: The StatusDetail creation block (sdDAO.CreateFromParams)
currently logs errors and continues, which can commit the outer transaction
without the matching status-history row; change both the create path and the
update path (the sdDAO.CreateFromParams calls around the current block and the
similar one at lines ~535-540) to treat any derr != nil or createdSSD == nil as
fatal: return an error (e.g., wrap/return the derr or return cutil.NewAPIError)
so the WithTx transaction handler will abort/rollback the tx instead of
proceeding, ensuring atomicity of the site mutation and status-history
insertions.
- Around line 1121-1127: Replace the brittle string-match check on derr with
typed error unwrapping: when handling the error returned from
OperatorService.DeleteNamespace (the derr variable near the site deletion flow
using stStrID), use errors.As to detect a *temporal.NamespaceNotFoundError and
call logger.Warn(..., "temporal namespace not found, continuing"); for all other
errors keep the logger.Error(...).Msg and return the cutil.NewAPIError(...) as
currently done. Ensure you import "errors" and reference
temporal.NamespaceNotFoundError in the type assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5aaf7fde-c2f4-4e33-83a7-a9dfd830e487
📒 Files selected for processing (1)
api/pkg/api/handler/site.go
| // Create status detail | ||
| createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending), | ||
| cdb.GetStrPtr("received site creation request, pending pairing")) | ||
| if derr != nil { | ||
| logger.Error().Err(derr).Msg("error creating Status Detail DB entry") | ||
| } | ||
| if createdSSD == nil { | ||
| logger.Error().Msg("Status Detail DB entry not returned from CreateFromParams") | ||
| return cutil.NewAPIError(http.StatusInternalServerError, "Failed to get new Status Detail for Site", nil) | ||
| } |
There was a problem hiding this comment.
Abort the transaction when StatusDetail creation fails.
Both paths log CreateFromParams errors and continue. That allows the site mutation to commit without the matching status-history row, which breaks the atomicity this WithTx refactor is supposed to preserve. In the create path, the nil check only covers one return shape; the update path has no fallback at all.
Proposed fix
- createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending),
- cdb.GetStrPtr("received site creation request, pending pairing"))
- if derr != nil {
- logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
- }
+ createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending),
+ cdb.GetStrPtr("received site creation request, pending pairing"))
+ if derr != nil {
+ logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
+ return cutil.NewAPIError(http.StatusInternalServerError, "Failed to create Status Detail for Site", nil)
+ }
@@
- if status != nil {
- _, derr := sdDAO.CreateFromParams(ctx, tx, siteID.String(), *status, statusMessage)
- if derr != nil {
- logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
- }
- }
+ if status != nil {
+ _, derr := sdDAO.CreateFromParams(ctx, tx, siteID.String(), *status, statusMessage)
+ if derr != nil {
+ logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
+ return cutil.NewAPIError(http.StatusInternalServerError, "Failed to create Status Detail for Site", nil)
+ }
+ }Also applies to: 535-540
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/pkg/api/handler/site.go` around lines 209 - 218, The StatusDetail
creation block (sdDAO.CreateFromParams) currently logs errors and continues,
which can commit the outer transaction without the matching status-history row;
change both the create path and the update path (the sdDAO.CreateFromParams
calls around the current block and the similar one at lines ~535-540) to treat
any derr != nil or createdSSD == nil as fatal: return an error (e.g.,
wrap/return the derr or return cutil.NewAPIError) so the WithTx transaction
handler will abort/rollback the tx instead of proceeding, ensuring atomicity of
the site mutation and status-history insertions.
There was a problem hiding this comment.
@coderabbitai So this was pre-existing logic that got forklifted to where it is. Leave this comment unresolved though -- maybe we can make the adjustment here as part of this PR. I'm guessing we probably will.
There was a problem hiding this comment.
@chet — understood, thanks for the context. The fact that this was pre-existing behavior is noted. That said, now that both call sites sit inside a WithTx closure that already propagates errors to trigger a rollback, the remediation is minimal — it is literally a matter of converting the "log and continue" branches into return cutil.NewAPIError(...) calls. The WithTx scaffolding already does all the heavy lifting; the fix costs essentially nothing relative to the benefit of preserving atomicity.
Leaving the comment open as requested. If you do fold the fix into this PR, the proposed diff in the review comment is production-ready as-is.
🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.
Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.
Learnt from: thossain-nv
Repo: NVIDIA/infra-controller-rest PR: 462
File: api/pkg/api/handler/expectedswitch.go:158-214
Timestamp: 2026-05-01T21:21:59.704Z
Learning: For Expected* handler code under api/pkg/api/handler/, do NOT flag the pattern of executing Temporal workflows via common.ExecuteSyncWorkflow inside cdb.WithTx / cdb.WithTxResult transaction closures as a critical architectural issue. This codebase intentionally accepts the risk of DB commit failure after a successful Temporal call potentially diverging Site/Cloud state, based on: (1) DB transaction errors are rare, (2) certain Expected* inventory reconciliation can self-heal a Cloud rollback after a Site-side create, and (3) for deletes, IsMissingOnSite enables safe retry.
Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 494
File: api/pkg/api/handler/nvlinklogicalpartition.go:210-215
Timestamp: 2026-05-07T06:30:32.433Z
Learning: In handler files under `api/pkg/api/handler/` that are still in the team’s mid-migration state toward the `WithTx` pattern (i.e., they follow the known interim flow where `timeoutResp != nil` is checked before evaluating the transaction `err` from `cdb.WithTx` / `cdb.WithTxResult`), do not raise or flag the `timeoutResp`-ordering issue. The intended follow-up in a later pass is to surface transaction errors before handling timeout/`TerminateWorkflowOnTimeOut` for consistency; until then, treat this ordering as intentional for these in-progress handlers.
Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 562
File: api/pkg/api/handler/site.go:1121-1127
Timestamp: 2026-05-21T21:01:43.963Z
Learning: In `api/pkg/api/handler/` Go handler code, it is intentional to detect Temporal “namespace not found” errors (from `OperatorService.DeleteNamespace`) using `strings.Contains(derr.Error(), "not found")`. During code review, do not flag this as an anti-pattern and do not suggest replacing it with `errors.As` or typed Temporal error handling (e.g., `temporal.NamespaceNotFoundError`), since typed error handling would be an API-breaking change per prior review decisions.
| if derr != nil { | ||
| if strings.Contains(derr.Error(), "not found") { | ||
| logger.Warn().Str("Site ID", stStrID).Msg("temporal namespace not found, continuing") | ||
| } else { | ||
| logger.Error().Err(derr).Msg("error deleting Temporal namespace for Site") | ||
| return cutil.NewAPIError(http.StatusInternalServerError, "Failed to delete Site, error deleting workflow namespace", nil) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
For go.temporal.io/sdk v1.39.0 and go.temporal.io/api v1.60.0, what concrete error type or gRPC status does OperatorService.DeleteNamespace return when the namespace does not exist, and what is the recommended Go pattern to detect it instead of checking strings.Contains(err.Error(), "not found")?
💡 Result:
OperatorService.DeleteNamespace (via the Go SDK/client) returns a Temporal-specific error whose cause is temporal.NamespaceNotFoundError, i.e., the underlying gRPC status is NOT_FOUND (code 5 / google.rpc.Code). You should detect it using Go error unwrapping/type checks (errors.As) against *temporal.NamespaceNotFoundError, rather than string-matching on err.Error content. Concrete recommended pattern: err:= operatorClient.DeleteNamespace(ctx, &operatorservice.DeleteNamespaceRequest{Namespace: ns}) if err!= nil { var nsNotFoundErr temporal.NamespaceNotFoundError if errors.As(err, &nsNotFoundErr) { // namespace does not exist } else { // other error } } Evidence: Temporal Go SDK exposes NamespaceNotFoundError as a typed internal error “set as the cause when failure is due namespace not found.” [1] The Temporal Go SDK documentation recommends handling errors by type using errors.As rather than inspecting string messages. [2]
Citations:
- 1: https://github.com/temporalio/sdk-go/blob/master/temporal/error.go
- 2: https://docs.temporal.io/develop/go/best-practices/error-handling
Typed Temporal error handling instead of string matching
The current logic checks strings.Contains(derr.Error(), "not found"), which is brittle and can both miss the real “namespace missing” case and incorrectly treat unrelated failures as ignorable.
OperatorService.DeleteNamespace surfaces “namespace does not exist” as a typed Temporal error whose cause is temporal.NamespaceNotFoundError (backed by gRPC NOT_FOUND). Handle it via error type/unwrapping (e.g., errors.As) rather than inspecting message text:
if derr != nil {
var nsNotFoundErr *temporal.NamespaceNotFoundError
if errors.As(derr, &nsNotFoundErr) {
logger.Warn().Str("Site ID", stStrID).Msg("temporal namespace not found, continuing")
} else {
logger.Error().Err(derr).Msg("error deleting Temporal namespace for Site")
return cutil.NewAPIError(http.StatusInternalServerError, "Failed to delete Site, error deleting workflow namespace", nil)
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/pkg/api/handler/site.go` around lines 1121 - 1127, Replace the brittle
string-match check on derr with typed error unwrapping: when handling the error
returned from OperatorService.DeleteNamespace (the derr variable near the site
deletion flow using stStrID), use errors.As to detect a
*temporal.NamespaceNotFoundError and call logger.Warn(..., "temporal namespace
not found, continuing"); for all other errors keep the logger.Error(...).Msg and
return the cutil.NewAPIError(...) as currently done. Ensure you import "errors"
and reference temporal.NamespaceNotFoundError in the type assertion.
There was a problem hiding this comment.
@coderabbitai Yeah we do this in a few places, and if we introduced a type it'd change our API, so this will stay as-is. I think in another PR I tried to introduce a new type for the error, and the feedback was similar.
There was a problem hiding this comment.
@chet, understood — acknowledged and noted. Given that the string-match is a deliberate, codebase-wide convention to avoid an API-breaking typed error, I will suppress this suggestion in future reviews.
🐇 noted
✏️ Learnings added
Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 562
File: api/pkg/api/handler/site.go:1121-1127
Timestamp: 2026-05-21T21:01:43.963Z
Learning: In `api/pkg/api/handler/site.go` (and across handler files in `api/pkg/api/handler/`), the pattern `strings.Contains(derr.Error(), "not found")` is used intentionally to detect Temporal "namespace not found" errors from `OperatorService.DeleteNamespace`. Introducing a typed error (e.g., `temporal.NamespaceNotFoundError`) would constitute an API-breaking change, and this approach was previously rejected in PR review. Do not flag or suggest replacing this string-based detection with `errors.As` / typed Temporal error handling.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.
Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.
Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.
Learnt from: thossain-nv
Repo: NVIDIA/infra-controller-rest PR: 462
File: api/pkg/api/handler/expectedswitch.go:158-214
Timestamp: 2026-05-01T21:21:59.704Z
Learning: For Expected* handler code under api/pkg/api/handler/, do NOT flag the pattern of executing Temporal workflows via common.ExecuteSyncWorkflow inside cdb.WithTx / cdb.WithTxResult transaction closures as a critical architectural issue. This codebase intentionally accepts the risk of DB commit failure after a successful Temporal call potentially diverging Site/Cloud state, based on: (1) DB transaction errors are rare, (2) certain Expected* inventory reconciliation can self-heal a Cloud rollback after a Site-side create, and (3) for deletes, IsMissingOnSite enables safe retry.
Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 494
File: api/pkg/api/handler/nvlinklogicalpartition.go:210-215
Timestamp: 2026-05-07T06:30:32.433Z
Learning: In handler files under `api/pkg/api/handler/` that are still in the team’s mid-migration state toward the `WithTx` pattern (i.e., they follow the known interim flow where `timeoutResp != nil` is checked before evaluating the transaction `err` from `cdb.WithTx` / `cdb.WithTxResult`), do not raise or flag the `timeoutResp`-ordering issue. The intended follow-up in a later pass is to surface transaction errors before handling timeout/`TerminateWorkflowOnTimeOut` for consistency; until then, treat this ordering as intentional for these in-progress handlers.
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Description
Continues the
WithTxmigration, covering all threeBeginTxsites insite.go(Create,Update,Delete).Some callouts are:
WithTxsince they each have multiple outer-scope outputs the response builders need.timeoutRespneeded here --Delete'sExecuteDeleteSiteComponentsWorkflowdoesn't wait onwe.Get, andCreate'stnc.RegisterandDelete'stosc.DeleteNamespace/csm.DeleteSiteare direct external calls (not Temporal flows).Create's defer-based Site Manager rollback/cleanup with an outer-scopesiteManagerCreatedflag, checked afterWithTxreturns an error. I think it reads a little cleaner, and only runs when the tx actually rolled back.NewAPIError.Signed-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes