Skip to content

refactor: Migrate Site API handlers to WithTx#562

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:with-tx-site
Open

refactor: Migrate Site API handlers to WithTx#562
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:with-tx-site

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented May 21, 2026

Description

Continues the WithTx migration, covering all three BeginTx sites in site.go (Create, Update, Delete).

Some callouts are:

  • All three use WithTx since they each have multiple outer-scope outputs the response builders need.
  • No timeoutResp needed here -- Delete's ExecuteDeleteSiteComponentsWorkflow doesn't wait on we.Get, and Create's tnc.Register and Delete's tosc.DeleteNamespace / csm.DeleteSite are direct external calls (not Temporal flows).
  • Replaced Create's defer-based Site Manager rollback/cleanup with an outer-scope siteManagerCreated flag, checked after WithTx returns an error. I think it reads a little cleaner, and only runs when the tx actually rolled back.
  • Errors switched to NewAPIError.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • Flow - Flow service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Continues the `WithTx` migration, covering all three `BeginTx` sites in `site.go` (`Create`, `Update`, `Delete`).

Some callouts are:
- All three use `WithTx` since they each have multiple outer-scope outputs the response builders need.
- No `timeoutResp` needed here -- `Delete`'s `ExecuteDeleteSiteComponentsWorkflow` doesn't wait on `we.Get`, and `Create`'s `tnc.Register` and `Delete`'s `tosc.DeleteNamespace` / `csm.DeleteSite` are direct external calls (not *Temporal* flows).
- Replaced `Create`'s defer-based Site Manager rollback/cleanup with an outer-scope `siteManagerCreated` flag, checked after `WithTx` returns an error. I think it reads a little cleaner, and only runs when the tx actually rolled back.
- Errors switched to `NewAPIError`.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner May 21, 2026 20:01
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

Summary by CodeRabbit

Refactor

  • Optimized database transaction handling for Site creation, updates, and deletion operations to ensure better reliability and consistency.

Walkthrough

This pull request refactors transaction handling across three Site handler methods—Create, Update, and Delete—from explicit BeginTx/Commit/Rollback patterns to idiomatic cdb.WithTx() closures, eliminating the database/sql dependency and standardizing error handling through callback semantics.

Changes

Handler Transaction Pattern Refactoring

Layer / File(s) Summary
Import cleanup and CreateSiteHandler refactoring
api/pkg/api/handler/site.go
Removes database/sql import. Refactors CreateSiteHandler to wrap Site/StatusDetail creation, Site-Manager entry creation, OTP update, and Temporal namespace registration in a cdb.WithTx closure. Site-Manager undo bookkeeping (siteManagerURL, siteManagerCreated) is moved outside transaction scope; failure cleanup executes only after WithTx returns an error, replacing the prior defer-based pattern.
UpdateSiteHandler refactoring
api/pkg/api/handler/site.go
Converts UpdateSiteHandler to cdb.WithTx, moving Site update, conditional StatusDetail creation, and StatusDetail read inside the transaction closure for read-your-writes consistency. Removes explicit commit/rollback and txCommitted state machine; error handling is delegated to common.HandleTxError.
DeleteSiteHandler refactoring
api/pkg/api/handler/site.go
Converts DeleteSiteHandler to cdb.WithTx, enclosing Site deletion, Temporal namespace deletion (with string-based "not found" detection via strings.Contains), and optional Site-Manager deletion (tolerating csm.ErrSiteNotFound with warning). Removes explicit transaction scaffolding and rollback logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'refactor: Migrate Site API handlers to WithTx' directly and clearly describes the primary change: converting three API handlers from BeginTx to the WithTx pattern.
Description check ✅ Passed The description thoroughly documents the refactoring scope, implementation details, design decisions (defer replacement, no timeoutResp needed), and error handling changes—all directly related to the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-21 20:03:11 UTC | Commit: cddd25e

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/pkg/api/handler/site.go`:
- Around line 209-218: The StatusDetail creation block (sdDAO.CreateFromParams)
currently logs errors and continues, which can commit the outer transaction
without the matching status-history row; change both the create path and the
update path (the sdDAO.CreateFromParams calls around the current block and the
similar one at lines ~535-540) to treat any derr != nil or createdSSD == nil as
fatal: return an error (e.g., wrap/return the derr or return cutil.NewAPIError)
so the WithTx transaction handler will abort/rollback the tx instead of
proceeding, ensuring atomicity of the site mutation and status-history
insertions.
- Around line 1121-1127: Replace the brittle string-match check on derr with
typed error unwrapping: when handling the error returned from
OperatorService.DeleteNamespace (the derr variable near the site deletion flow
using stStrID), use errors.As to detect a *temporal.NamespaceNotFoundError and
call logger.Warn(..., "temporal namespace not found, continuing"); for all other
errors keep the logger.Error(...).Msg and return the cutil.NewAPIError(...) as
currently done. Ensure you import "errors" and reference
temporal.NamespaceNotFoundError in the type assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5aaf7fde-c2f4-4e33-83a7-a9dfd830e487

📥 Commits

Reviewing files that changed from the base of the PR and between f813f8d and cddd25e.

📒 Files selected for processing (1)
  • api/pkg/api/handler/site.go

Comment on lines +209 to +218
// Create status detail
createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending),
cdb.GetStrPtr("received site creation request, pending pairing"))
if derr != nil {
logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
}
if createdSSD == nil {
logger.Error().Msg("Status Detail DB entry not returned from CreateFromParams")
return cutil.NewAPIError(http.StatusInternalServerError, "Failed to get new Status Detail for Site", nil)
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Abort the transaction when StatusDetail creation fails.

Both paths log CreateFromParams errors and continue. That allows the site mutation to commit without the matching status-history row, which breaks the atomicity this WithTx refactor is supposed to preserve. In the create path, the nil check only covers one return shape; the update path has no fallback at all.

Proposed fix
-		createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending),
-			cdb.GetStrPtr("received site creation request, pending pairing"))
-		if derr != nil {
-			logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
-		}
+		createdSSD, derr := sdDAO.CreateFromParams(ctx, tx, st.ID.String(), *cdb.GetStrPtr(cdbm.SiteStatusPending),
+			cdb.GetStrPtr("received site creation request, pending pairing"))
+		if derr != nil {
+			logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
+			return cutil.NewAPIError(http.StatusInternalServerError, "Failed to create Status Detail for Site", nil)
+		}
@@
-		if status != nil {
-			_, derr := sdDAO.CreateFromParams(ctx, tx, siteID.String(), *status, statusMessage)
-			if derr != nil {
-				logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
-			}
-		}
+		if status != nil {
+			_, derr := sdDAO.CreateFromParams(ctx, tx, siteID.String(), *status, statusMessage)
+			if derr != nil {
+				logger.Error().Err(derr).Msg("error creating Status Detail DB entry")
+				return cutil.NewAPIError(http.StatusInternalServerError, "Failed to create Status Detail for Site", nil)
+			}
+		}

Also applies to: 535-540

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/pkg/api/handler/site.go` around lines 209 - 218, The StatusDetail
creation block (sdDAO.CreateFromParams) currently logs errors and continues,
which can commit the outer transaction without the matching status-history row;
change both the create path and the update path (the sdDAO.CreateFromParams
calls around the current block and the similar one at lines ~535-540) to treat
any derr != nil or createdSSD == nil as fatal: return an error (e.g.,
wrap/return the derr or return cutil.NewAPIError) so the WithTx transaction
handler will abort/rollback the tx instead of proceeding, ensuring atomicity of
the site mutation and status-history insertions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai So this was pre-existing logic that got forklifted to where it is. Leave this comment unresolved though -- maybe we can make the adjustment here as part of this PR. I'm guessing we probably will.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chet — understood, thanks for the context. The fact that this was pre-existing behavior is noted. That said, now that both call sites sit inside a WithTx closure that already propagates errors to trigger a rollback, the remediation is minimal — it is literally a matter of converting the "log and continue" branches into return cutil.NewAPIError(...) calls. The WithTx scaffolding already does all the heavy lifting; the fix costs essentially nothing relative to the benefit of preserving atomicity.

Leaving the comment open as requested. If you do fold the fix into this PR, the proposed diff in the review comment is production-ready as-is.


🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.

Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.

Learnt from: thossain-nv
Repo: NVIDIA/infra-controller-rest PR: 462
File: api/pkg/api/handler/expectedswitch.go:158-214
Timestamp: 2026-05-01T21:21:59.704Z
Learning: For Expected* handler code under api/pkg/api/handler/, do NOT flag the pattern of executing Temporal workflows via common.ExecuteSyncWorkflow inside cdb.WithTx / cdb.WithTxResult transaction closures as a critical architectural issue. This codebase intentionally accepts the risk of DB commit failure after a successful Temporal call potentially diverging Site/Cloud state, based on: (1) DB transaction errors are rare, (2) certain Expected* inventory reconciliation can self-heal a Cloud rollback after a Site-side create, and (3) for deletes, IsMissingOnSite enables safe retry.

Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 494
File: api/pkg/api/handler/nvlinklogicalpartition.go:210-215
Timestamp: 2026-05-07T06:30:32.433Z
Learning: In handler files under `api/pkg/api/handler/` that are still in the team’s mid-migration state toward the `WithTx` pattern (i.e., they follow the known interim flow where `timeoutResp != nil` is checked before evaluating the transaction `err` from `cdb.WithTx` / `cdb.WithTxResult`), do not raise or flag the `timeoutResp`-ordering issue. The intended follow-up in a later pass is to surface transaction errors before handling timeout/`TerminateWorkflowOnTimeOut` for consistency; until then, treat this ordering as intentional for these in-progress handlers.

Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 562
File: api/pkg/api/handler/site.go:1121-1127
Timestamp: 2026-05-21T21:01:43.963Z
Learning: In `api/pkg/api/handler/` Go handler code, it is intentional to detect Temporal “namespace not found” errors (from `OperatorService.DeleteNamespace`) using `strings.Contains(derr.Error(), "not found")`. During code review, do not flag this as an anti-pattern and do not suggest replacing it with `errors.As` or typed Temporal error handling (e.g., `temporal.NamespaceNotFoundError`), since typed error handling would be an API-breaking change per prior review decisions.

Comment on lines +1121 to +1127
if derr != nil {
if strings.Contains(derr.Error(), "not found") {
logger.Warn().Str("Site ID", stStrID).Msg("temporal namespace not found, continuing")
} else {
logger.Error().Err(derr).Msg("error deleting Temporal namespace for Site")
return cutil.NewAPIError(http.StatusInternalServerError, "Failed to delete Site, error deleting workflow namespace", nil)
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For go.temporal.io/sdk v1.39.0 and go.temporal.io/api v1.60.0, what concrete error type or gRPC status does OperatorService.DeleteNamespace return when the namespace does not exist, and what is the recommended Go pattern to detect it instead of checking strings.Contains(err.Error(), "not found")?

💡 Result:

OperatorService.DeleteNamespace (via the Go SDK/client) returns a Temporal-specific error whose cause is temporal.NamespaceNotFoundError, i.e., the underlying gRPC status is NOT_FOUND (code 5 / google.rpc.Code). You should detect it using Go error unwrapping/type checks (errors.As) against *temporal.NamespaceNotFoundError, rather than string-matching on err.Error content. Concrete recommended pattern: err:= operatorClient.DeleteNamespace(ctx, &operatorservice.DeleteNamespaceRequest{Namespace: ns}) if err!= nil { var nsNotFoundErr temporal.NamespaceNotFoundError if errors.As(err, &nsNotFoundErr) { // namespace does not exist } else { // other error } } Evidence: Temporal Go SDK exposes NamespaceNotFoundError as a typed internal error “set as the cause when failure is due namespace not found.” [1] The Temporal Go SDK documentation recommends handling errors by type using errors.As rather than inspecting string messages. [2]

Citations:


Typed Temporal error handling instead of string matching

The current logic checks strings.Contains(derr.Error(), "not found"), which is brittle and can both miss the real “namespace missing” case and incorrectly treat unrelated failures as ignorable.

OperatorService.DeleteNamespace surfaces “namespace does not exist” as a typed Temporal error whose cause is temporal.NamespaceNotFoundError (backed by gRPC NOT_FOUND). Handle it via error type/unwrapping (e.g., errors.As) rather than inspecting message text:

if derr != nil {
	var nsNotFoundErr *temporal.NamespaceNotFoundError
	if errors.As(derr, &nsNotFoundErr) {
		logger.Warn().Str("Site ID", stStrID).Msg("temporal namespace not found, continuing")
	} else {
		logger.Error().Err(derr).Msg("error deleting Temporal namespace for Site")
		return cutil.NewAPIError(http.StatusInternalServerError, "Failed to delete Site, error deleting workflow namespace", nil)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/pkg/api/handler/site.go` around lines 1121 - 1127, Replace the brittle
string-match check on derr with typed error unwrapping: when handling the error
returned from OperatorService.DeleteNamespace (the derr variable near the site
deletion flow using stStrID), use errors.As to detect a
*temporal.NamespaceNotFoundError and call logger.Warn(..., "temporal namespace
not found, continuing"); for all other errors keep the logger.Error(...).Msg and
return the cutil.NewAPIError(...) as currently done. Ensure you import "errors"
and reference temporal.NamespaceNotFoundError in the type assertion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai Yeah we do this in a few places, and if we introduced a type it'd change our API, so this will stay as-is. I think in another PR I tried to introduce a new type for the error, and the feedback was similar.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chet, understood — acknowledged and noted. Given that the string-match is a deliberate, codebase-wide convention to avoid an API-breaking typed error, I will suppress this suggestion in future reviews.

🐇 noted


✏️ Learnings added
Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 562
File: api/pkg/api/handler/site.go:1121-1127
Timestamp: 2026-05-21T21:01:43.963Z
Learning: In `api/pkg/api/handler/site.go` (and across handler files in `api/pkg/api/handler/`), the pattern `strings.Contains(derr.Error(), "not found")` is used intentionally to detect Temporal "namespace not found" errors from `OperatorService.DeleteNamespace`. Introducing a typed error (e.g., `temporal.NamespaceNotFoundError`) would constitute an API-breaking change, and this approach was previously rejected in PR review. Do not flag or suggest replacing this string-based detection with `errors.As` / typed Temporal error handling.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.

Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.

Learnt from: thossain-nv
Repo: NVIDIA/infra-controller-rest PR: 462
File: api/pkg/api/handler/expectedswitch.go:158-214
Timestamp: 2026-05-01T21:21:59.704Z
Learning: For Expected* handler code under api/pkg/api/handler/, do NOT flag the pattern of executing Temporal workflows via common.ExecuteSyncWorkflow inside cdb.WithTx / cdb.WithTxResult transaction closures as a critical architectural issue. This codebase intentionally accepts the risk of DB commit failure after a successful Temporal call potentially diverging Site/Cloud state, based on: (1) DB transaction errors are rare, (2) certain Expected* inventory reconciliation can self-heal a Cloud rollback after a Site-side create, and (3) for deletes, IsMissingOnSite enables safe retry.

Learnt from: chet
Repo: NVIDIA/infra-controller-rest PR: 494
File: api/pkg/api/handler/nvlinklogicalpartition.go:210-215
Timestamp: 2026-05-07T06:30:32.433Z
Learning: In handler files under `api/pkg/api/handler/` that are still in the team’s mid-migration state toward the `WithTx` pattern (i.e., they follow the known interim flow where `timeoutResp != nil` is checked before evaluating the transaction `err` from `cdb.WithTx` / `cdb.WithTxResult`), do not raise or flag the `timeoutResp`-ordering issue. The intended follow-up in a later pass is to surface transaction errors before handling timeout/`TerminateWorkflowOnTimeOut` for consistency; until then, treat this ordering as intentional for these in-progress handlers.

@github-actions
Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 66 4 34 18 2 8
nico-nsm 82 2 28 43 9 0
nico-psm 67 4 35 18 2 8
nico-rest-api 100 6 53 30 3 8
nico-rest-cert-manager 65 4 34 18 1 8
nico-rest-db 66 4 34 18 2 8
nico-rest-site-agent 65 4 34 18 1 8
nico-rest-site-manager 65 4 34 18 1 8
nico-rest-workflow 67 4 35 18 2 8
TOTAL 643 36 321 199 23 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant