Callback for workflow update support #9614
Conversation
bergundy
left a comment
I think we need just one more round here. For when updates are already completed, let's make sure to generate the new link type we discussed server-side.
func (l *Library) Components() []*chasm.RegistrableComponent {
	return []*chasm.RegistrableComponent{
		chasm.NewRegistrableComponent[*Workflow](chasm.WorkflowComponentName),
		chasm.NewRegistrableComponent[*WorkflowUpdate](chasm.WorkflowUpdateComponentName),
Given that workflow update is tightly coupled to workflows, it makes total sense to put them in the same library.
*workflowpb.UpdateState

// MSPointer is a special in-memory field for accessing the underlying mutable state.
chasm.MSPointer
This was only supposed to be embedded in the top level Workflow component but I can see why you'd want to access it here. No strong opinion because either way this would be a workaround. I wonder though if you need to embed this or if it'd be better to make it a named field.
It was embedded in the workflow component, so I made it embedded here too.
If it's not embedded, then it would also need to be an exported field; otherwise CHASM tree deserialization will not work. To keep the convention consistent, embedding is probably fine here.
)
MaxCallbacksPerUpdateID = NewNamespaceIntSetting(
	"system.maxCallbacksPerUpdateID",
	32,
I think limiting all of the workflow callbacks, regardless of what component they're attached to makes more sense than a per component limit due to the fact that the entire tree needs to be loaded into memory when mutable state is accessed today.
I limited all workflow callbacks as well. I added this per-update limit on top of that, to keep one update from using up the entire workflow-wide callback budget.
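To make the two-level limit concrete, here's a minimal Go sketch of a workflow-wide cap plus a per-update cap; the function name, constants, and limit values are illustrative assumptions, not the actual server config:

```go
package main

import "fmt"

// Hypothetical sketch of the two-level limit discussed above: a workflow-wide
// cap on all callbacks plus a per-update cap, so a single update cannot
// exhaust the workflow's entire callback budget. Values are illustrative.
const (
	maxCallbacksPerWorkflow = 64
	maxCallbacksPerUpdateID = 32
)

func checkCallbackLimits(workflowTotal, updateTotal, adding int) error {
	if workflowTotal+adding > maxCallbacksPerWorkflow {
		return fmt.Errorf("workflow callback limit exceeded: %d + %d > %d",
			workflowTotal, adding, maxCallbacksPerWorkflow)
	}
	if updateTotal+adding > maxCallbacksPerUpdateID {
		return fmt.Errorf("per-update callback limit exceeded: %d + %d > %d",
			updateTotal, adding, maxCallbacksPerUpdateID)
	}
	return nil
}

func main() {
	fmt.Println(checkCallbackLimits(10, 2, 4))  // nil: both limits satisfied
	fmt.Println(checkCallbackLimits(62, 0, 4))  // workflow-wide limit trips
	fmt.Println(checkCallbackLimits(40, 30, 4)) // per-update limit trips
}
```

Both checks run against the same increment, so either cap can reject the attach independently.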
stephanos
left a comment
Only made it half-way through so far; but figured I can send my first review comments now.
links []*commonpb.Link,
identity string,
priority *commonpb.Priority,
workflowUpdateOptions map[string]*historypb.WorkflowExecutionOptionsUpdatedEventAttributes_WorkflowUpdateOptionsUpdate,
I know it's not wrong, but ... WorkflowUpdateOptionsUpdate 😬
(non-blocking; just noticing)
Yeah I agree
Force-pushed a453230 to 09ac27a
// - The event will be written atomically with acceptance
// If the Update struct is lost (registry cleared), the abort mechanism fires
// registryClearedErr on the caller's future, prompting an immediate retry.
if u.state == stateAdmitted || u.state == stateSent {
Added handling for stateAdmitted; it should be the same as stateSent but return false, nil since, IIUC, the caller still needs to create the speculative WFT at this stage.
Force-pushed 09ac27a to 9de5339
Made some updates to bring this up to latest. The only logical changes are in the top commit -- handling
Force-pushed 9de5339 to 4b0915d
Force-pushed 8551a4f to 3ae1202
Force-pushed 3ae1202 to 2ce7339
## What changed?
When we set the Nexus callback URL in test_env.go, the dynamic config override is still tied to the test's lifetime, not the cluster's lifetime, so a subsequent test that reuses this cluster will not have that override. Moving the override to onebox.go (a similar pattern to #9918) so this default lives for the lifetime of the cluster.

## Why?
Ran into an issue with the task token not being set in #9614; this solves it. Breaking the fix out into a separate PR for ease of review + checking this in first.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Added a `createExternalNexusServer(...)` which sets up an external Nexus endpoint with a user-provided handler and listens on a provided address. This is used in nexus_workflow_test.go and will be used more in #9614.

Opportunistically did a couple more drive-by refactors/consistency fixes, specifically:
* Force the user to provide `ctx` to the endpoint creation functions instead of making a new `ctx`
* Use `env.Context()` instead of `testcore.NewContext()` in all suites that I touched here

## Why?
Pulling changes out of #9614 into targeted PRs to reduce load on reviewers.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
// TODO (alex-update): This method is noop because we don't currently write rejections to the history.
return nil
func (ms *MutableStateImpl) RejectWorkflowExecutionUpdate(updateID string, wfFailure *failurepb.Failure) error {
	if !ms.chasmCallbacksEnabled() {
elsewhere we use chasmEnabled; why use chasmCallbacksEnabled here? if it's intentional, a comment would be helpful.
// but update callbacks must fire now because the update was aborted on the old run.
func (w *Workflow) ProcessAllUpdateCloseCallbacks(ctx chasm.MutableContext) error {
	for _, updateField := range w.Updates {
		if err := callback.ScheduleStandbyCallbacks(ctx, updateField.Get(ctx).Callbacks); err != nil {
Is my understanding correct that once this returns an error; the entire attempt to schedule callbacks is aborted and retried from the top? We wouldn't want partial results.
Right, I think it's better to abort here and retry -- it seems like it'd be harder to reason about if we only partially succeed for some updates.
if len(wf.Updates) == 0 {
	return nil
}
Is this a meaningful perf optimization?
Seems like for now it still has some perf implications, based on offline discussion -- added TODOs to clean up this read-to-write upgrade once CHASM natively detects mutation.
// flushPendingCallbacks writes one WorkflowExecutionOptionsUpdatedEvent per
// buffered AttachCallbacks callback, skipping any whose requestID is already persisted.
// Called from onAcceptanceMsg after the acceptance event has been written.
This line makes it sound like it's only called from onAcceptanceMsg but that's not true.
replying for posterity from offline chat -- it should only be called from onAcceptanceMsg (not on rejection)
"go.temporal.io/server/common/nexus/nexusrpc"
)

type WorkflowUpdate struct {
I think it's not yet clearly documented anywhere that the semantics of a rejected Update with vs without callbacks are different now. If an update has callbacks it will incur a write on rejection now where it didn't before.
I'd update docs/architecture/workflow-update.md and add comments across the update package to clarify that (e.g. pendingCallbacks, onRejectionMsg).
If an update has callbacks it will incur a write on rejection now where it didn't before.
That shouldn't be true; the callback is not used for rejection normally (the speculative update case). It should only be used if the update was durably admitted.
I see onRejectionMsg has eventStore EventStore and if the update is in "sent" or "admitted" (ie before "acceptance") it invokes flushPendingCallbacks which - if there are callbacks - invokes AddWorkflowExecutionOptionsUpdatedEvent which AFAICT is a write? And then RejectWorkflowExecutionUpdate also adds another one, no?
Modified workflow-update.md to capture the expected buffering -> persist behavior for updates (tl;dr: callbacks are buffered when updates get admitted/sent, persisted on acceptance, but not on rejection). Essentially the same as before, with the CHASM persistence added so we can attach completion callbacks.
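The buffer-on-admit, flush-on-accept, drop-on-reject lifecycle described here can be sketched roughly as follows; the struct, state names, and methods are illustrative stand-ins, not the actual update package:

```go
package main

import "fmt"

// Hedged sketch of the callback lifecycle: buffered while the update is
// admitted/sent, flushed (persisted) on acceptance, dropped without a
// history write on rejection. Not the real server types.
type updateState int

const (
	stateAdmitted updateState = iota
	stateSent
	stateAccepted
	stateRejected
)

type update struct {
	state     updateState
	buffered  []string // callback request IDs, not yet persisted
	persisted []string
}

// onAccepted models flushPendingCallbacks: buffered callbacks become durable.
func (u *update) onAccepted() {
	u.state = stateAccepted
	u.persisted = append(u.persisted, u.buffered...)
	u.buffered = nil
}

// onRejected drops the buffer: no write for rejected speculative updates.
func (u *update) onRejected() {
	u.state = stateRejected
	u.buffered = nil
}

func main() {
	u := &update{state: stateSent, buffered: []string{"req-1"}}
	u.onAccepted()
	fmt.Println(len(u.persisted), len(u.buffered)) // 1 0
}
```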
Force-pushed 239cb1f to 72b65be
type CallbackParent interface {
	RemoveCallback(ctx chasm.MutableContext, c *Callback)
}
@stephanos LMK if this is an idiomatic pattern (self-cleanup of callbacks via optional interface)
AFAIK there's no established pattern for this yet. cc @yycptt
Squashed these commits, left for posterity:
- Add Nexus Workflow Update
- Update from rebase
- Fix sent state
- Cleanup
- Fix lint
- Fix more CI
- fix
- Review clean up
- Try suggestions from the review skill
- Fix some tests
- Add TODO for rejected event
- Remove .omc from gitignore
- Respond to PR comments
- Add NS Capability for this feature
- Respond to PR comments
- Update API
Force-pushed 72b65be to 4b7757c
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 579442b.
maxCallbacksPerUpdateID,
updateID,
currentCallbackCount,
)
Duplicate callbacks can exceed configured limits
Medium Severity
AddUpdateCompletionCallbacks enforces limits before deduplicating existing callback IDs. When the same callbacks are re-applied for an update (for example admission plus acceptance replay paths), valid idempotent re-registration can fail with FailedPrecondition even though no new callbacks would be added.
Additional Locations (2)
Right, good spot. It seems a bit clunky to assert this (see countNewCallbacks(...)), but probably better than not.
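A minimal sketch of the dedupe-before-limit idea Bugbot suggests: count only callbacks whose IDs are not already registered, so idempotent re-registration (e.g. admission plus acceptance replay) never trips the limit. The countNewCallbacks signature here is a hypothetical stand-in, not the exact server helper:

```go
package main

import "fmt"

// countNewCallbacks returns how many of the incoming callback IDs are not
// already registered. The limit check should use this count rather than
// len(incomingIDs), so re-applying the same callbacks is a no-op.
func countNewCallbacks(existing map[string]struct{}, incomingIDs []string) int {
	n := 0
	for _, id := range incomingIDs {
		if _, ok := existing[id]; !ok {
			n++
		}
	}
	return n
}

func main() {
	existing := map[string]struct{}{"cb-1": {}, "cb-2": {}}
	// Re-applying cb-1/cb-2 plus one genuinely new callback counts as 1, not 3,
	// so a replayed attach cannot spuriously exceed the configured limit.
	fmt.Println(countNewCallbacks(existing, []string{"cb-1", "cb-2", "cb-3"}))
}
```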
Reference: &commonpb.Link_WorkflowEvent_RequestIdRef{
	RequestIdRef: &commonpb.Link_WorkflowEvent_RequestIdReference{
		RequestId: requestID,
		EventType: enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_UPDATE_ACCEPTED,
Update backlink may reference wrong request id
Low Severity
The response link uses u.req...RequestId instead of the accepted update’s persisted identifier. For duplicate update_id calls or requests without request_id, Link_WorkflowEvent.RequestIdRef can be empty or mismatched, producing a backlink that does not correspond to the actual accepted event.
func (u *Update) AcceptedEventID() int64 {
	return u.acceptedEventID
}
Unused exported update accessor added
Low Severity
AcceptedEventID is newly exported but has no callers in the repository. This adds dead public surface to update.Update, which increases maintenance burden and can mislead future code into relying on an accessor that currently has no supported use.
This connects to my comment about Link_WorkflowEvent_EventRef
eq(msg.GetMeta().GetUpdateId(), prefix+"meta.update_id", updateID, updateID, msg),
notZero(msg.GetInput(), prefix+"input", msg),
notZero(msg.GetInput().GetName(), prefix+"input.name", msg),
callbacksRequireRequestID(msg),
Rejection failure is not validated
Medium Severity
validateRejectionMsg only checks that the rejection message exists, not that rejection.failure is present. A nil failure is then forwarded through RejectWorkflowExecutionUpdate, so update callbacks lose the rejection payload and can resolve via the wrong completion path.
Additional Locations (1)
if len(request.GetRequest().GetRequestId()) > wh.config.MaxIDLengthLimit() {
	return errRequestIDTooLong
}

	return nil, err
}
resp := u.CreateResponse(u.wfKey, status.Outcome, status.Stage)
// Attach a link to the response. For accepted/completed updates, use a WorkflowEvent link
nit: let's give this code block a newline above and below to breathe
Namespace: u.req.Request.Namespace,
WorkflowId: u.wfKey.WorkflowID,
RunId: u.wfKey.RunID,
Reference: &commonpb.Link_WorkflowEvent_RequestIdRef{
I'm new to these link events, but why wouldn't we use Link_WorkflowEvent_EventRef since we have an EventId (i.e. AcceptedEventID)?
}
if got := workflowEvent.GetRequestIdRef().GetEventType(); got != enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_UPDATE_ACCEPTED {
	return nil, nexus.NewHandlerErrorf(nexus.HandlerErrorTypeInternal, "expected event type UPDATE_ACCEPTED, got %v", got)
}
Should this verify that it "points to the accepted event"?
s.NoError(err)

// hasUpdateChasmNode reports whether any CHASM node path references the given update ID.
hasUpdateChasmNode := func(desc *adminservice.DescribeMutableStateResponse) bool {
I'm not sure about the functional/acceptance tests reaching that deep into the server internals. It couples the test too much to the impl details for me. Maybe this should be a unit/integration test inside the CHASM package instead?
// from saveResult on terminal transition.
func (u *WorkflowUpdate) RemoveCallback(ctx chasm.MutableContext, c *callback.Callback) {
	for id, field := range u.Callbacks {
		if field.Get(ctx) == c {
side note; I was surprised that this is safe. But it appears to be safe. I didn't think that comparing the pointers here would work in CHASM, but apparently that's how it works. I'm not sure we can rely on that, though?
func (u *WorkflowUpdate) RemoveCallback(ctx chasm.MutableContext, c *callback.Callback) {
	for id, field := range u.Callbacks {
		if field.Get(ctx) == c {
			delete(u.Callbacks, id)
I'm torn between paying the runtime cost of iterating here -- given that the map key is essentially "unrecoverable", as it's based on information that's thrown away again -- versus storing the key (or part of it) so this becomes an O(1) lookup. I acknowledge there likely won't be many callbacks here, though, so I suppose the runtime loop is fine.
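For reference, a minimal sketch of the O(1) alternative: store the map key on the callback at registration time so removal doesn't need to scan the map comparing pointers. The types here are simplified stand-ins for the CHASM ones:

```go
package main

import "fmt"

// Callback keeps its own map key so the parent can delete it in O(1)
// without iterating and comparing pointers. Simplified stand-in types.
type Callback struct {
	ID string // key under which this callback lives in the parent map
}

type WorkflowUpdate struct {
	Callbacks map[string]*Callback
}

// AddCallback records the key on the callback as it is registered.
func (u *WorkflowUpdate) AddCallback(id string, c *Callback) {
	c.ID = id
	u.Callbacks[id] = c
}

// RemoveCallback deletes by the stored key instead of scanning the map.
func (u *WorkflowUpdate) RemoveCallback(c *Callback) {
	delete(u.Callbacks, c.ID)
}

func main() {
	u := &WorkflowUpdate{Callbacks: map[string]*Callback{}}
	cb := &Callback{}
	u.AddCallback("req-1_0", cb)
	u.RemoveCallback(cb)
	fmt.Println(len(u.Callbacks)) // 0
}
```

The trade-off is one extra field per callback versus a loop whose cost is bounded by the (small) per-update callback limit.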
// update-level scheduling are independent: failure of one does not stop the
// other; the errors are joined.
func (w *Workflow) ScheduleCloseCallbacks(ctx chasm.MutableContext) error {
	wfErr := callback.ScheduleStandbyCallbacks(ctx, w.Callbacks)
nit: maybe a softassert invariant error if the workflow isn't closed?
Needed to expose a function from ms_pointer.go to check state, but maybe that's ok.


What changed?
Added support for Nexus workflow update completion callbacks via CHASM. This allows a Nexus caller to be notified when a workflow update completes by attaching completion callbacks to the update request.
Why?
Nexus operations that target workflow updates need a way to receive completion notifications. Without this, a Nexus caller that sends an update has no async mechanism to learn when the update finishes. Completion callbacks enable the same async notification pattern that already exists for workflow-level Nexus operations.
How did you test it?
Potential risks
Touches speculative workflow updates, which are always hard to reason about. Tried to compensate with lots of test coverage.
Note: Needs this API PR https://github.com/temporalio/api/pull/742/changes
Note

High Risk
Touches workflow update state transitions, mutable state/history event application, and callback scheduling paths (including retry/continue-as-new), which can affect correctness of update outcomes and callback delivery. Also adds a `go.mod` `replace` for `go.temporal.io/api`, increasing dependency and compatibility risk.

Overview
Adds CHASM-backed completion callbacks for workflow updates: update requests can register Nexus `completion_callbacks`, persist them via `WorkflowExecutionOptionsUpdated`/update events, and schedule them when the update completes, the workflow closes, or the run continues.

Introduces a new per-update CHASM component (`WorkflowUpdate` + `UpdateState` proto) with opt-in self-cleanup of terminal callbacks, plus new mutable-state support to fetch update completion data (`GetNexusUpdateCompletion`) and to fire callbacks on update completion/rejection and on close paths.

Extends update handling to validate `request_id` when callbacks are present, buffer callbacks pre-acceptance, persist/dedup them on acceptance using `WorkflowExecutionOptionsUpdatedEventAttributes.WorkflowUpdateOptions`, and return clearer response backlinks (workflow-event link for accepted updates, workflow link for validator rejections).

Reviewed by Cursor Bugbot for commit 579442b.