Skip to content

Fix GAgent durable terminal completion#657

Open
louis4li wants to merge 4 commits into
devfrom
fix/2026-05-14_gagent-durable-terminal-completion
Open

Fix GAgent durable terminal completion#657
louis4li wants to merge 4 commits into
devfrom
fix/2026-05-14_gagent-durable-terminal-completion

Conversation

@louis4li
Copy link
Copy Markdown
Contributor

Summary

Fixes #370 by replacing the placeholder durable completion resolvers for GAgent draft-run and approval interactions with a committed-fact -> projection readmodel -> query-port path.

This PR adds a GAgent run terminal readmodel and durable materialization scope that observes committed RoleChatSessionCompletedEvent facts, then maps the resulting terminal snapshot back into draft-run / approval durable completion. It also fixes the approval live observation session key so CommandId != CorrelationId still receives the continuation terminal event.

What changed

  • Added GAgentRunTerminalSnapshot, GAgentRunTerminalStatus, GAgentRunTerminalInteractionKind, IGAgentRunTerminalQueryPort, and IGAgentRunTerminalProjectionPort.
  • Added GAgentRunTerminalReadModel, metadata provider, projector, query reader, and durable projection activation port.
  • Activated terminal durable materialization before draft-run / approval dispatch.
  • Updated draft-run and approval durable resolvers to query by actorId + correlationId, then fallback to actorId + sessionId where applicable.
  • Extended approval receipts with SessionId for fallback lookup.
  • Persisted approval failure terminal facts for denied / timeout / remote failure paths so missed-live-event recovery has a committed source.
  • Made durable materialization release session-aware through IProjectionSessionScopedMaterializationContext, so session/correlation-scoped terminal materialization can be released with the same key used for activation.
  • Added design notes under docs/history/2026-05/ documenting the architecture and follow-up decisions.

Explicitly deferred

Two review points are intentionally not included in this PR:

  1. Replacing legacy approval / LLM failure markers with a typed terminal event

    • Current implementation treats RoleChatSessionCompletedEvent.Content markers as a legacy fallback only.
    • A proper typed terminal event or typed status/reason_code/reason_message proto fields would change the committed event contract and should be designed/migrated separately.
    • The PR documents that live-only terminal events and marker strings are not the long-term durable protocol.
  2. Changing ReasonCode from string to enum

    • ReasonCode is currently used for durable readmodel reason recording and completion mapping, not as a stable cross-module control-flow or filtering contract.
    • Keeping it as string avoids freezing an enum while legacy marker compatibility still exists.
    • If terminal reason later becomes API/SDK/filter/control-flow surface, it should move together with the typed terminal event work to proto enum / typed code.

Validation

Previously run on this branch:

dotnet test test/Aevatar.CQRS.Projection.Core.Tests/Aevatar.CQRS.Projection.Core.Tests.csproj --nologo --filter "ProjectionRuntimeRegistrationTests"
dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --nologo --filter "GAgentDraftRunInteraction|GAgentApprovalInteraction|GAgentRunTerminal"
dotnet test test/Aevatar.GAgentService.Integration.Tests/Aevatar.GAgentService.Integration.Tests.csproj --nologo --filter "ScopeServiceEndpointsStreamTests|GAgentServiceHostingServiceCollectionExtensionsTests"
dotnet test test/Aevatar.AI.Tests/Aevatar.AI.Tests.csproj --nologo --filter "RoleGAgentStateCoverageTests"
bash tools/ci/test_stability_guards.sh
bash tools/ci/query_projection_priming_guard.sh
bash tools/ci/projection_state_version_guard.sh
bash tools/ci/projection_state_mirror_current_state_guard.sh
bash tools/ci/projection_route_mapping_guard.sh
git diff --check
dotnet build aevatar.slnx --nologo

Also started the GAgent service locally with in-memory projection/document/graph providers and auth disabled, then exercised related scope service endpoints. The RoleGAgent draft-run failure path committed RoleChatSessionCompletedEvent and wrote GAgentRunTerminalReadModel successfully.

Notes

Full build succeeds with existing warnings from NuGet source configuration, vulnerability advisories, obsolete APIs, complexity, and nullability warnings.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 77.03549% with 110 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.15%. Comparing base (e77e3e8) to head (16eccf9).

Files with missing lines Patch % Lines
src/Aevatar.AI.Core/RoleGAgent.cs 66.19% 21 Missing and 3 partials ⚠️
...lication/ScopeGAgents/GAgentApprovalInteraction.cs 78.75% 11 Missing and 6 partials ⚠️
...lication/ScopeGAgents/GAgentDraftRunInteraction.cs 79.01% 11 Missing and 6 partials ⚠️
...rojection/Projectors/GAgentRunTerminalProjector.cs 79.26% 7 Missing and 10 partials ⚠️
...n/Orchestration/GAgentRunTerminalProjectionPort.cs 73.91% 6 Missing and 6 partials ⚠️
...Projection/Queries/GAgentRunTerminalQueryReader.cs 86.36% 4 Missing and 5 partials ⚠️
...DependencyInjection/ServiceCollectionExtensions.cs 53.33% 7 Missing ⚠️
...stractions/ScopeGAgents/GAgentRunTerminalModels.cs 63.63% 4 Missing ⚠️
.../ReadModels/ServiceProjectionReadModels.Partial.cs 50.00% 2 Missing ⚠️
...ion/Contexts/GAgentRunTerminalProjectionContext.cs 80.00% 1 Missing ⚠️
@@            Coverage Diff             @@
##              dev     #657      +/-   ##
==========================================
+ Coverage   72.13%   72.15%   +0.02%     
==========================================
  Files        1261     1267       +6     
  Lines       91065    91518     +453     
  Branches    11924    11979      +55     
==========================================
+ Hits        65691    66039     +348     
- Misses      20675    20746      +71     
- Partials     4699     4733      +34     
Flag Coverage Δ
ci 72.15% <77.03%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...on/ProjectionMaterializationRuntimeRegistration.cs 100.00% <100.00%> (ø)
....Abstractions/ScopeGAgents/GAgentDraftRunModels.cs 92.50% <100.00%> (+0.39%) ⬆️
...DependencyInjection/ServiceCollectionExtensions.cs 96.95% <100.00%> (+0.05%) ⬆️
...data/GAgentRunTerminalReadModelMetadataProvider.cs 100.00% <100.00%> (ø)
...ion/Contexts/GAgentRunTerminalProjectionContext.cs 80.00% <80.00%> (ø)
.../ReadModels/ServiceProjectionReadModels.Partial.cs 97.18% <50.00%> (-2.82%) ⬇️
...stractions/ScopeGAgents/GAgentRunTerminalModels.cs 63.63% <63.63%> (ø)
...DependencyInjection/ServiceCollectionExtensions.cs 67.33% <53.33%> (-1.56%) ⬇️
...Projection/Queries/GAgentRunTerminalQueryReader.cs 86.36% <86.36%> (ø)
...n/Orchestration/GAgentRunTerminalProjectionPort.cs 73.91% <73.91%> (ø)
... and 4 more

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@louis4li louis4li marked this pull request as ready for review May 15, 2026 03:11
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 274df47079

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +455 to +456
if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId))
snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard session fallback with the active correlation

When correlation lookup misses, this resolver accepts the first snapshot for actorId + sessionId without checking that the returned snapshot belongs to the current receipt correlation. If the same session has an older terminal document (common when users reuse a session/thread), a transient miss on the new correlation document will cause this command to be marked terminal from stale data. This can produce incorrect completion status for the wrong run; the same fallback pattern is also present in GAgentApprovalDurableCompletionResolver.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aee3b13. The durable resolvers now only accept correlation or session fallback snapshots when the snapshot matches the active receipt actor/correlation and the expected interaction kind (draft-run vs approval), so reused sessions cannot resolve from stale terminal documents. Added unit coverage for stale correlation, wrong interaction kind, and valid fallback for both draft-run and approval.

Only accept session fallback snapshots when they match the active receipt correlation and interaction kind, preventing reused sessions from resolving with stale terminal state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LastEventId = eventId,
SessionId = completed.SessionId,
CorrelationId = correlationId,
InteractionKind = (int)ResolveInteractionKind(completed.SessionId, correlationId),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Consensus: 3 models] severity=major, category=bug

ResolveInteractionKind infers Approval whenever sessionId != correlationId, but GAgentDraftRunCommandTargetBinder.ResolveSessionId returns command.SessionId.Trim() when the caller supplies one — so any draft-run with an explicit SessionId distinct from CorrelationId will be written into the readmodel as Approval. This makes InteractionKind unreliable for any downstream consumer (filtering, dashboards, future typed terminal-event mapping).

Fix: carry the interaction kind through the projection scope/context (e.g. add InteractionKind to GAgentRunTerminalProjectionContext from the activation site at GAgentDraftRunInteraction.cs / GAgentApprovalInteraction.cs) and read it from the context here, instead of inferring from session/correlation equality.

Per-model verbatim
  • codex: ResolveInteractionKind infers approval whenever SessionId != CorrelationId, but draft-run supports an explicit SessionId, so those draft-run terminal snapshots will be mislabeled as Approval. Carry the interaction kind through the activation context/port or store separate projection kinds instead of deriving it from the session/correlation equality.
  • gemini: The ResolveInteractionKind logic assumes DraftRun always has SessionId == CorrelationId. If a user provides a custom SessionId in the GAgentDraftRunCommand, this will incorrectly label the interaction as Approval. While non-critical for completion logic, it makes the InteractionKind field in the read model potentially misleading.
  • kimi: ResolveInteractionKind uses the heuristic sessionId == correlationId to distinguish DraftRun from Approval. A draft-run with an explicit SessionId (different from CorrelationId) would be misclassified as Approval. Consider documenting this assumption or deriving the kind from explicit event metadata when a typed terminal event is introduced.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 16eccf9. The terminal projection now carries interaction kind explicitly through the activation context using separate projection kinds for draft-run and approval, and the projector writes InteractionKind from context instead of deriving it from session/correlation equality. Added regression coverage for draft-run with explicit SessionId != CorrelationId.

CancellationToken ct = default);
}

public interface IGAgentRunTerminalProjectionPort
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Consensus: 2 models] severity=major, category=arch

IGAgentRunTerminalProjectionPort only exposes EnsureProjectionAsync (correlation-scoped) with no corresponding ReleaseProjectionAsync / lease return. Each draft-run / approval command currently activates a new correlation-keyed durable materialization scope (GAgentApprovalInteraction.cs:181, GAgentDraftRunInteraction.cs:236) and never releases it. Since correlation is per-command, these scopes accumulate until Orleans deactivation — defeating the point of the new IProjectionSessionScopedMaterializationContext release-aware design described in the PR summary.

Fix: return a lease (analogous to IGAgentDraftRunProjectionLease) from EnsureProjectionAsync, and have the interaction cleanup path (GAgentApprovalCommandTarget.ReleaseAsync / draft-run equivalent) release it. Alternatively, make this an actor-scoped (not correlation-scoped) materialization.

Per-model verbatim
  • codex (GAgentRunTerminalProjectionPort.cs:33): This activates a durable materialization scope keyed by correlation id, then discards the lease and exposes no release path. Because draft-run/approval create a new correlation per command, these projection actors/subscriptions can accumulate without cleanup; return a terminal projection lease from the port and release it from the interaction cleanup path, or make this an actor-scoped materialization instead of a per-correlation scope.
  • gemini (GAgentRunTerminalModels.cs:43): The interface is missing a corresponding ReleaseProjectionAsync or ReleaseIfIdleAsync method. Given the PR summary explicitly mentions making durable materialization 'release session-aware', the port should expose this capability to allow the interaction service to clean up command-scoped materializers.
  • gemini (GAgentApprovalInteraction.cs:157): The terminal projection is ensured here but never released. Since these materialization actors are now partitioned by CorrelationId (one per command), they will leak until Orleans deactivates them. The GAgentApprovalCommandTarget should hold a reference to the IGAgentRunTerminalProjectionPort and call a (yet to be exposed) release method in its ReleaseAsync implementation, similar to how the live _projectionPort is handled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 16eccf9. IGAgentRunTerminalProjectionPort now returns an IGAgentRunTerminalProjectionLease and exposes ReleaseProjectionAsync. Draft-run and approval targets retain the terminal lease and release it during cleanup, including activation-failure cleanup after terminal materialization succeeds.

{
var snapshot = await _queryPort.GetByCorrelationIdAsync(receipt.ActorId, receipt.CorrelationId, ct);
if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId))
snapshot = await _queryPort.GetBySessionIdAsync(receipt.ActorId, receipt.SessionId, ct);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[codex] severity=major, category=bug

The if (snapshot == null && !string.IsNullOrWhiteSpace(receipt.SessionId)) fallback accepts whatever GetBySessionIdAsync returns without checking that the returned snapshot's CorrelationId matches receipt.CorrelationId. If a session/thread is reused (which is the whole point of command.SessionId) and the correlation document is temporarily missing or evicted, this can complete the current command from an older terminal fact — silently delivering stale completion status.

Same pattern exists in GAgentApprovalDurableCompletionResolver (GAgentApprovalInteraction.cs:343-345).

Fix: only accept the session-fallback snapshot when its CorrelationId == receipt.CorrelationId, otherwise return Incomplete:

if (snapshot != null &&
    !string.Equals(snapshot.CorrelationId, receipt.CorrelationId, StringComparison.Ordinal))
    snapshot = null;

Apply in both resolvers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aee3b13. The fallback path now validates actor, active correlation, and expected interaction kind before accepting a session lookup result, so reused sessions cannot resolve the current command from stale terminal documents.

@@ -151,6 +159,12 @@ public async Task HandleToolApprovalDecision(ToolApprovalDecisionEvent evt)
else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[kimi] severity=major, category=bug

The approval-denied else branch calls PersistApprovalTerminalFailureAsync without a try / catch. If this throws after ClearPendingApprovalEvent is already persisted, the actor event fails and retries; on retry State.PendingApproval is already null so the method returns early, permanently losing the terminal failure fact — exactly what the durable readmodel needs.

The same gap exists in two more call sites in this file:

  • HandleToolApprovalTimeout no-remote-handler branch (around line 195)
  • HandleToolApprovalTimeout remote-failed/denied/timed-out branch (around line 249)

Only the exception path at lines 143-150 wraps it in try { … } catch { /* best effort */ }.

Fix: either wrap each call site in best-effort try/catch (matching the pattern at lines 143-150), or move the try/catch inside PersistApprovalTerminalFailureAsync itself so every caller is safe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 16eccf9. Approval denied, no-remote timeout, and remote denied/failed/timed-out paths now persist the terminal failure fact before clearing PendingApproval, preserving the durable completion source if terminal persistence fails.

@@ -0,0 +1,227 @@
using Aevatar.AI.Abstractions;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[mimo-v2.5-pro] severity=major, category=test

Missing test for monotonic version guard. CLAUDE.md projection rules require: "旧 StateVersion 不覆盖新 readmodel". Add a test that sends two committed events with the same actorId + correlationId but a descending stateVersion, then asserts the readmodel retains the higher version (or that the projector / IProjectionWriteDispatcher rejects the stale write).

Without this test, a future change to _writeDispatcher.UpsertAsync semantics (or a parallel write path) could regress the monotonic guarantee without anyone noticing — the rest of the test suite only exercises forward-version cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 16eccf9. Added a monotonic StateVersion regression test for GAgentRunTerminalProjector using the projection write evaluator semantics so an older committed event cannot overwrite a newer terminal readmodel.

Comment thread src/Aevatar.AI.Core/RoleGAgent.cs Outdated
try { await PersistDomainEventAsync(new ClearPendingApprovalEvent { RequestId = pending.RequestId }); }
catch { /* best effort */ }
try
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[gemini] severity=nit, category=other

This try { … } catch { /* best effort */ } swallows persistence failures silently. If PersistApprovalTerminalFailureAsync fails here, the durable resolver will stay Incomplete indefinitely with no operator signal. Consider logging at warning level inside the catch (include pending.SessionId, pending.RequestId, the reason code, and ex.Message) so this is observable.

Same applies to the symmetric catch the kimi comment proposes adding at the denied / timeout sites — when those wraps are added, give them logging too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 16eccf9. The approval continuation failure path now logs terminal failure persistence errors with request id, session id, and reason code, and logs pending-clear failures after the terminal fact has been persisted.

Carry terminal interaction kind explicitly through materialization, release correlation-scoped leases, and persist approval terminal failures before clearing pending state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@louis4li
Copy link
Copy Markdown
Contributor Author

Addressed the review feedback in 16eccf9.

Summary:

  • Carry terminal interaction kind explicitly through projection activation/context and use separate terminal projection kinds for draft-run vs approval.
  • Return/release terminal projection leases from draft-run and approval cleanup paths.
  • Persist approval terminal failure facts before clearing pending approval for denied/timeout/remote-failure paths, with logged best-effort handling for continuation failures.
  • Added regression coverage for explicit draft-run session ids, terminal lease cleanup, approval failure ordering, and monotonic StateVersion writes.

Verification:

  • dotnet test test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj --no-restore --nologo -v minimal: 381 passed
  • dotnet test test/Aevatar.AI.Tests/Aevatar.AI.Tests.csproj --no-restore --filter "RoleGAgent" --nologo -v minimal: 34 passed
  • tools/ci/query_projection_priming_guard.sh: passed
  • tools/ci/projection_state_version_guard.sh: passed

Note: full solution build is still blocked by existing Aevatar.Studio.Hosting missing AI abstraction references, unrelated to these changes.

@eanzhao
Copy link
Copy Markdown
Contributor

eanzhao commented May 15, 2026

Fix-check follow-up (multi-model)

Re-ran the 6-model review against the post-fix HEAD. All 7 inline comments are majority-resolved (5/6 or 6/6) — but three threads have credible dissenting verdicts worth a look before this merges.

# thread majority dissent
1 RoleGAgent.cs:144 (silent catch logging) ✅ resolved 5/6 mimo → partial
2 RoleGAgent.cs:152 (denied/timeout try/catch) ✅ resolved 5/6 mimo → partial
3 GAgentRunTerminalModels.cs:52 (release lease) ✅ resolved 5/6 codex → partial (verified)
4 GAgentDraftRunInteraction.cs:456 (codex-bot session guard) ✅ resolved 6/6
5 GAgentDraftRunInteraction.cs:456 (session guard) ✅ resolved 6/6
6 GAgentRunTerminalProjector.cs:69 (InteractionKind) ✅ resolved 6/6
7 GAgentRunTerminalProjectorTests.cs:1 (monotonic test) ✅ resolved 6/6

Dissents

#3 — Terminal projection lease is never released on success (codex; verified)

The fix adds IGAgentRunTerminalProjectionLease + ReleaseProjectionAsync, and releases the lease in two places:

  • CleanupAfterDispatchFailureAsync on both GAgentDraftRunCommandTarget and GAgentApprovalCommandTarget
  • The binder's exception catch block

But ICommandDispatchCleanupAware.CleanupAfterDispatchFailureAsync is only invoked from DefaultCommandDispatchPipeline.DispatchPreparedAsync's catch block (src/Aevatar.CQRS.Core/Commands/DefaultCommandDispatchPipeline.cs:90). On successful dispatch, no cleanup hook fires, so the bound TerminalProjectionLease is never released for successful commands. Since each command activates a new correlation-scoped scope, the original accumulation concern persists for the happy path.

Possible directions:

  • Add a separate success-path cleanup hook (e.g. CleanupAfterDispatchSuccessAsync) and release there.
  • Release the terminal lease inline once the durable resolver observes a terminal snapshot (since at that point the readmodel is already materialized).
  • Switch to actor-scoped (not correlation-scoped) terminal materialization, so there's a bounded number of scopes regardless of command volume.

#1 & #2 — Denied/timeout branches use the non-Try* wrapper (mimo)

Literally accurate, but the author chose an ordering-based fix (PersistApprovalTerminalFailureThenClearPendingAsync persists terminal before clearing pending), which prevents the data-loss scenario the original comment described: if terminal-persist throws, clear-pending hasn't run, so retry re-enters with PendingApproval still set. 5/6 reviewers — including the original commenter (kimi) — judged this adequate. Worth a sanity check, but probably fine as a design choice.

Reviewer panel

All 6 (deepseek-v4-pro, glm-5.1, mimo-v2.5-pro, kimi, gemini, codex) returned 7/7 valid verdicts this round. No timeouts.

@louis4li
Copy link
Copy Markdown
Contributor Author

I checked the #3 dissent. I think this is already covered by the interaction lifecycle, so I do not think another code change is needed here.

DefaultCommandDispatchPipeline only calls CleanupAfterDispatchFailureAsync on dispatch failure, but draft-run / approval do not rely on that hook for the successful interaction path. Both command targets also implement ICommandInteractionCleanupTarget, and their ReleaseAfterInteractionAsync(...) calls the same ReleaseAsync(...) path that releases TerminalProjectionLease.

DefaultCommandInteractionService.ExecuteAsync(...) invokes target.ReleaseAfterInteractionAsync(...) from finally after live completion, durable fallback, or interaction failure. The GAgent draft-run / approval services are registered through DefaultCommandInteractionService, and the endpoints call interactionService.ExecuteAsync(...), so successful commands do release the terminal projection lease.

So I read CleanupAfterDispatchFailureAsync as the pre-interaction dispatch-failure fallback; the normal successful command cleanup is handled by the interaction service finally block.

{
try
{
await _terminalProjectionPort.ReleaseProjectionAsync(terminalProjectionLease, ct);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a draft run enters pending approval, the live stream has already emitted TextMessageEnd, so interaction cleanup treats this interaction as observedCompleted=true and releases the TerminalProjectionLease. However, the RoleChatSessionCompletedEvent failure fact for approval local timeout, remote denied, or remote timeout is a later actor event and can be committed after the projection scope has already been Released. At that point, ProjectionScopeGAgentBase.HandleObservedEnvelopeAsync returns immediately, so GAgentRunTerminalProjector never materializes a GAgentRunTerminalReadModel. The actor has committed the terminal failure fact, but the durable read model layer cannot observe it, and recovery after a missed live terminal event can read an incomplete or stale state.

legacyMessage);
}

if (content.StartsWith(LegacyLlmFailedPrefix, StringComparison.Ordinal))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normal provider / LLM exception path in RoleGAgent writes LLM request failed [tools=...]: ... when useWorkflowFailureMarker=false, but GAgentRunTerminalProjector.ResolveTerminal only recognizes [[AEVATAR_LLM_ERROR]] and the exact prefix LLM request failed:. Failure content that includes [tools=...] falls through to the default branch and is materialized as TextMessageCompleted, so the durable resolver later recovers a real failure as completed. This path persists provider / LLM exceptions as successful terminal states.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement durable terminal completion resolution for draft-run and approval interactions

3 participants