
feat(temporal): add worker-side payload encryption#2297

Merged
daryllimyt merged 39 commits into main from feat/temporal-breakglass-codec-server on Apr 21, 2026
Contributor

@daryllimyt daryllimyt commented Mar 8, 2026

Checklist

  • Read CONTRIBUTING.md.
  • PR title is short and non-generic (see previously merged PRs for examples).
  • PR only implements a single feature or fixes a single bug.
  • Tests passing (uv run pytest tests)?
  • Lint / pre-commits passing (pre-commit run --all-files)?

Description

This PR adds worker-side Temporal payload encryption without relying on an external codec server. When enabled, the Temporal data converter composes compression with AES-GCM encryption, scopes keys by workspace, and keeps decode marker-driven so historical payloads remain readable across configuration changes.
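A minimal sketch of what "marker-driven decode" could look like, using only the compression layer. The `Payload` class, the `GZIP_MARKER` value, and the `inner-encoding` metadata key are hypothetical stand-ins, not the names used in this PR. The AES-GCM layer is left out because it needs a third-party library; it would presumably follow the same pattern with its own marker.

```python
import gzip
from dataclasses import dataclass, field

# Hypothetical marker value; the PR does not state the real one.
GZIP_MARKER = b"binary/gzip"


@dataclass
class Payload:
    """Simplified stand-in for a Temporal payload (metadata plus data)."""

    data: bytes
    metadata: dict[str, bytes] = field(default_factory=dict)


def encode(payload: Payload, compression_enabled: bool) -> Payload:
    """Compress only when the current configuration asks for it."""
    if not compression_enabled:
        return payload
    metadata = {
        **payload.metadata,
        "encoding": GZIP_MARKER,
        # Remember the original encoding so decode can restore it.
        "inner-encoding": payload.metadata.get("encoding", b""),
    }
    return Payload(gzip.compress(payload.data), metadata)


def decode(payload: Payload) -> Payload:
    """Decide from the marker alone, never from the current configuration."""
    if payload.metadata.get("encoding") != GZIP_MARKER:
        return payload  # unmarked payloads pass through untouched
    metadata = dict(payload.metadata)
    metadata["encoding"] = metadata.pop("inner-encoding")
    return Payload(gzip.decompress(payload.data), metadata)
```

Because `decode` never consults the enable flag, a payload written while compression was on stays readable after it is turned off, and a payload written while it was off is returned unchanged.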

Today's updates refine the rollout shape:

  • Remove the Temporal codec server API route and router tests.
  • Add versioned payload keyring support from environment JSON or AWS Secrets Manager, with cached keyring retrieval and memoized codec construction.
  • Scope encryption configuration and keyring secret access to Temporal services in Docker Compose and Fargate.
  • Harden history, agent memo, and child workflow memo decode paths so failures are logged and tolerated instead of breaking workflow inspection.
  • Update unit coverage for keyring versions, codec caching, converter decode behavior, and memo decode failure handling.
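The "logged and tolerated" decode behavior can be sketched roughly as follows. `UnreadablePayload` and `decode_tolerant` are hypothetical names for illustration; the sketch assumes JSON payloads and is not the code in this PR.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class UnreadablePayload:
    """Hypothetical placeholder returned when a payload cannot be decoded."""

    encoding: str
    size: int


def decode_tolerant(data: bytes, encoding: str) -> Any:
    """Decode a JSON payload; log and return a placeholder on failure."""
    try:
        return json.loads(data)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Log only the error type so payload contents never reach the logs.
        logger.warning("payload decode failed: %s", type(exc).__name__)
        return UnreadablePayload(encoding=encoding, size=len(data))
```

Returning a value instead of raising lets a history view render the events it can read and show a marker for the ones it cannot.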

Related Issues

N/A

Screenshots / Recordings

N/A. Backend, Temporal, and infrastructure change only.

Steps to QA

  1. Run uv run pytest tests/unit/test_temporal_codec.py tests/unit/test_dsl_converter.py tests/temporal/test_workflow_timers.py.
  2. Run uv run ruff check tracecat/temporal/codec.py tracecat/dsl/common.py tracecat/config.py tracecat/api/app.py tests/unit/test_temporal_codec.py tests/unit/test_dsl_converter.py tests/temporal/test_workflow_timers.py.
  3. Run uv run basedpyright tracecat/temporal/codec.py tracecat/dsl/common.py tracecat/config.py tracecat/api/app.py.

Summary by cubic

Adds worker-side, fail-open AES‑GCM encryption for Temporal payloads with per-workspace keys and marker-driven decode; no codec server required. Off by default; enable with TEMPORAL__PAYLOAD_ENCRYPTION_ENABLED=true and provide a keyring via TEMPORAL__PAYLOAD_ENCRYPTION_KEYRING or TEMPORAL__PAYLOAD_ENCRYPTION_KEYRING_ARN.
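As an illustration of the environment-based configuration, the sketch below reads the enable flag and a JSON keyring. The two environment variable names come from the summary above; the JSON shape (`current` plus a `keys` map of base64 strings), the `Keyring` class, and the validation rules are assumptions made for this example. The ARN path via AWS Secrets Manager is not covered.

```python
import base64
import json
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Keyring:
    """Hypothetical versioned keyring: one current version, many readable ones."""

    current_version: str
    keys: Mapping[str, bytes]

    def key_for(self, version: str) -> bytes:
        """Look up the key a payload was encrypted with."""
        return self.keys[version]


def load_keyring(env: Mapping[str, str]) -> Optional[Keyring]:
    """Return the keyring when encryption is enabled, else None."""
    if env.get("TEMPORAL__PAYLOAD_ENCRYPTION_ENABLED", "false").lower() != "true":
        return None
    raw = env.get("TEMPORAL__PAYLOAD_ENCRYPTION_KEYRING")
    if raw is None:
        raise ValueError("encryption enabled but no keyring provided")
    # Assumed shape: {"current": "v2", "keys": {"v1": "<base64>", "v2": "<base64>"}}
    doc = json.loads(raw)
    keys = {version: base64.b64decode(key) for version, key in doc["keys"].items()}
    if doc["current"] not in keys:
        raise ValueError("current key version missing from keyring")
    for version, key in keys.items():
        if len(key) not in (16, 24, 32):  # valid AES key lengths in bytes
            raise ValueError(f"key {version} has an invalid AES key length")
    return Keyring(current_version=doc["current"], keys=keys)
```

Keeping older versions in the `keys` map is what would let payloads encrypted before a rotation remain decryptable, while new payloads use `current_version`.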

  • New Features

    • Composite codec: optional compression (zstd/gzip/brotli) then AES‑GCM; decode is marker‑driven and always tries both codecs so old payloads stay readable; fail‑open passthrough with minimal error logging; validates workspace scope and nonce; workspace derived from ctx_role.workspace_id (falls back to __global__).
    • Key management: async, versioned keyring from env JSON or AWS Secrets Manager (via asyncio.to_thread); cachetools TTL cache with configurable TTL/max items; memoized codec factory with test reset helpers.
    • Data converter/decoding: shared codec on all decode paths via decode_payloads; uses encoded‑attributes failure converter when encryption is enabled; agent/child memo parsing is async and tolerant.
    • Infra: Terraform/IAM policy for temporal_payload_encryption_keyring_arn; scope Temporal secrets to worker/agent tasks; Docker Compose exposes encryption env vars; new Fargate vars for cache TTL/max items; removed codec server endpoint.
  • Bug Fixes

    • Historical decode reliability: always include both codecs on the decode path regardless of current config; removed init‑time compression algorithm validation and reject invalid algorithms on encode only.
    • Keyring robustness: wrap AWS fetch and binary keyring decode errors, preserve last‑known‑good cache on refresh failures, revalidate derived keys, back off failed refreshes, and add double‑checked locking for safe concurrent cache access.
    • Encode stability: derive the encryption key for encode from a single keyring snapshot to avoid mixed versions during refresh.
    • Workflow failures: decode encoded workflow failure messages and cause chains in history/compact views; switch to async failure parsing.
    • Unreadable payloads: surface a structured UnreadableTemporalPayload placeholder (includes encoding and size) in histories and compact views instead of raising; skip trigger context resolution for these; update client schemas/types to accept the placeholder.
    • Misc: fixed ARN handling in Alembic env secret retrieval; timer tests ignore cooperative yields; platform registry tests use UUID‑based origins.
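The keyring robustness fixes (last-known-good cache, refresh backoff, double-checked locking) can be sketched as below. `CachedKeyring` and its parameters are hypothetical; the PR itself uses a cachetools TTL cache, which this stdlib-only sketch does not reproduce.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class CachedKeyring:
    """TTL cache that keeps the last good value when a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[dict]],
        ttl: float,
        backoff: float,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._backoff = backoff
        self._value: Optional[dict] = None
        self._next_refresh = 0.0  # monotonic deadline for the next fetch attempt
        self._lock = asyncio.Lock()

    def _is_fresh(self) -> bool:
        return self._value is not None and time.monotonic() < self._next_refresh

    async def get(self) -> dict:
        # First check without the lock: the common fresh-cache path.
        if self._is_fresh():
            return self._value
        async with self._lock:
            # Second check: another task may have refreshed while we waited.
            if self._is_fresh():
                return self._value
            try:
                self._value = await self._fetch()
                self._next_refresh = time.monotonic() + self._ttl
            except Exception:
                if self._value is None:
                    raise  # nothing cached yet, so the failure must surface
                # Keep serving the stale value and delay the next attempt.
                self._next_refresh = time.monotonic() + self._backoff
            return self._value
```

With this shape, an outage of the secrets backend only causes an error on a cold start; once a keyring has been loaded, callers keep getting it and the backend is retried at most once per backoff window.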

Written for commit 7b894af. Summary will update on new commits.

Contributor Author

daryllimyt commented Mar 8, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@daryllimyt daryllimyt force-pushed the feat/temporal-agent-history-encryption branch from c5e5d08 to 35c2788 Compare March 8, 2026 01:24
@daryllimyt daryllimyt force-pushed the feat/temporal-breakglass-codec-server branch 2 times, most recently from 2e221d2 to b356b85 Compare March 8, 2026 01:45
@daryllimyt daryllimyt force-pushed the feat/temporal-agent-history-encryption branch from 35c2788 to 2ad828b Compare March 8, 2026 01:45
@daryllimyt daryllimyt changed the base branch from feat/temporal-agent-history-encryption to main March 18, 2026 18:05
@daryllimyt daryllimyt force-pushed the feat/temporal-breakglass-codec-server branch from b356b85 to d1915a5 Compare March 18, 2026 18:05
@daryllimyt daryllimyt changed the base branch from main to feat/temporal-agent-history-encryption March 18, 2026 18:33
@daryllimyt daryllimyt changed the base branch from feat/temporal-agent-history-encryption to main March 18, 2026 18:38
@daryllimyt daryllimyt force-pushed the feat/temporal-breakglass-codec-server branch from d1915a5 to b367063 Compare March 18, 2026 18:38
zeropath-ai Bot commented Mar 30, 2026

No security or compliance issues detected. Reviewed everything up to 7b894af.

Security Overview
Detected Code Changes

The diff is too large to display a summary of code changes.

@daryllimyt daryllimyt temporarily deployed to internal-registry-ci April 1, 2026 20:36 — with GitHub Actions Inactive
@daryllimyt daryllimyt temporarily deployed to internal-registry-ci April 2, 2026 14:28 — with GitHub Actions Inactive
@daryllimyt daryllimyt temporarily deployed to internal-registry-ci April 2, 2026 15:03 — with GitHub Actions Inactive
@daryllimyt daryllimyt temporarily deployed to internal-registry-ci April 2, 2026 15:07 — with GitHub Actions Inactive
@daryllimyt daryllimyt marked this pull request as ready for review April 2, 2026 15:15
@daryllimyt daryllimyt temporarily deployed to internal-registry-ci April 2, 2026 15:20 — with GitHub Actions Inactive
Collaborator

@jordan-umusu jordan-umusu left a comment

LGTM with a few nits and two concerns:

  1. K8s coming separately?
  2. What does the rollout for this look like?

Comment thread deployments/fargate/main.tf Outdated
Comment thread tracecat/workflow/executions/schemas.py Outdated
@daryllimyt daryllimyt force-pushed the feat/temporal-breakglass-codec-server branch from 096d008 to c49c534 Compare April 21, 2026 15:51
Contributor Author

daryllimyt commented Apr 21, 2026

K8s coming separately?

yes, internal PR

What does the rollout for this look like?

Merge first with encryption disabled, then enable it afterwards. This is safe because the codec can decode both encrypted and unencrypted payloads at the same time.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c49c5348d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tracecat/temporal/codec.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c760b22af

Comment thread tracecat/workflow/executions/common.py Outdated
Contributor

blacksmith-sh Bot commented Apr 21, 2026

Found 1 test failure on Blacksmith runners:

Failure

Test: test_large_collection_regressions/test_scatter_gather_massive_payload_50x2mb_e2e

Collaborator

@jordan-umusu jordan-umusu left a comment

LGTM once CI passing

2 participants