DA262: S3 backup (create-backup + list-backups) by delgod · Pull Request #59 · canonical/valkey-operator

delgod · 2026-05-13T13:39:09Z

Summary

Adds two Juju actions — create-backup and list-backups — that stream a fresh Valkey RDB snapshot from the targeted unit's valkey-cli --rdb - stdout directly to S3 via boto3 multipart upload. Any unit (leader or follower) can run the actions; locking is per-unit.

Credentials are supplied by relating the charm to the upstream s3-integrator via a new s3-credentials relation (interface s3, limit 1, optional).

Depends on

#58 (workload-exec-stream). Must be merged first — this PR uses workload.exec_stream and CliClient.build_command_prefix.

What's in the PR

src/
  literals.py                # S3_RELATION_NAME, BACKUP_ID_FORMAT, BACKUP_CA_FILENAME
  statuses.py                # BackupStatuses enum (3 entries)
  common/exceptions.py       # ValkeyBackupError
  core/base_workload.py      # tls_paths.backup_ca property
  core/models.py             # s3_credentials on PeerAppModel,
                             # backup_id on PeerUnitModel,
                             # ValkeyServer.is_backup_in_progress,
                             # ValkeyCluster.s3_credentials (parsed JSON envelope)
  core/cluster_state.py      # s3_relation property
  managers/backup.py         # NEW – BackupManager + status protocol
  events/backup.py           # NEW – BackupEvents + _exists_preventing_reason
  events/base_events.py      # storage_detaching guard during backup
  charm.py                   # wire BackupManager + BackupEvents;
                             # register with StatusHandler;
                             # restart_workload guard during backup
metadata.yaml                # requires.s3-credentials
actions.yaml                 # create-backup, list-backups
pyproject.toml               # boto3, mypy-boto3-s3
lib/charms/data_platform_libs/v0/s3.py   # via charmcraft fetch-lib
tests/unit/test_backup.py    # ~40 ops-scenario unit tests

Design highlights

Streaming, no archive storage. bucket.upload_fileobj(proc.stdout, key, Config=TransferConfig(multipart_chunksize=8 * 1024 * 1024)) reads chunks from the pipe and uploads them as 8 MiB parts; nothing is staged on disk.
Per-unit lock. The backup_id field in the running unit's peer-unit databag is both the lock value and the operation identifier. Two backups can run concurrently on different units (each gets a distinct S3 key); two on the same unit are rejected.
Cleanup on failure. Partial S3 objects are deleted (bucket.Object(key).delete()) when valkey-cli exits non-zero or upload_fileobj raises.
boto3 quirks honoured. Config(request_checksum_calculation="when_required", response_checksum_validation="when_required") (boto3#4400); CreateBucketConfiguration omitted for us-east-1 (aws-sdk-js#3647); idempotent error tokens (BucketAlreadyOwnedByYou, BucketAlreadyExists, BucketNameUnavailable).
TLS CA chain stored on every unit (not only the leader) so any unit can use TLS to talk to a self-signed S3 endpoint (e.g. RadosGW, MicroCeph).
Leader-switchover recovery. Re-observes leader_elected to re-trigger _on_s3_credentials_changed — covers the case where credentials_gone fired only on the old leader.
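The per-unit lock described above can be sketched roughly as follows. This is an illustration only: a plain dict stands in for the peer-unit databag, and `BackupInProgressError`, `acquire_backup_lock` and `release_backup_lock` are hypothetical names, not the charm's API.

```python
class BackupInProgressError(Exception):
    """Hypothetical error raised when this unit already holds a backup lock."""


def acquire_backup_lock(unit_databag: dict, backup_id: str) -> None:
    """Record backup_id as both lock and operation id, or refuse if one is set."""
    current = unit_databag.get("backup_id")
    if current:
        raise BackupInProgressError(f"backup {current} is already running on this unit")
    unit_databag["backup_id"] = backup_id


def release_backup_lock(unit_databag: dict) -> None:
    """Clear the lock; harmless when no lock is held."""
    unit_databag.pop("backup_id", None)
```

Because each unit writes only to its own databag, two units never contend for the same field, which is what allows concurrent backups on different units.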

Action UX

# Any unit — leader or follower
juju run valkey/N create-backup
# → { "backup-id": "2026-05-13T10:00:00Z" }

juju run valkey/N list-backups
# → { "backups": "backup-id … | backup-status\n-…-\n2026-05-13T10:00:00Z | finished" }

Status surface:

BACKUP_IN_PROGRESS (maintenance, unit scope) while streaming
BACKUP_S3_PARAMETERS_MISSING (blocked, app scope) when the relation is present but the integrator hasn't supplied bucket/credentials yet
BACKUP_FAILED (blocked, unit scope) after a failure

Default base implementation raises NotImplementedError so the protocol and the abstract base class remain importable and instantiable in between this commit and the substrate-specific overrides. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Stderr drained on a daemon thread to keep the kernel pipe buffer from blocking the streaming consumer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Binary stdout (encoding=None), no harness timeout. Wraps ExecError into the (returncode, stderr) tuple convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reusable by BackupManager so the streaming RDB command shares the same TLS/auth construction as the rest of valkey-cli usage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_K8sProcessHandle.wait() called ExecProcess.wait_output(), which buffers the entire stdout (the whole RDB) into the charm container's memory and OOMKills it for datasets larger than the container memory limit. Use process.wait() for the exit code and drain stderr on a bounded daemon thread, mirroring the VM handle. Tests updated to use io.BytesIO instead of mocking wait_output(). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

--pass <password> on the valkey-cli command line is visible in /proc/<pid>/cmdline to any same-UID process; on K8s every container in the pod shares the PID namespace. build_command_prefix no longer emits --pass; WorkloadBase.exec / exec_stream gained an env parameter, and exec_cli_command now supplies the password through the REDISCLI_AUTH environment variable instead. The matching create_backup change (passing env= to exec_stream) lives in managers/backup.py and stays on the s3-backup branch alongside the backup manager it belongs to. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
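A minimal sketch of the idea, assuming nothing about the charm's real `exec` / `exec_stream` signatures: the secret travels in the child's environment as `REDISCLI_AUTH` and never appears in argv. The helper name `run_with_password_env` is hypothetical, and a small Python child stands in for valkey-cli.

```python
import os
import subprocess
import sys


def run_with_password_env(argv: list, password: str) -> subprocess.CompletedProcess:
    """Run argv with the password in REDISCLI_AUTH instead of on the command line."""
    env = {**os.environ, "REDISCLI_AUTH": password}
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)


# Stand-in child: prints its own argv, then the secret it found in its environment.
child = [
    sys.executable,
    "-c",
    "import os, sys; print(sys.argv); print(os.environ.get('REDISCLI_AUTH'))",
]
result = run_with_password_env(child, "s3cr3t")
```

The child still receives the password, but a listing of the process command line shows only the argv entries.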

assert statements are compiled away under python -O, so the postcondition check would silently vanish in an optimised charm. Raise RuntimeError explicitly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The drain thread had no exception guard (a read error vanished silently), and wait() joined it with a 5s timeout while the thread could still be blocked on read() -- risking a torn read of _stderr_buf. wait() now closes stderr first (unblocking the read) and joins without a timeout; the drain body is wrapped in try/except. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ferences The ProcessHandle Protocol carried only a one-line docstring, hiding behaviour the caller actually depends on: stdout is streamed (never fully buffered), wait()'s returncode may be a negative sentinel when the substrate cannot determine it, and stderr_text may be truncated to a bounded tail. Expand the Protocol and its method docstrings to state the contract, and bring the _VmProcessHandle docstring up to the same level of detail as _K8sProcessHandle so the VM/K8s differences are visible at each implementation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_K8sProcessHandle.kill() logged every ops.pebble.Error at WARNING. The common case -- the exec already exited, so there is nothing to signal -- is benign and happens on every cleanup path, producing alarming noise. A genuinely unreachable Pebble, by contrast, means a possibly-still- running exec we can no longer stop and was being under-reported. Split the handling: ops.pebble.ConnectionError logs at ERROR, every other ops.pebble.Error logs at DEBUG. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_VmProcessHandle drained stderr into an unbounded list. Over a long-running stream (a multi-GB RDB transfer) a chatty child could grow it without limit. Switch to collections.deque(maxlen=64), keeping only the tail -- matching _K8sProcessHandle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
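The drain-thread pattern from these workload commits might look like the sketch below. It is not the charm's `_VmProcessHandle`: the class name `StreamingProcess` is hypothetical, and unlike the PR (which closes stderr before joining) this version simply joins after the child has exited, relying on stderr reaching EOF at that point.

```python
import subprocess
import threading
from collections import deque


class StreamingProcess:
    """Expose stdout as a stream while keeping only a bounded tail of stderr."""

    def __init__(self, argv: list, tail_lines: int = 64):
        self._proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if self._proc.stdout is None or self._proc.stderr is None:
            # Explicit raise rather than assert, so the check survives python -O.
            raise RuntimeError("child process was started without pipes")
        self.stdout = self._proc.stdout
        self._stderr_tail = deque(maxlen=tail_lines)  # older lines are dropped
        self._drain = threading.Thread(target=self._drain_stderr, daemon=True)
        self._drain.start()

    def _drain_stderr(self) -> None:
        """Keep reading stderr so a chatty child never blocks on a full pipe."""
        try:
            for line in self._proc.stderr:
                self._stderr_tail.append(line)
        except (OSError, ValueError):
            pass  # a real implementation would log the read error

    def wait(self) -> tuple:
        """Return (returncode, stderr tail); call after stdout has been consumed."""
        returncode = self._proc.wait()
        self._drain.join()
        return returncode, b"".join(self._stderr_tail).decode(errors="replace")
```

The caller is expected to read `stdout` to the end before calling `wait()`; otherwise the child can block on a full stdout pipe.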

Apply ruff format/check to the workload P0/P1/P2 fixes cherry-picked onto this branch, mirroring the per-phase style commits these fixes originally carried on the s3-backup branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the BackupManager, BackupEvents, charm wiring, cross-cutting guards (storage_detaching, restart_workload), state model (ValkeyCluster.s3_credentials property, ValkeyServer.is_backup_in_progress, ClusterState.s3_relation, tls_paths.backup_ca), and the create-backup and list-backups actions. Backup runs on any unit (not only leader): the per-unit backup_id field in the peer unit databag serves as the lock and identifier. Streams valkey-cli --rdb - stdout directly to S3 via boto3 upload_fileobj. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

proc.stdout is a non-rewindable pipe. A Tenacity retry over it uploads only the bytes left after the first attempt, silently producing a truncated RDB that passes as a valid backup. boto3 already retries individual multipart parts internally; rely on that. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
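The truncation can be reproduced without S3 or Tenacity. In the sketch below, `upload_once` and `naive_retry_upload` are hypothetical stand-ins for an upload call and a retry wrapper, and an `io.BytesIO` that is never rewound plays the role of the pipe.

```python
import io


def upload_once(stream, chunk_size: int, fail_on_chunk=None) -> bytes:
    """Read the stream in chunks; optionally fail after a chunk was already consumed."""
    received = bytearray()
    index = 0
    while chunk := stream.read(chunk_size):
        if index == fail_on_chunk:
            raise ConnectionError("simulated transient upload failure")
        received += chunk
        index += 1
    return bytes(received)


def naive_retry_upload(stream, chunk_size: int) -> bytes:
    """Retry the whole upload over the same stream, as a generic retry wrapper would."""
    try:
        return upload_once(stream, chunk_size, fail_on_chunk=1)
    except ConnectionError:
        # The bytes consumed by the failed attempt are gone; only the rest is sent.
        return upload_once(stream, chunk_size)


payload = b"REDIS0011" + bytes(55)  # 64 bytes, starting with an RDB-style header
stored = naive_retry_upload(io.BytesIO(payload), chunk_size=16)
```

The second attempt reports success, yet `stored` holds only the tail of `payload` and has lost its header, which is the failure mode this commit removes.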

valkey-cli --rdb can exit 0 having written nothing (or a protocol error blob) to stdout, which previously landed in S3 as a "successful" backup that silently corrupts the cluster on restore. Wrap proc.stdout in a _CountingReader, and after upload reject anything that is empty or does not start with the REDIS/VALKEY magic header, deleting the bogus object. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
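A counting wrapper of this kind could be sketched as below. The class and helper names are illustrative rather than the charm's `_CountingReader`, and the accepted prefixes are taken from the "REDIS/VALKEY magic header" wording above.

```python
RDB_MAGIC_PREFIXES = (b"REDIS", b"VALKEY")
_HEAD_LEN = max(len(prefix) for prefix in RDB_MAGIC_PREFIXES)


class CountingReader:
    """Wrap a binary stream, counting bytes and remembering the first few."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0
        self.head = b""

    def read(self, size: int = -1) -> bytes:
        chunk = self._raw.read(size)
        if len(self.head) < _HEAD_LEN:
            self.head += chunk[: _HEAD_LEN - len(self.head)]
        self.bytes_read += len(chunk)
        return chunk


def looks_like_rdb(reader: CountingReader) -> bool:
    """True only for a non-empty stream that began with a known magic prefix."""
    return reader.bytes_read > 0 and reader.head.startswith(RDB_MAGIC_PREFIXES)
```

After the upload call has drained the reader, a failed `looks_like_rdb` check is the signal to delete the uploaded object and fail the backup.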

boto3 runs in the charm process. The S3 endpoint CA was written via workload.write_file into the workload container's TLS dir, which on K8s is a different filesystem -- so boto3's verify= pointed at a path that does not exist in the charm container (P1-1). It also placed an integrator-supplied CA into the directory valkey-cli/valkey-server trust for client mTLS, letting a malicious integrator authenticate as a Valkey user (P1-3). BackupManager now owns a charm-process-local CA path under charm_dir and reads/writes it with plain pathlib I/O; TLSPaths.backup_ca is removed. P1-1 and P1-3 are the same path-relocation change, committed together. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

create_backup now supplies the Valkey admin password to exec_stream through the REDISCLI_AUTH environment variable instead of leaving it on argv. The build_command_prefix / WorkloadBase.exec / CliClient side of P1-2 lives on the workload-exec-stream branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

create_bucket matched substrings of str(ClientError); non-AWS S3 backends localise or recase the message text, so a pre-existing bucket could be reported as a hard failure. Use e.response["Error"]["Code"] and chain the cause with `from e`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
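As an illustration of matching on the structured code, the sketch below uses a stand-in exception that mimics only the `.response["Error"]["Code"]` shape the commit relies on; `FakeClientError`, `s3_error_code` and `bucket_already_present` are hypothetical names.

```python
from typing import Optional

# Tokens listed in the PR description as "bucket is already there" outcomes.
IDEMPOTENT_BUCKET_CODES = frozenset(
    {"BucketAlreadyOwnedByYou", "BucketAlreadyExists", "BucketNameUnavailable"}
)


class FakeClientError(Exception):
    """Stand-in carrying the same .response layout as a botocore ClientError."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


def s3_error_code(exc: Exception) -> Optional[str]:
    """Return the structured error code, or None if the exception has none."""
    response = getattr(exc, "response", None)
    error = response.get("Error") if isinstance(response, dict) else None
    code = error.get("Code") if isinstance(error, dict) else None
    return code if isinstance(code, str) else None


def bucket_already_present(exc: Exception) -> bool:
    """Decide on the code token, never on the human-readable message."""
    return s3_error_code(exc) in IDEMPOTENT_BUCKET_CODES
```

A backend that translates or recases the message still returns the same code token, so the decision does not depend on the message text.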

A misconfigured integrator sending tls-ca-chain as a bare string made "\n".join iterate characters and write a corrupt CA bundle, breaking all uploads. store_tls_ca_chain now requires a list of strings and skips otherwise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
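The pitfall is easy to demonstrate: joining a bare string iterates its characters. The helper below is a hypothetical sketch of the stricter check, not the charm's `store_tls_ca_chain`.

```python
from typing import Optional


def render_ca_bundle(chain: object) -> Optional[str]:
    """Return a PEM bundle for a non-empty list of strings, else None (skip)."""
    if not isinstance(chain, list) or not chain:
        return None
    if not all(isinstance(cert, str) for cert in chain):
        return None
    return "\n".join(chain)


# What the unchecked join does to a bare string: one character per line.
corrupted = "\n".join("ABC")
```

Returning None lets the caller leave any previously stored bundle untouched instead of overwriting it with garbage.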

The required-parameter check ran before the strip/rstrip normalisation, so path="/" (and bucket="/") collapsed to "" and were still written to the databag. An empty path makes list_backups enumerate the entire bucket -- a cross-tenant leak in a shared bucket. Validation now runs after normalisation and treats empty-after-strip as missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
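The ordering fix can be sketched as follows, assuming for illustration that `bucket` and `path` are the only required keys; the function name and the exact normalisation are hypothetical.

```python
def normalise_s3_parameters(raw: dict) -> dict:
    """Normalise first, then validate, so "/" cannot slip through as a value."""
    bucket = str(raw.get("bucket") or "").strip().strip("/")
    path = str(raw.get("path") or "").strip().strip("/")
    missing = [name for name, value in (("bucket", bucket), ("path", path)) if not value]
    if missing:
        raise ValueError("missing S3 parameters: " + ", ".join(missing))
    return {**raw, "bucket": bucket, "path": path}
```

With validation after the strip, `path="/"` is reported as missing rather than stored as an empty prefix.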

datetime.utcnow() is deprecated in Python 3.12 and removed in 3.14 (the charm tracks a 3.14 migration). Switch to datetime.now(timezone.utc); BACKUP_ID_FORMAT already carries the literal Z so the rendered id is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
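A short sketch of the replacement. The value of `BACKUP_ID_FORMAT` here is an assumption inferred from the example id `2026-05-13T10:00:00Z`, and `new_backup_id` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed format; the literal "Z" is part of the format string, not a tz directive.
BACKUP_ID_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def new_backup_id(now: Optional[datetime] = None) -> str:
    """Render a UTC backup id from a timezone-aware datetime."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime(BACKUP_ID_FORMAT)
```

Passing an explicit aware `now` keeps the function deterministic in unit tests.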

The boto3 default waiter polls 20 times at 5s intervals -- up to 100s blocking inside leader_elected when the S3 endpoint is slow or airgapped. Cap it at 5 attempts * 1s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

list_backups returned every object under the prefix, so stray uploads, lifecycle markers, or nested keys leaked into the operator-facing table. Filter results to the BACKUP_ID_FORMAT regex and query with an explicit "<path>/" prefix. Cause chained with `from e`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
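The filtering step might look like this sketch. It assumes objects are stored as `<path>/<backup-id>` and that ids follow the timestamp shape shown in the Action UX section; `BACKUP_ID_RE` and `backup_ids` are illustrative names.

```python
import re

# Assumed id shape, matching ids such as 2026-05-13T10:00:00Z.
BACKUP_ID_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def backup_ids(keys: list, path: str) -> list:
    """Keep only keys of the form "<path>/<backup-id>", returning the ids sorted."""
    prefix = f"{path}/"
    ids = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if BACKUP_ID_RE.fullmatch(name):  # rejects nested keys and stray uploads
            ids.append(name)
    return sorted(ids)
```

Using `fullmatch` on the remainder of the key is what excludes nested keys such as `<path>/tmp/<id>` as well as keys under a sibling prefix like `<path>-old/`.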

leader_elected re-fires _on_s3_credentials_changed, which ran a synchronous create_bucket S3 round trip on every leader churn. Skip it when the normalised envelope matches what is already stored; a real credentials rotation still falls through. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The backup integration fixture installed MicroCeph ad-hoc inside the test function -- not idempotent across spread re-runs and invisible to substrate review. Declare it in concierge-vm.yaml / concierge-k8s.yaml host.snaps so substrate provisioning is handled at the right layer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

create_backup emitted almost no logs, leaving operators and forensics blind. Emit backup.started (backup_id, unit, bucket, endpoint -- never credentials), backup.completed (bytes, elapsed_seconds), and backup.failed records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_exists_preventing_reason returned "" for OK and took a skip_running_check flag -- a flag-parameter anti-pattern that split the method's meaning for one caller. Replaced with _blocking_reason() -> str | None covering the shared preconditions; the create-backup-only "already in progress" check is now applied inline in _on_create_backup_action. Docstring added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The column width was hard-coded to 22, which only happened to fit the current backup-id format. Compute it from the longest id so the table stays aligned if the id format ever changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
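A sketch of a width computed from the data. The column titles follow the Action UX sample output, while the function name and the exact separator row are assumptions.

```python
def format_backup_table(backups: list) -> str:
    """Render (backup_id, status) pairs with the id column sized to the longest id."""
    id_header, status_header = "backup-id", "backup-status"
    width = max([len(id_header), *(len(backup_id) for backup_id, _ in backups)])
    header = f"{id_header:<{width}} | {status_header}"
    rows = [f"{backup_id:<{width}} | {status}" for backup_id, status in backups]
    return "\n".join([header, "-" * len(header), *rows])
```

Including the header length in the `max` keeps the table aligned even when the list is empty or the ids are shorter than the column title.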

The handlers below the comment are fully implemented; the comment misled anyone grepping for unfinished code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

A databag write failure in the finally block would mask the original ValkeyBackupError (or crash an otherwise-successful action). Wrap it in try/except and log; a leaked lock is recovered via the lock TTL and the troubleshooting runbook. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
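The masking problem and its guard can be sketched like this. The databag is any mutable mapping here, and `run_backup_with_lock` is a hypothetical wrapper rather than the charm's code.

```python
import logging

logger = logging.getLogger(__name__)


def run_backup_with_lock(databag, backup_id: str, do_backup):
    """Run do_backup while holding the lock; never let cleanup hide its outcome."""
    databag["backup_id"] = backup_id
    try:
        return do_backup()
    finally:
        try:
            databag.pop("backup_id", None)
        except Exception:
            # Without this guard, the cleanup error would replace the original
            # exception or turn a successful backup into a failure.
            logger.exception("could not clear backup lock %s", backup_id)
```

The trade-off is a possibly leaked lock, which the commit message says is recovered through the lock TTL and the troubleshooting runbook.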

Replaced the `# noqa: ANN001` suppressions with proper ops/S3-lib event types (CredentialsChangedEvent, CredentialsGoneEvent, ops.ActionEvent), matching the rest of the charm's event modules. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Build the S3 resource from a boto3.Session that carries the access/secret keys and region, rather than passing credentials straight to boto3.resource(). The Session confines the credentials to a single client object instead of leaning on process-global default-session state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Action results are readable by any Juju user, unlike juju debug-log. Returning str(exc) verbatim could leak the S3 endpoint, request/host ids, or RDB stream metadata into a lower-privilege surface. Add a _safe_error() helper that surfaces only the structured S3 error code (a fixed token such as "AccessDenied") and collapses everything else to a generic message. The create-backup and list-backups handlers now log the full exception to the unit log and return the sanitised string to the caller. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
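One possible shape for such a helper is sketched below. It is not the charm's `_safe_error`: the token pattern, the generic message and the stand-in exception are all assumptions made for illustration.

```python
import re

# Assumed shape of a "fixed token" error code such as AccessDenied.
_CODE_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,63}")
GENERIC_MESSAGE = "backup operation failed; see the unit log for details"


class FakeClientError(Exception):
    """Stand-in carrying the same .response layout as a botocore ClientError."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


def safe_error(exc: Exception) -> str:
    """Return only a structured error code, or a generic message otherwise."""
    response = getattr(exc, "response", None)
    error = response.get("Error") if isinstance(response, dict) else None
    code = error.get("Code") if isinstance(error, dict) else None
    if isinstance(code, str) and _CODE_TOKEN.fullmatch(code):
        return f"S3 error: {code}"
    return GENERIC_MESSAGE
```

The handler would log the full exception to the unit log and put only `safe_error(exc)` into the action result.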

P1-24 added a manager-level audit trail for the RDB transfer itself, but the Juju action invocation that triggered it was not recorded. Forensics on a leaked or unexpected RDB could not tie a backup to the action run that produced it. Log an "audit:" line at the start of both create-backup and list-backups handlers with the action id and the unit it ran on. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

state.statuses.get(...).root is the live list held inside the StatusObjectList pydantic model. Appending BACKUP_IN_PROGRESS / BACKUP_S3_PARAMETERS_MISSING directly to it mutated persisted state. Build the return value from a list() copy instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ic gate

The static-analysis CI job added during the P1 remediation had never actually run -- pyright's bundled Node could not start in the CI image used at the time -- so the new code was never type-checked. Running pyright surfaces 21 real errors in the S3 backup surface:

- BackupManager inherited `state: StatusesStateProtocol` from ManagerStatusProtocol, so every `self.state.cluster/.unit_server/...` access failed; narrow it to ClusterState (as the other managers do).
- `_get_bucket_resource` returned the cache's `object` value, hiding every boto3 Bucket method; type the cache and return value as Bucket via the mypy-boto3-s3 stubs and cast the S3ServiceResource.
- the envelope params were typed `dict[str, str]` / `dict[str, object]` inconsistently; unify on `dict[str, Any]` (the envelope genuinely holds both strings and the tls-ca-chain list).
- guard `s3_credentials` being None in list_backups / create_backup.

Two boto3-stub gaps that are correct at runtime (verified against MicroCeph) get a narrow `# pyright: ignore`: the resource waiter forwards WaiterConfig, and _CountingReader duck-types the read()-only slice of IO that upload_fileobj uses.

Scope `[tool.pyright]` / `[testenv:static]` to the two backup feature files. The rest of src/ carries pre-existing pyright debt that predates this gate; widening the scope is tracked separately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

delgod and others added 4 commits May 13, 2026 13:23

feat(workload-vm): implement exec_stream via subprocess.Popen

717d0f4

Stderr drained on a daemon thread to keep the kernel pipe buffer from blocking the streaming consumer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(workload-k8s): implement exec_stream via Pebble exec

ddff02e

Binary stdout (encoding=None), no harness timeout. Wraps ExecError into the (returncode, stderr) tuple convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

refactor(client): extract CliClient.build_command_prefix

5513c8e

Reusable by BackupManager so the streaming RDB command shares the same TLS/auth construction as the rest of valkey-cli usage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

delgod changed the title ~~DA262 S3 backup (create-backup + list-backups)~~ DA262: S3 backup (create-backup + list-backups) May 13, 2026

delgod mentioned this pull request May 13, 2026

DA261 S3 backup end-to-end against MicroCeph #60

Open

delgod and others added 24 commits May 14, 2026 12:14

fix(workload-vm): raise instead of assert on missing stdout pipe

36cd1ba

assert statements are compiled away under python -O, so the postcondition check would silently vanish in an optimised charm. Raise RuntimeError explicitly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

style: ruff format for workload exec_stream fixes

4351d56

Apply ruff format/check to the workload P0/P1/P2 fixes cherry-picked onto this branch, mirroring the per-phase style commits these fixes originally carried on the s3-backup branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(deps): add boto3 for S3 backup uploads

12bc5be

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(libs): vendor data_platform_libs/v0/s3 (S3Requirer)

5ee89e1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat: add backup literals and ValkeyBackupError

140904d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(statuses): add BackupStatuses enum

c7d7257

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

perf(backup): bound bucket wait_until_exists polling

d41e8f7

The boto3 default waiter polls 20 times at 5s intervals -- up to 100s blocking inside leader_elected when the S3 endpoint is slow or airgapped. Cap it at 5 attempts * 1s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

delgod and others added 13 commits May 14, 2026 13:58

style: ruff format for P1 fixes

3b650fb

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs: drop stale "stubs to be filled in" comment

3ea73b8

The handlers below the comment are fully implemented; the comment misled anyone grepping for unfinished code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

delgod force-pushed the s3-backup branch from 8a666e6 to 7f945d7 Compare May 14, 2026 13:59

delgod wants to merge 41 commits into 9/edge from s3-backup
