Skip to content

DA262: S3 backup (create-backup + list-backups)#59

Open
delgod wants to merge 41 commits into
9/edgefrom
s3-backup
Open

DA262: S3 backup (create-backup + list-backups)#59
delgod wants to merge 41 commits into
9/edgefrom
s3-backup

Conversation

@delgod
Copy link
Copy Markdown
Member

@delgod delgod commented May 13, 2026

Summary

Adds two Juju actions — create-backup and list-backups — that stream a fresh Valkey RDB snapshot from the targeted unit's valkey-cli --rdb - stdout directly to S3 via boto3 multipart upload. Any unit (leader or follower) can run the actions; locking is per-unit.

Credentials are supplied by relating the charm to the upstream s3-integrator via a new s3-credentials relation (interface s3, limit 1, optional).

Depends on

#58 (workload-exec-stream). Must be merged first — this PR uses workload.exec_stream and CliClient.build_command_prefix.

What's in the PR

src/
literals.py                # S3_RELATION_NAME, BACKUP_ID_FORMAT, BACKUP_CA_FILENAME
statuses.py                # BackupStatuses enum (3 entries)
common/exceptions.py       # ValkeyBackupError
core/base_workload.py      # tls_paths.backup_ca property
core/models.py             # s3_credentials on PeerAppModel,
                           # backup_id on PeerUnitModel,
                           # ValkeyServer.is_backup_in_progress,
                           # ValkeyCluster.s3_credentials (parsed JSON envelope)
core/cluster_state.py      # s3_relation property
managers/backup.py         # NEW – BackupManager + status protocol
events/backup.py           # NEW – BackupEvents + _exists_preventing_reason
events/base_events.py      # storage_detaching guard during backup
charm.py                   # wire BackupManager + BackupEvents;
                           # register with StatusHandler;
                           # restart_workload guard during backup
metadata.yaml                # requires.s3-credentials
actions.yaml                 # create-backup, list-backups
pyproject.toml               # boto3, mypy-boto3-s3
lib/charms/data_platform_libs/v0/s3.py   # via charmcraft fetch-lib
tests/unit/test_backup.py    # ~40 ops-scenario unit tests

Design highlights

  • Streaming, no archive storage. bucket.upload_fileobj(proc.stdout, key, Config=TransferConfig(multipart_chunksize=8 MiB)) reads chunks from the pipe; no on-disk staging.
  • Per-unit lock. The backup_id field in the running unit's peer-unit databag is both the lock value and the operation identifier. Two backups can run concurrently on different units (each gets a distinct S3 key); two on the same unit are rejected.
  • Cleanup on failure. Partial S3 objects are deleted (bucket.Object(key).delete()) when valkey-cli exits non-zero or upload_fileobj raises.
  • boto3 quirks honoured. Config(request_checksum_calculation="when_required", response_checksum_validation="when_required") (boto3#4400);
    CreateBucketConfiguration omitted for us-east-1 (aws-sdk-js#3647);
    idempotent error tokens (BucketAlreadyOwnedByYou, BucketAlreadyExists, BucketNameUnavailable).
  • TLS CA chain stored on every unit (not only the leader) so any unit can use TLS to talk to a self-signed S3 endpoint (e.g. RadosGW, MicroCeph).
  • Leader-switchover recovery. Re-observes leader_elected to re-trigger _on_s3_credentials_changed — covers the case where credentials_gone fired only on the old leader.

Action UX

# Any unit — leader or follower
juju run valkey/N create-backup
# → { "backup-id": "2026-05-13T10:00:00Z" }

juju run valkey/N list-backups
# → { "backups": "backup-id … | backup-status\n-…-\n2026-05-13T10:00:00Z | finished" }

Status surface:

  • BACKUP_IN_PROGRESS (maintenance, unit scope) while streaming
  • BACKUP_S3_PARAMETERS_MISSING (blocked, app scope) when the relation is present but the integrator hasn't supplied bucket/credentials yet
  • BACKUP_FAILED (blocked, unit scope) after a failure

delgod and others added 4 commits May 13, 2026 13:23
Default base implementation raises NotImplementedError so the protocol
and the abstract base class remain importable and instantiable in
between this commit and the substrate-specific overrides.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stderr drained on a daemon thread to keep the kernel pipe buffer from
blocking the streaming consumer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Binary stdout (encoding=None), no harness timeout. Wraps ExecError into
the (returncode, stderr) tuple convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reusable by BackupManager so the streaming RDB command shares the same
TLS/auth construction as the rest of valkey-cli usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@delgod delgod changed the title DA262 S3 backup (create-backup + list-backups) DA262: S3 backup (create-backup + list-backups) May 13, 2026
delgod and others added 24 commits May 14, 2026 12:14
_K8sProcessHandle.wait() called ExecProcess.wait_output(), which buffers
the entire stdout (the whole RDB) into the charm container's memory and
OOMKills it for datasets larger than the container memory limit. Use
process.wait() for the exit code and drain stderr on a bounded daemon
thread, mirroring the VM handle. Tests updated to use io.BytesIO instead
of mocking wait_output().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
--pass <password> on the valkey-cli command line is visible in
/proc/<pid>/cmdline to any same-UID process; on K8s every container in
the pod shares the PID namespace. build_command_prefix no longer emits
--pass; WorkloadBase.exec / exec_stream gained an env parameter, and
exec_cli_command now supplies the password through the REDISCLI_AUTH
environment variable instead.

The matching create_backup change (passing env= to exec_stream) lives
in managers/backup.py and stays on the s3-backup branch alongside the
backup manager it belongs to.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
assert statements are compiled away under python -O, so the postcondition
check would silently vanish in an optimised charm. Raise RuntimeError
explicitly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The drain thread had no exception guard (a read error vanished
silently), and wait() joined it with a 5s timeout while the thread could
still be blocked on read() -- risking a torn read of _stderr_buf. wait()
now closes stderr first (unblocking the read) and joins without a
timeout; the drain body is wrapped in try/except.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ferences

The ProcessHandle Protocol carried only a one-line docstring, hiding
behaviour the caller actually depends on: stdout is streamed (never
fully buffered), wait()'s returncode may be a negative sentinel when
the substrate cannot determine it, and stderr_text may be truncated to
a bounded tail.

Expand the Protocol and its method docstrings to state the contract,
and bring the _VmProcessHandle docstring up to the same level of
detail as _K8sProcessHandle so the VM/K8s differences are visible at
each implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_K8sProcessHandle.kill() logged every ops.pebble.Error at WARNING. The
common case -- the exec already exited, so there is nothing to signal --
is benign and happens on every cleanup path, producing alarming noise.
A genuinely unreachable Pebble, by contrast, means a possibly-still-
running exec we can no longer stop and was being under-reported.

Split the handling: ops.pebble.ConnectionError logs at ERROR, every
other ops.pebble.Error logs at DEBUG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_VmProcessHandle drained stderr into an unbounded list. Over a
long-running stream (a multi-GB RDB transfer) a chatty child could grow
it without limit. Switch to collections.deque(maxlen=64), keeping only
the tail -- matching _K8sProcessHandle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply ruff format/check to the workload P0/P1/P2 fixes cherry-picked
onto this branch, mirroring the per-phase style commits these fixes
originally carried on the s3-backup branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the BackupManager, BackupEvents, charm wiring, cross-cutting
guards (storage_detaching, restart_workload), state model
(ValkeyCluster.s3_credentials property, ValkeyServer.is_backup_in_progress,
ClusterState.s3_relation, tls_paths.backup_ca), and the create-backup
and list-backups actions.

Backup runs on any unit (not only leader): the per-unit backup_id field
in the peer unit databag serves as the lock and identifier. Streams
valkey-cli --rdb - stdout directly to S3 via boto3 upload_fileobj.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
proc.stdout is a non-rewindable pipe. A Tenacity retry over it uploads
only the bytes left after the first attempt, silently producing a
truncated RDB that passes as a valid backup. boto3 already retries
individual multipart parts internally; rely on that.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
valkey-cli --rdb can exit 0 having written nothing (or a protocol error
blob) to stdout, which previously landed in S3 as a "successful" backup
that silently corrupts the cluster on restore. Wrap proc.stdout in a
_CountingReader, and after upload reject anything that is empty or does
not start with the REDIS/VALKEY magic header, deleting the bogus object.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
boto3 runs in the charm process. The S3 endpoint CA was written via
workload.write_file into the workload container's TLS dir, which on K8s
is a different filesystem -- so boto3's verify= pointed at a path that
does not exist in the charm container (P1-1). It also placed an
integrator-supplied CA into the directory valkey-cli/valkey-server trust
for client mTLS, letting a malicious integrator authenticate as a Valkey
user (P1-3).

BackupManager now owns a charm-process-local CA path under charm_dir and
reads/writes it with plain pathlib I/O; TLSPaths.backup_ca is removed.
P1-1 and P1-3 are the same path-relocation change, committed together.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_backup now supplies the Valkey admin password to exec_stream
through the REDISCLI_AUTH environment variable instead of leaving it on
argv. The build_command_prefix / WorkloadBase.exec / CliClient side of
P1-2 lives on the workload-exec-stream branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_bucket matched substrings of str(ClientError); non-AWS S3 backends
localise or recase the message text, so a pre-existing bucket could be
reported as a hard failure. Use e.response["Error"]["Code"] and chain the
cause with `from e`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A misconfigured integrator sending tls-ca-chain as a bare string made
"\n".join iterate characters and write a corrupt CA bundle, breaking all
uploads. store_tls_ca_chain now requires a list of strings and skips
otherwise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The required-parameter check ran before the strip/rstrip normalisation,
so path="/" (and bucket="/") collapsed to "" and were still written to
the databag. An empty path makes list_backups enumerate the entire
bucket -- a cross-tenant leak in a shared bucket. Validation now runs
after normalisation and treats empty-after-strip as missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
datetime.utcnow() is deprecated in Python 3.12 and removed in 3.14 (the
charm tracks a 3.14 migration). Switch to datetime.now(timezone.utc);
BACKUP_ID_FORMAT already carries the literal Z so the rendered id is
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The boto3 default waiter polls 20 times at 5s intervals -- up to 100s
blocking inside leader_elected when the S3 endpoint is slow or
airgapped. Cap it at 5 attempts * 1s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
list_backups returned every object under the prefix, so stray uploads,
lifecycle markers, or nested keys leaked into the operator-facing table.
Filter results to the BACKUP_ID_FORMAT regex and query with an explicit
"<path>/" prefix. Cause chained with `from e`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
leader_elected re-fires _on_s3_credentials_changed, which ran a
synchronous create_bucket S3 round trip on every leader churn. Skip it
when the normalised envelope matches what is already stored; a real
credentials rotation still falls through.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
delgod and others added 13 commits May 14, 2026 13:58
The backup integration fixture installed MicroCeph ad-hoc inside the
test function -- not idempotent across spread re-runs and invisible to
substrate review. Declare it in concierge-vm.yaml / concierge-k8s.yaml
host.snaps so substrate provisioning is handled at the right layer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_backup emitted almost no logs, leaving operators and forensics
blind. Emit backup.started (backup_id, unit, bucket, endpoint -- never
credentials), backup.completed (bytes, elapsed_seconds), and
backup.failed records.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_exists_preventing_reason returned "" for OK and took a skip_running_check
flag -- a flag-parameter anti-pattern that split the method's meaning for
one caller. Replaced with _blocking_reason() -> str | None covering the
shared preconditions; the create-backup-only "already in progress" check
is now applied inline in _on_create_backup_action. Docstring added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The column width was hard-coded to 22, which only happened to fit the
current backup-id format. Compute it from the longest id so the table
stays aligned if the id format ever changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The handlers below the comment are fully implemented; the comment misled
anyone grepping for unfinished code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A databag write failure in the finally block would mask the original
ValkeyBackupError (or crash an otherwise-successful action). Wrap it in
try/except and log; a leaked lock is recovered via the lock TTL and the
troubleshooting runbook.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaced the `# noqa: ANN001` suppressions with proper ops/S3-lib event
types (CredentialsChangedEvent, CredentialsGoneEvent, ops.ActionEvent),
matching the rest of the charm's event modules.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build the S3 resource from a boto3.Session that carries the
access/secret keys and region, rather than passing credentials
straight to boto3.resource(). The Session confines the credentials
to a single client object instead of leaning on process-global
default-session state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Action results are readable by any Juju user, unlike juju debug-log.
Returning str(exc) verbatim could leak the S3 endpoint, request/host
ids, or RDB stream metadata into a lower-privilege surface.

Add a _safe_error() helper that surfaces only the structured S3 error
code (a fixed token such as "AccessDenied") and collapses everything
else to a generic message. The create-backup and list-backups handlers
now log the full exception to the unit log and return the sanitised
string to the caller.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
P1-24 added a manager-level audit trail for the RDB transfer itself,
but the Juju action invocation that triggered it was not recorded.
Forensics on a leaked or unexpected RDB could not tie a backup to the
action run that produced it.

Log an "audit:" line at the start of both create-backup and
list-backups handlers with the action id and the unit it ran on.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
state.statuses.get(...).root is the live list held inside the
StatusObjectList pydantic model. Appending BACKUP_IN_PROGRESS /
BACKUP_S3_PARAMETERS_MISSING directly to it mutated persisted state.
Build the return value from a list() copy instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ic gate

The static-analysis CI job added during the P1 remediation had never
actually run -- pyright's bundled Node could not start in the CI image
used at the time -- so the new code was never type-checked.

Running pyright surfaces 21 real errors in the S3 backup surface:
  - BackupManager inherited `state: StatusesStateProtocol` from
    ManagerStatusProtocol, so every `self.state.cluster/.unit_server/...`
    access failed; narrow it to ClusterState (as the other managers do).
  - `_get_bucket_resource` returned the cache's `object` value, hiding
    every boto3 Bucket method; type the cache and return value as Bucket
    via the mypy-boto3-s3 stubs and cast the S3ServiceResource.
  - the envelope params were typed `dict[str, str]` / `dict[str, object]`
    inconsistently; unify on `dict[str, Any]` (the envelope genuinely
    holds both strings and the tls-ca-chain list).
  - guard `s3_credentials` being None in list_backups / create_backup.

Two boto3-stub gaps that are correct at runtime (verified against
MicroCeph) get a narrow `# pyright: ignore`: the resource waiter
forwards WaiterConfig, and _CountingReader duck-types the read()-only
slice of IO that upload_fileobj uses.

Scope `[tool.pyright]` / `[testenv:static]` to the two backup feature
files. The rest of src/ carries pre-existing pyright debt that predates
this gate; widening the scope is tracked separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant