Conversation
Default base implementation raises NotImplementedError so the protocol and the abstract base class remain importable and instantiable in between this commit and the substrate-specific overrides. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stderr drained on a daemon thread to keep the kernel pipe buffer from blocking the streaming consumer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Binary stdout (encoding=None), no harness timeout. Wraps ExecError into the (returncode, stderr) tuple convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reusable by BackupManager so the streaming RDB command shares the same TLS/auth construction as the rest of valkey-cli usage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_K8sProcessHandle.wait() called ExecProcess.wait_output(), which buffers the entire stdout (the whole RDB) into the charm container's memory and OOMKills it for datasets larger than the container memory limit. Use process.wait() for the exit code and drain stderr on a bounded daemon thread, mirroring the VM handle. Tests updated to use io.BytesIO instead of mocking wait_output(). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
--pass <password> on the valkey-cli command line is visible in /proc/<pid>/cmdline to any same-UID process; on K8s every container in the pod shares the PID namespace. build_command_prefix no longer emits --pass; WorkloadBase.exec / exec_stream gained an env parameter, and exec_cli_command now supplies the password through the REDISCLI_AUTH environment variable instead. The matching create_backup change (passing env= to exec_stream) lives in managers/backup.py and stays on the s3-backup branch alongside the backup manager it belongs to. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
assert statements are compiled away under python -O, so the postcondition check would silently vanish in an optimised charm. Raise RuntimeError explicitly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The drain thread had no exception guard (a read error vanished silently), and wait() joined it with a 5s timeout while the thread could still be blocked on read() -- risking a torn read of _stderr_buf. wait() now closes stderr first (unblocking the read) and joins without a timeout; the drain body is wrapped in try/except. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ferences The ProcessHandle Protocol carried only a one-line docstring, hiding behaviour the caller actually depends on: stdout is streamed (never fully buffered), wait()'s returncode may be a negative sentinel when the substrate cannot determine it, and stderr_text may be truncated to a bounded tail. Expand the Protocol and its method docstrings to state the contract, and bring the _VmProcessHandle docstring up to the same level of detail as _K8sProcessHandle so the VM/K8s differences are visible at each implementation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_K8sProcessHandle.kill() logged every ops.pebble.Error at WARNING. The common case -- the exec already exited, so there is nothing to signal -- is benign and happens on every cleanup path, producing alarming noise. A genuinely unreachable Pebble, by contrast, means a possibly-still- running exec we can no longer stop and was being under-reported. Split the handling: ops.pebble.ConnectionError logs at ERROR, every other ops.pebble.Error logs at DEBUG. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_VmProcessHandle drained stderr into an unbounded list. Over a long-running stream (a multi-GB RDB transfer) a chatty child could grow it without limit. Switch to collections.deque(maxlen=64), keeping only the tail -- matching _K8sProcessHandle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Apply ruff format/check to the workload P0/P1/P2 fixes cherry-picked onto this branch, mirroring the per-phase style commits these fixes originally carried on the s3-backup branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the BackupManager, BackupEvents, charm wiring, cross-cutting guards (storage_detaching, restart_workload), state model (ValkeyCluster.s3_credentials property, ValkeyServer.is_backup_in_progress, ClusterState.s3_relation, tls_paths.backup_ca), and the create-backup and list-backups actions. Backup runs on any unit (not only leader): the per-unit backup_id field in the peer unit databag serves as the lock and identifier. Streams valkey-cli --rdb - stdout directly to S3 via boto3 upload_fileobj. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
proc.stdout is a non-rewindable pipe. A Tenacity retry over it uploads only the bytes left after the first attempt, silently producing a truncated RDB that passes as a valid backup. boto3 already retries individual multipart parts internally; rely on that. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
valkey-cli --rdb can exit 0 having written nothing (or a protocol error blob) to stdout, which previously landed in S3 as a "successful" backup that silently corrupts the cluster on restore. Wrap proc.stdout in a _CountingReader, and after upload reject anything that is empty or does not start with the REDIS/VALKEY magic header, deleting the bogus object. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
boto3 runs in the charm process. The S3 endpoint CA was written via workload.write_file into the workload container's TLS dir, which on K8s is a different filesystem -- so boto3's verify= pointed at a path that does not exist in the charm container (P1-1). It also placed an integrator-supplied CA into the directory valkey-cli/valkey-server trust for client mTLS, letting a malicious integrator authenticate as a Valkey user (P1-3). BackupManager now owns a charm-process-local CA path under charm_dir and reads/writes it with plain pathlib I/O; TLSPaths.backup_ca is removed. P1-1 and P1-3 are the same path-relocation change, committed together. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_backup now supplies the Valkey admin password to exec_stream through the REDISCLI_AUTH environment variable instead of leaving it on argv. The build_command_prefix / WorkloadBase.exec / CliClient side of P1-2 lives on the workload-exec-stream branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_bucket matched substrings of str(ClientError); non-AWS S3 backends localise or recase the message text, so a pre-existing bucket could be reported as a hard failure. Use e.response["Error"]["Code"] and chain the cause with `from e`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A misconfigured integrator sending tls-ca-chain as a bare string made "\n".join iterate characters and write a corrupt CA bundle, breaking all uploads. store_tls_ca_chain now requires a list of strings and skips otherwise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The required-parameter check ran before the strip/rstrip normalisation, so path="/" (and bucket="/") collapsed to "" and were still written to the databag. An empty path makes list_backups enumerate the entire bucket -- a cross-tenant leak in a shared bucket. Validation now runs after normalisation and treats empty-after-strip as missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
datetime.utcnow() is deprecated in Python 3.12 and removed in 3.14 (the charm tracks a 3.14 migration). Switch to datetime.now(timezone.utc); BACKUP_ID_FORMAT already carries the literal Z so the rendered id is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The boto3 default waiter polls 20 times at 5s intervals -- up to 100s blocking inside leader_elected when the S3 endpoint is slow or airgapped. Cap it at 5 attempts * 1s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
list_backups returned every object under the prefix, so stray uploads, lifecycle markers, or nested keys leaked into the operator-facing table. Filter results to the BACKUP_ID_FORMAT regex and query with an explicit "<path>/" prefix. Cause chained with `from e`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
leader_elected re-fires _on_s3_credentials_changed, which ran a synchronous create_bucket S3 round trip on every leader churn. Skip it when the normalised envelope matches what is already stored; a real credentials rotation still falls through. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The backup integration fixture installed MicroCeph ad-hoc inside the test function -- not idempotent across spread re-runs and invisible to substrate review. Declare it in concierge-vm.yaml / concierge-k8s.yaml host.snaps so substrate provisioning is handled at the right layer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_backup emitted almost no logs, leaving operators and forensics blind. Emit backup.started (backup_id, unit, bucket, endpoint -- never credentials), backup.completed (bytes, elapsed_seconds), and backup.failed records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_exists_preventing_reason returned "" for OK and took a skip_running_check flag -- a flag-parameter anti-pattern that split the method's meaning for one caller. Replaced with _blocking_reason() -> str | None covering the shared preconditions; the create-backup-only "already in progress" check is now applied inline in _on_create_backup_action. Docstring added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The column width was hard-coded to 22, which only happened to fit the current backup-id format. Compute it from the longest id so the table stays aligned if the id format ever changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The handlers below the comment are fully implemented; the comment misled anyone grepping for unfinished code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A databag write failure in the finally block would mask the original ValkeyBackupError (or crash an otherwise-successful action). Wrap it in try/except and log; a leaked lock is recovered via the lock TTL and the troubleshooting runbook. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaced the `# noqa: ANN001` suppressions with proper ops/S3-lib event types (CredentialsChangedEvent, CredentialsGoneEvent, ops.ActionEvent), matching the rest of the charm's event modules. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Build the S3 resource from a boto3.Session that carries the access/secret keys and region, rather than passing credentials straight to boto3.resource(). The Session confines the credentials to a single client object instead of leaning on process-global default-session state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Action results are readable by any Juju user, unlike juju debug-log. Returning str(exc) verbatim could leak the S3 endpoint, request/host ids, or RDB stream metadata into a lower-privilege surface. Add a _safe_error() helper that surfaces only the structured S3 error code (a fixed token such as "AccessDenied") and collapses everything else to a generic message. The create-backup and list-backups handlers now log the full exception to the unit log and return the sanitised string to the caller. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
P1-24 added a manager-level audit trail for the RDB transfer itself, but the Juju action invocation that triggered it was not recorded. Forensics on a leaked or unexpected RDB could not tie a backup to the action run that produced it. Log an "audit:" line at the start of both create-backup and list-backups handlers with the action id and the unit it ran on. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
state.statuses.get(...).root is the live list held inside the StatusObjectList pydantic model. Appending BACKUP_IN_PROGRESS / BACKUP_S3_PARAMETERS_MISSING directly to it mutated persisted state. Build the return value from a list() copy instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ic gate
The static-analysis CI job added during the P1 remediation had never
actually run -- pyright's bundled Node could not start in the CI image
used at the time -- so the new code was never type-checked.
Running pyright surfaces 21 real errors in the S3 backup surface:
- BackupManager inherited `state: StatusesStateProtocol` from
ManagerStatusProtocol, so every `self.state.cluster/.unit_server/...`
access failed; narrow it to ClusterState (as the other managers do).
- `_get_bucket_resource` returned the cache's `object` value, hiding
every boto3 Bucket method; type the cache and return value as Bucket
via the mypy-boto3-s3 stubs and cast the S3ServiceResource.
- the envelope params were typed `dict[str, str]` / `dict[str, object]`
inconsistently; unify on `dict[str, Any]` (the envelope genuinely
holds both strings and the tls-ca-chain list).
- guard `s3_credentials` being None in list_backups / create_backup.
Two boto3-stub gaps that are correct at runtime (verified against
MicroCeph) get a narrow `# pyright: ignore`: the resource waiter
forwards WaiterConfig, and _CountingReader duck-types the read()-only
slice of IO that upload_fileobj uses.
Scope `[tool.pyright]` / `[testenv:static]` to the two backup feature
files. The rest of src/ carries pre-existing pyright debt that predates
this gate; widening the scope is tracked separately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds two Juju actions —
create-backupandlist-backups— that stream a fresh Valkey RDB snapshot from the targeted unit'svalkey-cli --rdb -stdout directly to S3 via boto3 multipart upload. Any unit (leader or follower) can run the actions; locking is per-unit.Credentials are supplied by relating the charm to the upstream
s3-integratorvia a news3-credentialsrelation (interfaces3, limit 1, optional).Depends on
#58 (
workload-exec-stream). Must be merged first — this PR usesworkload.exec_streamandCliClient.build_command_prefix.What's in the PR
Design highlights
bucket.upload_fileobj(proc.stdout, key, Config=TransferConfig(multipart_chunksize=8 MiB))reads chunks from the pipe; no on-disk staging.backup_idfield in the running unit's peer-unit databag is both the lock value and the operation identifier. Two backups can run concurrently on different units (each gets a distinct S3 key); two on the same unit are rejected.bucket.Object(key).delete()) whenvalkey-cliexits non-zero orupload_fileobjraises.Config(request_checksum_calculation="when_required", response_checksum_validation="when_required")(boto3#4400);CreateBucketConfigurationomitted forus-east-1(aws-sdk-js#3647);idempotent error tokens (
BucketAlreadyOwnedByYou,BucketAlreadyExists,BucketNameUnavailable).leader_electedto re-trigger_on_s3_credentials_changed— covers the case wherecredentials_gonefired only on the old leader.Action UX
Status surface:
BACKUP_IN_PROGRESS(maintenance, unit scope) while streamingBACKUP_S3_PARAMETERS_MISSING(blocked, app scope) when the relation is present but the integrator hasn't supplied bucket/credentials yetBACKUP_FAILED(blocked, unit scope) after a failure