docs: add how-to guide for debugging Kubernetes charms #2498

Open
tonyandrewmeyer wants to merge 3 commits into canonical:main from tonyandrewmeyer:docs-debug-k8s

Conversation

@tonyandrewmeyer
Collaborator

@tonyandrewmeyer tonyandrewmeyer commented May 22, 2026

This PR adds a follow-on guide to the recent how-to for debugging, specifically focused on K8s charms and using Pebble.

At the recent sprints, we received a couple of comments that more information was needed for debugging in this specific case. This PR addresses them.

The main focus is on Pebble, with a little on K8s directly, stopping short of being a guide to debugging K8s itself.

Fixes #2489

Comment thread docs/howto/debug-a-kubernetes-charm.md Outdated
Contributor

@dwilding dwilding left a comment

Thanks for compiling this! I need to review in more detail, but have taken a first pass.

I think it would be easier for people to orient themselves if we move "Common failure modes" nearer the beginning of the doc - probably after "Know which container you’re looking at". I think that section is a great quick reference and should be (slightly) expanded by migrating other content from around the doc.

I've commented on the pieces I think we should move.

My thinking is that we should make a cleaner split between the why and the how. If you already know why you need to be reading a particular section, there should be minimal intro text. Get right into the how. But if you don't know which section you should be reading, "Common failure modes" points you in the right direction and helps you understand why.

Let me know if you'd like to discuss this suggestion together. I'm also very happy to experiment with different structures if that would help.

Comment on lines +32 to +34
```{tip}
If [`Container.can_connect()`](ops.Container.can_connect) returns `False` or your charm raises [`ops.pebble.ConnectionError`](ops.pebble.ConnectionError), the charm container cannot reach the workload's Pebble over that socket. This usually means the workload container hasn't started yet (no [`PebbleReadyEvent`](ops.PebbleReadyEvent) has fired) -- look at the pod first (see [](#k8s-inspect-the-pod)), not at your charm code.
```
Contributor

Suggest migrating this tip

(k8s-debug-from-charm-container)=
## Debug from the charm container

Many production workload images are stripped down to just the application -- with no shell or utilities -- so `juju ssh --container` lands you nowhere useful. You can still run Pebble commands against that workload from the charm container, because the workload's socket is mounted there:
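As a hedged illustration of the point above, here is a minimal Python sketch that builds the shell command you might run from the charm container. It assumes the workload's socket is mounted at `/charm/containers/<container>/pebble.socket` and that the Pebble CLI honours the `PEBBLE_SOCKET` environment variable. The helper names and the `pebble_bin` default are hypothetical, so adjust the binary path to wherever `pebble` lives in your charm container.

```python
import shlex


def workload_socket_path(container: str) -> str:
    """Return the assumed mount point of a workload's Pebble socket
    inside the charm container."""
    return f"/charm/containers/{container}/pebble.socket"


def pebble_command(container: str, *args: str, pebble_bin: str = "pebble") -> str:
    """Build a shell command that points the Pebble CLI at a workload's socket.

    `pebble_bin` is an assumption; pass the full path if `pebble` is not on PATH.
    """
    env = f"PEBBLE_SOCKET={shlex.quote(workload_socket_path(container))}"
    return " ".join([env, shlex.quote(pebble_bin), *map(shlex.quote, args)])


if __name__ == "__main__":
    # "workload" is a placeholder container name.
    print(pebble_command("workload", "services"))
```

Running the printed command in a shell on the charm container (for example after `juju ssh <unit>`) should list the workload's services even when the workload image has no shell of its own.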
Contributor

Suggest migrating most of this

(k8s-inspect-the-pod)=
## Inspect the pod at the Kubernetes layer

When a unit is stuck before Pebble is even reachable -- the container is `waiting`, the image won't pull, or the pod won't schedule -- the answer is below Juju, at the Kubernetes layer. Juju puts each model in its own namespace, and names each unit's pod `<app>-<unit-number>`.
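The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and it assumes the namespace has the same name as the Juju model.

```python
def pod_name(unit: str) -> str:
    """Map a Juju unit name such as 'myapp/0' to its pod name 'myapp-0'."""
    app, sep, number = unit.rpartition("/")
    if not sep or not app or not number.isdigit():
        raise ValueError(f"not a unit name: {unit!r}")
    return f"{app}-{number}"


def describe_pod_command(model: str, unit: str) -> list[str]:
    """Build the `kubectl describe pod` invocation for a unit.

    Assumes the Kubernetes namespace is named after the Juju model.
    """
    return ["kubectl", "describe", "pod", pod_name(unit), "--namespace", model]


if __name__ == "__main__":
    # "my-model" and "myapp/0" are placeholder names.
    print(" ".join(describe_pod_command("my-model", "myapp/0")))
```

The returned list can be passed directly to `subprocess.run` if you want to script the lookup rather than type it.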
Contributor

Suggest migrating the first sentence

Comment on lines +167 to +177
(k8s-common-failure-modes)=
## Common failure modes

| Symptom | Where to look |
| --- | --- |
| Charm stuck in `maintenance`/`waiting`; `can_connect()` is `False` | The workload container hasn't started -- `kubectl describe pod` for image-pull or scheduling errors ([](#k8s-inspect-the-pod)). |
| Service shows `backoff` or `error` | `pebble logs` for the crash output, then `pebble changes` / `pebble tasks` for the start failure ([](#k8s-pebble-cli)). |
| Config change has no effect on the running process | The charm added a layer but didn't [`replan`](#run-workloads-with-a-charm-kubernetes-replan); confirm with `pebble plan` and `pebble services`. |
| Charm raises `ConnectionError` mid-handler | The workload's Pebble became unreachable -- guard Pebble calls with `try`/`except` rather than `can_connect()` ([](ops.Container.can_connect)). |
| `pebble_custom_notice` never fires | Confirm the notice was recorded with `pebble notices`; check the `key` your handler matches on ([](#k8s-pebble-cli)). |
| Workload won't go ready despite running | A health check is failing -- `pebble checks` and `pebble check <name> --refresh` ([](#k8s-pebble-cli)). |
Contributor

Suggest migrating this


Development

Successfully merging this pull request may close these issues.

Extend the debugging how-to with Pebble-specific advice

2 participants