Wait-event coverage gaps: where backends block but pg_stat_activity reports NULL(iteration 3)

## Background

The "wait events for server logging" patch ([thread](https://www.postgresql.org/message-id/flat/CACdN0M78U%2BGvpqA7oey-GA7fFSYM636aDp6H9FVvCztv9zXxSA%40mail.gmail.com)) is one instance of a broader pattern: a backend blocks on something, but `pg_stat_activity.wait_event` is `NULL`, so it reads as "on CPU" and the stall is invisible. This issue catalogs the similar gaps so they can be tracked/prioritized.

Confirmed against the current tree: `src/backend/libpq/auth.c` and `contrib/postgres_fdw` contain **zero** `pgstat_report_wait_start` / `WAIT_EVENT_*` instrumentation.

## Genuine gaps — backend blocks but reports `NULL`

1. **Outbound libpq / FDW network waits.** `postgres_fdw`, `dblink`, and `libpqwalreceiver` block on a remote server through libpq with no wait event — there is no `ClientRead`/`ClientWrite` equivalent for *outbound* connections. A backend stuck on a slow foreign server looks idle-on-CPU. Likely the biggest real-world gap.

2. **Authentication.** `auth.c` is entirely uninstrumented: LDAP (`ldap_search`), PAM conversation, RADIUS socket, Kerberos/GSSAPI, SSPI, plus **DNS** (`getaddrinfo`) and reverse lookups during `pg_hba` matching. These block on external services, often for seconds. (Partly because auth runs before stats are fully set up.)

3. **TLS handshake.** `SSL_accept` and cert/CRL loading during connection setup — the negotiation itself is not a wait event (only steady-state ClientRead/Write are).

4. **`archive_command` / `restore_command`.** The archiver/startup process shells out via `system()` and blocks for the entire external command with no wait event. Recovery stalled on a slow `restore_command` is invisible.

5. **Non-VFD file ops.** Directory scans and metadata syscalls outside the smgr/File layer — `opendir`/`readdir` over `pg_wal`, tablespaces; `unlink`/`rename`/`stat` during startup and checkpoint segment recycling; config-file and `pg_hba`/SSL-file reads on reload. Most are not wrapped.

6. **The syslogger's own disk writes.** `write_syslogger_file()` `fwrite()` to the on-disk log file, plus rotation (`fopen`/`fclose`) and the `current_logfiles` metainfo write. This is where a slow log *device* actually blocks hardest. Caveat: the syslogger does not attach to shared memory and has no `pg_stat_activity` entry, so it currently *cannot* report wait events at all — fixing this needs a bigger change.

## By-design `NULL`, but arguably the same problem

7. **Pure CPU work** (sort, hash, aggregation, expression/qual eval, (de)compression). Genuinely on-CPU, so `NULL` is "correct" — but indistinguishable from an uninstrumented wait. This ambiguity is the meta-gap behind all of the above.

8. **Memory pressure** — `malloc`/`mmap`/page faults/swap-in. Backend blocked in the kernel, reports `NULL`. Hard to capture from userspace.

9. **Spinlocks.** LWLocks are instrumented; spinlock spin-delays under contention are not (by design — meant to be short, but pathological cases hide here).

## Structural root cause

`NULL` is overloaded: it means *both* "on CPU" *and* "blocked somewhere we didn't instrument." Two recurring directions worth pursuing:

- Give auxiliary/early processes (syslogger, archiver, some auth contexts) real backend-status entries so they *can* report.
- Add a way to positively signal "actually running" so `NULL` stops being a catch-all.

There is a generic `Extension` wait event extensions can adopt, but in-core libpq users like `postgres_fdw` don't, which is why #1 stays dark.

---
*Catalog compiled while reviewing the server-logging wait-events patch; each item is a candidate for its own focused patch.*

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait-event coverage gaps: where backends block but pg_stat_activity reports NULL(iteration 3) #49

Background

Genuine gaps — backend blocks but reports `NULL`

By-design `NULL`, but arguably the same problem

Structural root cause

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Wait-event coverage gaps: where backends block but pg_stat_activity reports NULL(iteration 3) #49

Description

Background

Genuine gaps — backend blocks but reports NULL

By-design NULL, but arguably the same problem

Structural root cause

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Genuine gaps — backend blocks but reports `NULL`

By-design `NULL`, but arguably the same problem