Add TAP tests for pg_resize_shared_buffers() with concurrent checkpoints #10

Draft
palak-chaturvedi wants to merge 5 commits into ashutosh-bapat:dev/shbuf_resize from palak-chaturvedi:dev/resizebuf-checkpoint-tests

@palak-chaturvedi

Adds two TAP tests that exercise pg_resize_shared_buffers() against the checkpointer, using the ProcSignalBarrier injection points already present in buf_resize.c and three new injection points added to CreateCheckPoint():

006_resize_concurrent_checkpoint_crash.pl
- Walks the resize state machine through every consecutive pair of injection points in buf_resize.c (4 shrink pairs + 4 expand pairs).
- At each pair, issues a synchronous CHECKPOINT, crashes the server with 'immediate' stop, restarts, and verifies that crash recovery restores the intended shared_buffers value with no checksum validation errors and no PANIC.

007_resize_async_checkpoint_crash.pl
- Runs pg_resize_shared_buffers() concurrently with a CHECKPOINT that is parked at successive stages of CreateCheckPoint(). Eight scripted interleavings (S1-S4 shrink, E1-E4 expand) drive both the resize and the checkpointer forward through their injection points, then crash the server and verify crash recovery.

Supporting changes:

  • src/backend/access/transam/xlog.c: adds three INJECTION_POINT_LOAD calls outside the critical section for the new checkpoint stages (checkpoint-before-redo, checkpoint-before-redo-wal, checkpoint-after-redo-wal), plus the matching INJECTION_POINT_CACHED calls inside CreateCheckPoint(). The 'wait' action needs shmem allocation, which is why the LOADs are outside the critical section.

  • src/test/modules/injection_points: adds injection_points_has_waiter(text) RETURNS boolean, a non-destructive lookup useful as a poll target in TAP tests. Unlike a wait_for_event style poll against pg_stat_activity, this function only takes the injection_points shmem spinlock, so it cannot deadlock against a backend holding ProcArrayLock.

  • src/test/buffermgr/meson.build: registers the two new tests.

  • src/test/buffermgr/t/ResizeBuffer/Utils.pm: shared helpers used by both 006 and 007 (wait_injection_point via injection_points_has_waiter, detach-before-wake, %point_backend mapping, wakeup_all_known_points for END-block cleanup).
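
The load-outside, run-cached-inside arrangement described for xlog.c can be illustrated with a small standalone model. This is only a sketch: `model_point_load`, `model_point_run_cached`, `model_create_checkpoint`, and `crit_section_count` are hypothetical stand-ins rather than PostgreSQL's macros, and firing all three stages inside one critical section is a simplification the description above does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_POINTS 8

/* Stand-in for the process-local cache that a LOAD call fills. */
static const char *cached_points[MAX_POINTS];
static int  n_cached = 0;
static int  crit_section_count = 0; /* models a critical-section counter */
static int  n_fired = 0;

/* Models the LOAD step: it may allocate, so refuse inside a critical section. */
static bool
model_point_load(const char *name)
{
    if (crit_section_count > 0 || n_cached >= MAX_POINTS)
        return false;
    cached_points[n_cached++] = name;
    return true;
}

/* Models the CACHED step: a lookup in the local cache, with no allocation. */
static bool
model_point_run_cached(const char *name)
{
    for (int i = 0; i < n_cached; i++)
    {
        if (strcmp(cached_points[i], name) == 0)
        {
            n_fired++;
            return true;
        }
    }
    return false;
}

/* Load every stage first, then fire them while the critical section is open. */
static int
model_create_checkpoint(void)
{
    static const char *const stages[] = {
        "checkpoint-before-redo",
        "checkpoint-before-redo-wal",
        "checkpoint-after-redo-wal",
    };
    const size_t n_stages = sizeof(stages) / sizeof(stages[0]);

    for (size_t i = 0; i < n_stages; i++)
        (void) model_point_load(stages[i]);

    crit_section_count++;       /* entering the critical section */
    for (size_t i = 0; i < n_stages; i++)
        (void) model_point_run_cached(stages[i]);
    crit_section_count--;       /* leaving the critical section */

    return n_fired;
}
```

In this model a load attempted while the counter is nonzero is rejected, which mirrors why the LOAD calls have to come before the critical section starts.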

All 6 buffermgr TAP tests pass together (599 subtests) with enable_injection_points=yes and EXTRA_INSTALL='contrib/pg_buffercache src/test/modules/injection_points'.

An earlier iteration of the pg_resize_shared_buffers() TAP tests added a custom C helper, injection_points_has_waiter(), so tests could poll for a backend parked at a named injection point.  This has been replaced with upstream's $node->wait_for_event() (which polls pg_stat_activity.wait_event), the same helper used by src/test/buffermgr/t/004_client_join_buffer_resize.pl and other upstream tests.

Drop the C helper and its SQL binding.

Add t/002_resize_smoke.pl, a minimal sanity test that exercises pg_resize_shared_buffers() with no injection points, no concurrent workload, and no crash injection.  Failures here indicate the resize feature is fundamentally broken; failures in the checkpoint-race tests (006, 007) should not be trusted until this baseline is green.

Covers shrink round-trip (4MB -> 1MB -> 4MB), extreme range (256kB -> 32MB), and persistence across a clean restart.

The 006 and 007 crash-recovery tests run bt_index_check() on pgbench_accounts_pkey after recovery to verify that mid-resize crashes have not corrupted heap or index pages.  Without amcheck in EXTRA_INSTALL, 'make check' fails to load the extension.

meson already installs amcheck as part of the standard build.

Follow-up review pass on 006 and 007 and their shared Utils module. No behavioral change apart from stronger post-recovery checks:

* 006: promote pgbench sums invariant from a note to an is()
  assertion; drop dead 'page verification failed' unlike (this tree
  does not enable data_checksums, so the regex can never match); drop
  count(*) as it is subsumed by bt_index_check().

* 007: replace the copy-pasted per-scenario blocks with a data
  table driven by named injection-point constants; add
  bt_index_check() and the sums invariant so post-recovery integrity
  coverage matches 006; rewrite the 'ordering is approximate'
  comment now that the invariants are understood.

* Utils.pm: move %point_backend above attach_injection_point() so
  both attach and wait validate point names against it (typos die
  at attach time, not later at wait); drop the pgbench_extended
  and cointoss dead branches; trim historical/tutorial comments
  down to actionable ones.
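
The two ideas in this pass, a scenario table driven by named constants and a single mapping that both attach and wait validate against, can be sketched in a standalone model. The real helpers live in Perl in Utils.pm; the C below is only an illustration, and the resize point names, the backend labels, and the S1/E1 pairings are hypothetical, not taken from the tests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Named constants: a misspelled identifier fails at compile time. */
#define CKPT_BEFORE_REDO     "checkpoint-before-redo"
#define CKPT_BEFORE_REDO_WAL "checkpoint-before-redo-wal"
#define CKPT_AFTER_REDO_WAL  "checkpoint-after-redo-wal"
#define RESIZE_STEP_A        "hypothetical-resize-step-a"   /* hypothetical */
#define RESIZE_STEP_B        "hypothetical-resize-step-b"   /* hypothetical */

/* Models %point_backend: which kind of process parks at each point. */
typedef struct
{
    const char *point;
    const char *backend;    /* assumed labels, for illustration only */
} PointBackend;

static const PointBackend point_backend[] = {
    {CKPT_BEFORE_REDO, "checkpointer"},
    {CKPT_BEFORE_REDO_WAL, "checkpointer"},
    {CKPT_AFTER_REDO_WAL, "checkpointer"},
    {RESIZE_STEP_A, "client backend"},
    {RESIZE_STEP_B, "client backend"},
};

/* Shared validation used by both attach and wait: NULL means unknown point. */
static const char *
lookup_backend(const char *point)
{
    for (size_t i = 0; i < sizeof(point_backend) / sizeof(point_backend[0]); i++)
    {
        if (strcmp(point_backend[i].point, point) == 0)
            return point_backend[i].backend;
    }
    return NULL;
}

/* One row per interleaving instead of one copy-pasted block per scenario. */
typedef struct
{
    const char *label;
    const char *resize_point;
    const char *checkpoint_point;
} Scenario;

static const Scenario scenarios[] = {
    {"S1", RESIZE_STEP_A, CKPT_BEFORE_REDO},    /* hypothetical pairing */
    {"E1", RESIZE_STEP_B, CKPT_AFTER_REDO_WAL}, /* hypothetical pairing */
};

/* Counts the scenarios whose points are both present in the mapping. */
static int
count_valid_scenarios(void)
{
    int         valid = 0;

    for (size_t i = 0; i < sizeof(scenarios) / sizeof(scenarios[0]); i++)
    {
        if (lookup_backend(scenarios[i].resize_point) != NULL &&
            lookup_backend(scenarios[i].checkpoint_point) != NULL)
            valid++;
    }
    return valid;
}
```

Because every consumer goes through `lookup_backend`, a misspelled point name is rejected at the first lookup rather than surfacing later as a wait that never completes.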

Test count is stable at 33+33=66 after the cleanup, running in ~20s combined.