Database::~Database() segfault during checkpoint flush + per-write throughput collapse on ~60K-node graphs (0.16.0 and 0.16.1) #452

@DennisRathgeb

TL;DR

On a graph of ~60K nodes / ~63K edges (~100 MB on disk), Ladybug 0.16.x exhibits two failure modes when embedded in Python (ladybug / _lbug.cpython-314-x86_64-linux-gnu.so):

  1. Per-write throughput collapse during a CPU-bound mutation phase. A mutation phase (~17K node upserts plus ~150K edge upserts) that completes in ~11 seconds on an in-memory NetworkX backend makes no observable progress in >13 minutes on Ladybug. Page-thrash signatures (RSS oscillating between ~100 MB and ~1.5 GB while VSZ holds at ~17.7 GB) suggest mmap'd page churn under per-write checkpoint accounting.
  2. SIGSEGV inside lbug::storage::NodeTable::checkpoint(...) during lbug::main::Database::~Database(): the destructor's checkpoint flush at process shutdown faults inside the page allocator.

Tested versions: 0.16.0 (initial repro), 0.16.1 (re-tested after upgrade — same throughput collapse; killed at 16 min before reproducing the segfault).

Environment

OS            Arch Linux, kernel 6.19.11
CPU           20-core x86_64
RAM           15 GiB
Python        3.14.4 (CPython, system-installed)
ladybug       0.16.0 → 0.16.1
Workload      Graph-RAG construction pipeline; ~60K-node graph from a real ~5.6 GB source corpus
Storage tier  Single embedded ladybug.Database(<dir>) + ladybug.Connection, lifetime = process

Repro

The pipeline is graph-RAG construction with discrete stages. Between stages we snapshot the pipeline state as NDJSON (these are application-level checkpoints, not Ladybug's storage checkpoint). Repro shape:

import ladybug
db = ladybug.Database("./ladybug.db")
conn = ladybug.Connection(db)

# Schema (real schema is one Node table with payload JSON + one Edge rel):
conn.execute("CREATE NODE TABLE Node(iri STRING, label STRING, payload STRING, PRIMARY KEY(iri));")
conn.execute("CREATE REL TABLE Edge(FROM Node TO Node, label STRING, cost DOUBLE, properties STRING);")

# Bulk-load a Stage-2 NDJSON checkpoint (~63K nodes + ~63K edges, ~33 MB + ~20 MB on disk):
# (in our wrapper this goes through buffered COPY FROM, ~1 minute)

# Then run an in-process mutation phase:
# - ~17K node upserts (UPDATE/MERGE-style)
# - ~150K edge upserts
# - all small, all interleaved, single connection
#
# Equivalent code on a pure-Python NetworkX backend: ~11 seconds.
# On Ladybug 0.16.0 / 0.16.1: did not observably progress in >13 minutes.

# Process exit: both handles are closed explicitly first.
conn.close()
db.close()    # ← still in the explicit close path
# Process exit then triggers Python GC → ~Database() in C++ → SIGSEGV
# (only when the run gets this far)

We have an NDJSON dump that reproduces the loaded state at the start of the mutation phase. Happy to share if useful.
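
For reference, the mutation phase has roughly the shape sketched below: one small parameterized statement per item, interleaved in arrival order on a single connection. This is not our pipeline code. The function name `run_mutation_phase`, the `(kind, params)` layout, and the exact Cypher text are illustrative, and passing a parameter dict as the second argument to `conn.execute` is an assumption carried over from the Kuzu-style Python API.

```python
from typing import Any, Dict, Iterable, Tuple

# Illustrative Cypher against the Node/Edge schema above; the real
# pipeline's statements differ in detail.
NODE_UPSERT = (
    "MERGE (n:Node {iri: $iri}) "
    "SET n.label = $label, n.payload = $payload"
)
EDGE_UPSERT = (
    "MATCH (a:Node {iri: $src}), (b:Node {iri: $dst}) "
    "MERGE (a)-[e:Edge {label: $label}]->(b) "
    "SET e.cost = $cost, e.properties = $properties"
)
STATEMENTS = {"node": NODE_UPSERT, "edge": EDGE_UPSERT}


def run_mutation_phase(conn: Any, ops: Iterable[Tuple[str, Dict[str, Any]]]) -> int:
    """Issue one small upsert per op, in arrival order; return the statement count.

    `conn` only needs an `execute(query, params)` method (assumed signature).
    """
    issued = 0
    for kind, params in ops:
        conn.execute(STATEMENTS[kind], params)
        issued += 1
    return issued
```

With ~17K "node" ops and ~150K "edge" ops this issues ~167K individual statements, which is the load that stalls.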

Symptoms

1. Throughput collapse

Same workload, two backends:

Backend              Stage-3 mutation phase wall-clock                             Outcome
NetworkX in-memory   11.1 seconds                                                  clean completion
Ladybug 0.16.0       ran 22 minutes, then SIGSEGV in destructor                    crash
Ladybug 0.16.1       killed at 16 minutes after first progress event never fired   indeterminate

Resource trace during the 16-minute Ladybug 0.16.1 run (3-second sampling; selected rows shown):

elapsed   %CPU   RSS_MB   VSZ_MB   sys_avail_MB  load1
01:01     111    925      17710    5656          5.55
07:13      99    925      17710    5417          7.27
12:53     104    100      17710    1302          7.25
14:08     104    106      17710    1192          7.25
16:15     103    100      17710    1239          3.98

VSZ is steady at ~17.7 GB; RSS oscillates between ~100 MB and ~1.5 GB; system available memory drops to ~1 GB on a 15 GiB host. The signature is consistent with mmap'd page eviction under heavy checkpoint accounting, not with steady forward progress.
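
For anyone who wants to collect the same RSS/VSZ columns on their own run, a minimal Linux-only sketch follows. It is not the exact script that produced the trace above. The names `parse_status` and `sample` are ours, and it assumes the standard `VmRSS` / `VmSize` fields of `/proc/<pid>/status`, which the kernel reports in kB.

```python
import re
from typing import Dict

# Matches lines such as "VmRSS:\t  947200 kB" in /proc/<pid>/status.
_FIELD = re.compile(r"^(VmRSS|VmSize):\s+(\d+)\s+kB$", re.MULTILINE)


def parse_status(text: str) -> Dict[str, int]:
    """Extract resident and virtual size in MB (kB // 1024) from status text."""
    kb = {name: int(value) for name, value in _FIELD.findall(text)}
    return {"RSS_MB": kb["VmRSS"] // 1024, "VSZ_MB": kb["VmSize"] // 1024}


def sample(pid: int) -> Dict[str, int]:
    """Read one RSS/VSZ sample for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_status(fh.read())
```

Calling `sample(pid)` every 3 seconds from a side process and logging the result alongside elapsed time gives a trace of the same shape.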

2. Destructor segfault (0.16.0 only — couldn't reproduce on 0.16.1 because we never got past throughput collapse)

coredumpctl info for the SIGSEGV process:

Signal: 11 (SEGV)
Module libscipy_openblas64_-32a4b2a6.so without build-id.
Stack trace of thread <main>:
#0  ...                                           (_lbug ... + 0x5eb3da)
#1  ...                                           (_lbug ... + 0x66e6c6)
#2  ...                                           (_lbug ... + 0x66e9ce)
#3  ...                                           (_lbug ... + 0x612878)
#4  ...                                           (_lbug ... + 0x5ffd95)
#5  ...                                           (_lbug ... + 0xc78012)
#6  ...                                           (_lbug ... + 0x64a2fb)
#7  ...                                           (_lbug ... + 0x601ef7)
#8  ...                                           (_lbug ... + 0x601a21)
#9  ...                                           (_lbug ... + 0x64b289)
#10 ...                                           (_lbug ... + 0x603f51)
#11 ...                                           (_lbug ... + 0xc815c1)
#12 ...                                           (_lbug ... + 0xc820b5)
#13 ...                                           (_lbug ... + 0xc843df)
#14 lbug::storage::NodeTable::checkpoint(
        ClientContext*, TableCatalogEntry*,
        PageAllocator&, const Transaction*, ulong)   ← crash site
#15 lbug::storage::StorageManager::checkpoint(
        ClientContext*, const Transaction&,
        PageAllocator&,
        const unordered_map<ulong, ulong>&)
#16 ...                                           (_lbug ... + 0xca7050)
#17 ...                                           (_lbug ... + 0x67f476)
#18 ...                                           (_lbug ... + 0x67f5d8)
#19 lbug::main::Database::~Database()             ← destructor
#20 ...
#21 ...
#22 ...
#23 _PyObject_MakeTpCall                          ← Python finalization

Notes on this trace:

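  • Frames 0–13 and 16–18 are unsymbolized offsets inside the same shared object (_lbug). Within that object only frames 14, 15, and 19 carry symbols, because the Python wheel ships stripped apart from the demangleable public C++ ABI. The "crash site" marked at frame 14 is therefore the deepest symbolized frame; the faulting instruction itself is at frame 0.
  • The crash does not occur during the user's mutation calls. It occurs during ~Database(), after Connection::close() and Database::close() were both called explicitly from Python.
  • The DB on disk was ~100 MB at the moment of the crash; process RSS was ~3.85 GB.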
  • BLAS thread workers in the dump are all idle on pthread_cond_wait.
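
Once a build with symbols exists, the unsymbolized frames could be resolved from their module offsets. The sketch below pulls `(frame, offset)` pairs out of a coredumpctl-style trace and assembles an `addr2line` command line. The helper names are ours, and it assumes the printed offsets can be passed directly as addresses to `addr2line` (usually true for shared objects whose first load segment starts at 0) and that the symbolized build is otherwise identical to the shipped wheel.

```python
import re
from typing import List, Tuple

# Matches e.g. "#0  ...   (_lbug ... + 0x5eb3da)": frame number, module, offset.
_FRAME = re.compile(r"^#(\d+)\s+.*\((\S+)\s.*\+\s*(0x[0-9a-fA-F]+)\)\s*$")


def module_offsets(trace: str, module_prefix: str = "_lbug") -> List[Tuple[int, str]]:
    """Return (frame number, hex offset) for frames inside the given module."""
    found = []
    for line in trace.splitlines():
        match = _FRAME.match(line.strip())
        if match and match.group(2).startswith(module_prefix):
            found.append((int(match.group(1)), match.group(3)))
    return found


def addr2line_command(so_path: str, offsets: List[Tuple[int, str]]) -> List[str]:
    """Build an argv for binutils addr2line: -e file, -f function names, -C demangle."""
    return ["addr2line", "-e", so_path, "-f", "-C", *[off for _, off in offsets]]
```

The resulting argv can be handed to `subprocess.run`; already-symbolized frames such as 14, 15, and 19 are skipped because they carry no offset.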

Hypotheses

  1. Per-Cypher checkpoint cost. This looks similar to the heap-retention pattern in kuzudb/kuzu#4797 ("Bug: Error: Buffer manager exception: Unable to allocate memory! The buffer pool is full and no memory could be freed!"), which was still unfixed when that repository was archived. If 0.16.x incurs checkpoint or compaction overhead on every statement, ~17K UPDATE-shape statements on ~60K nodes would produce exactly the throughput / page-thrash signature we see.
  2. NodeTable::checkpoint page-allocator boundary bug. The destructor's final flush dereferences a page that was either never allocated or already freed. The deterministic crash site (NodeTable::checkpoint) points at row-storage flush rather than catalog or rel-table flush.

These two hypotheses are likely linked: if a write-amplifying path leaves the storage in a state where the destructor's flush has to do unbounded work, the page allocator may eventually reach a corrupted state.
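
Hypothesis 1 predicts that individual statements are slow, or get slower as the run proceeds. A small harness like the following could tell a flat per-statement cost from a growing one. It is a sketch under assumptions: `timed_execute` is a hypothetical helper name, and `conn.execute(query, params)` taking a parameter dict is assumed rather than confirmed for 0.16.x.

```python
import time
from statistics import median
from typing import Any, Callable, Dict, Iterable, List, Tuple


def timed_execute(
    conn: Any,
    statements: Iterable[Tuple[str, Dict[str, Any]]],
    window: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run statements one by one; return the median latency (seconds) per window."""
    medians: List[float] = []
    bucket: List[float] = []
    for query, params in statements:
        start = clock()
        conn.execute(query, params)
        bucket.append(clock() - start)
        if len(bucket) == window:
            medians.append(median(bucket))
            bucket = []
    if bucket:  # trailing partial window
        medians.append(median(bucket))
    return medians
```

A roughly constant list of medians would point at a fixed per-statement overhead; a rising one would point at cost that scales with accumulated uncheckpointed state.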

What we need to make progress

  • Whether 0.16.x has a known per-Cypher-statement checkpoint cost (and a PRAGMA/option to disable mid-run autocheckpoints).
  • A debug-symbols build of the Python wheel, or a build-with-symbols recipe, so we can resolve the unsymbolized frames inside _lbug.so.
  • Whether CHECKPOINT; issued explicitly before db.close() would no-op the destructor flush (we tested the command exists; haven't yet confirmed it short-circuits the destructor path).
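
The third item could be tested with a shutdown helper along these lines. It uses only calls that already appear in the repro (`conn.execute`, `conn.close`, `db.close`) plus the `CHECKPOINT;` statement; the helper name `shutdown` is ours, and whether the explicit checkpoint actually leaves the destructor with nothing to flush is exactly what we have not confirmed.

```python
from typing import Any


def shutdown(conn: Any, db: Any) -> None:
    """Checkpoint explicitly, then close the connection before the database."""
    try:
        # Explicit flush while the connection is still fully alive.
        conn.execute("CHECKPOINT;")
    finally:
        # Close in dependency order even if the checkpoint raised.
        conn.close()
        db.close()
```

If the destructor still enters NodeTable::checkpoint after this sequence, the flush in ~Database() is unconditional and the explicit checkpoint does not help.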

We have the full coredump (570 MB compressed) and an NDJSON corpus that reproduces the loaded state. Happy to share a minimal reproducer once we know what shape would be most useful.
