Database::~Database() segfault during checkpoint flush + per-write throughput collapse on ~60K-node graphs (0.16.0 and 0.16.1)

## TL;DR

On a graph of ~60K nodes / ~63K edges (~100 MB on disk), Ladybug 0.16.x exhibits two failure modes when embedded in Python (`ladybug` / `_lbug.cpython-314-x86_64-linux-gnu.so`):

1. **Per-write throughput collapse during a CPU-bound mutation phase.** A mutation phase of ~17K node upserts and ~150K edge upserts that completes in **~11 seconds** on an in-memory NetworkX backend makes no observable progress in **>13 minutes** on Ladybug. The page-thrash signature (RSS oscillating between ~100 MB and ~1.5 GB while VSZ holds at ~17 GB) suggests mmap'd page churn under per-write checkpoint accounting.
2. **`SIGSEGV` under `lbug::storage::NodeTable::checkpoint(...)` during `lbug::main::Database::~Database()`**: the destructor's checkpoint flush at process shutdown faults in stripped frames beneath `NodeTable::checkpoint`, apparently inside the page allocator.

Tested versions: **0.16.0** (initial repro) and **0.16.1** (re-tested after upgrade: same throughput collapse; the run was killed at 16 min, so the destructor path where the segfault occurs was never reached).

## Environment

| | |
|---|---|
| OS | Arch Linux, kernel 6.19.11 |
| CPU | 20-core x86_64 |
| RAM | 15 GiB |
| Python | 3.14.4 (CPython, system-installed) |
| `ladybug` | 0.16.0 → 0.16.1 |
| Workload | Graph-RAG construction pipeline; ~60K-node graph from a real ~5.6 GB source corpus |
| Storage tier | Single embedded `ladybug.Database(<dir>)` + `ladybug.Connection`, lifetime = process |

## Repro

The pipeline is graph-RAG construction with discrete stages. We snapshot pipeline state between stages as NDJSON files (these pipeline checkpoints are unrelated to Ladybug's storage checkpoint). Repro shape:

```python
import ladybug
db = ladybug.Database("./ladybug.db")
conn = ladybug.Connection(db)

# Schema (real schema is one Node table with payload JSON + one Edge rel):
conn.execute("CREATE NODE TABLE Node(iri STRING, label STRING, payload STRING, PRIMARY KEY(iri));")
conn.execute("CREATE REL TABLE Edge(FROM Node TO Node, label STRING, cost DOUBLE, properties STRING);")

# Bulk-load a Stage-2 NDJSON checkpoint (~63K nodes + ~63K edges, ~33 MB + ~20 MB on disk):
# (in our wrapper this goes through buffered COPY FROM, ~1 minute)

# Then run an in-process mutation phase:
# - ~17K node upserts (UPDATE/MERGE-style)
# - ~150K edge upserts
# - all small, all interleaved, single connection
#
# Equivalent code on a pure-Python NetworkX backend: ~11 seconds.
# On Ladybug 0.16.0 / 0.16.1: did not observably progress in >13 minutes.

# Explicit shutdown, then process exit:
conn.close()
db.close()    # explicit close path; both calls return
# Interpreter finalization then runs ~Database() in C++ → SIGSEGV (0.16.0, when the run gets this far)
```
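To turn "did not observably progress" into a number, the mutation loop can be wrapped in a small progress timer. The sketch below is not part of our pipeline; `timed_statements` and its parameters are hypothetical names, and it only assumes that each upsert can be expressed as a single callable.

```python
import time
from typing import Any, Callable, Iterable, Iterator


def timed_statements(
    run: Callable[[Any], None],
    items: Iterable[Any],
    report_every: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[dict]:
    """Call run(item) for each item; yield a progress record every
    `report_every` statements, plus a final record for any remainder."""
    start = last = clock()
    done = 0
    for item in items:
        run(item)
        done += 1
        if done % report_every == 0:
            now = clock()
            yield {
                "done": done,
                "elapsed_s": now - start,
                # Statements per second over the most recent window only.
                "window_rate": report_every / max(now - last, 1e-9),
            }
            last = now
    remainder = done % report_every
    if remainder:
        now = clock()
        yield {
            "done": done,
            "elapsed_s": now - start,
            "window_rate": remainder / max(now - last, 1e-9),
        }
```

With a connection in hand, something like `for rec in timed_statements(lambda row: conn.execute(UPSERT, row), rows): print(rec)` would log a rate per window, assuming `conn.execute` accepts a parameter dict as its second argument (we have not verified this against the 0.16.x API). A run that never emits the first record would confirm that fewer than `report_every` statements completed.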

We have an NDJSON dump that reproduces the loaded state at the start of the mutation phase. Happy to share if useful.

## Symptoms

### 1. Throughput collapse

Same workload, two backends:

| Backend | Stage-3 mutation phase wall-clock | Outcome |
|---|---|---|
| NetworkX in-memory | **11.1 seconds** | clean completion |
| Ladybug 0.16.0 | ran 22 minutes, then SIGSEGV in destructor | crash |
| Ladybug 0.16.1 | killed at 16 minutes; the first progress event never fired | indeterminate |

Resource trace during the 16-minute Ladybug 0.16.1 run (3-second sampling):

```
elapsed   %CPU   RSS_MB   VSZ_MB   sys_avail_MB  load1
01:01     111    925      17710    5656          5.55
07:13      99    925      17710    5417          7.27
12:53     104    100      17710    1302          7.25
14:08     104    106      17710    1192          7.25
16:15     103    100      17710    1239          3.98
```

VSZ holds steady at ~17 GB, RSS swings between ~100 MB and ~1.5 GB over the full 3-second trace (only five samples are excerpted above), and system available memory drops to ~1 GB. This is consistent with mmap'd page eviction under heavy checkpoint accounting, not with steady forward progress.
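For anyone who wants to collect a comparable trace, a minimal Linux-only sampler could look like the following. This is a sketch, not the script that produced the table above: the function names are hypothetical, `%CPU` is omitted, and the values are KiB from `/proc` divided by 1024.

```python
import os
import re


def parse_proc_status(text: str) -> dict:
    """Extract VmRSS and VmSize (reported in kB) from /proc/<pid>/status text."""
    out = {}
    for key, name in (("VmRSS", "rss_mb"), ("VmSize", "vsz_mb")):
        m = re.search(rf"^{key}:\s+(\d+)\s+kB", text, re.MULTILINE)
        if m:
            out[name] = int(m.group(1)) // 1024
    return out


def parse_mem_available(text: str) -> int:
    """Return MemAvailable from /proc/meminfo text in MB, or -1 if absent."""
    m = re.search(r"^MemAvailable:\s+(\d+)\s+kB", text, re.MULTILINE)
    return int(m.group(1)) // 1024 if m else -1


def sample(pid: int) -> dict:
    """Take one sample for `pid` by reading /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        row = parse_proc_status(f.read())
    with open("/proc/meminfo") as f:
        row["sys_avail_mb"] = parse_mem_available(f.read())
    row["load1"] = os.getloadavg()[0]
    return row
```

Calling `sample(pid)` every 3 seconds from a separate process and printing the rows would give the RSS, VSZ, available-memory, and load columns without touching the process under test.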

### 2. Destructor segfault (0.16.0 only — couldn't reproduce on 0.16.1 because we never got past throughput collapse)

`coredumpctl info` for the SIGSEGV process:

```
Signal: 11 (SEGV)
Module libscipy_openblas64_-32a4b2a6.so without build-id.
Stack trace of thread <main>:
#0  ...                                           (_lbug ... + 0x5eb3da)
#1  ...                                           (_lbug ... + 0x66e6c6)
#2  ...                                           (_lbug ... + 0x66e9ce)
#3  ...                                           (_lbug ... + 0x612878)
#4  ...                                           (_lbug ... + 0x5ffd95)
#5  ...                                           (_lbug ... + 0xc78012)
#6  ...                                           (_lbug ... + 0x64a2fb)
#7  ...                                           (_lbug ... + 0x601ef7)
#8  ...                                           (_lbug ... + 0x601a21)
#9  ...                                           (_lbug ... + 0x64b289)
#10 ...                                           (_lbug ... + 0x603f51)
#11 ...                                           (_lbug ... + 0xc815c1)
#12 ...                                           (_lbug ... + 0xc820b5)
#13 ...                                           (_lbug ... + 0xc843df)
#14 lbug::storage::NodeTable::checkpoint(
        ClientContext*, TableCatalogEntry*,
        PageAllocator&, const Transaction*, ulong)   ← crash site
#15 lbug::storage::StorageManager::checkpoint(
        ClientContext*, const Transaction&,
        PageAllocator&,
        const unordered_map<ulong, ulong>&)
#16 ...                                           (_lbug ... + 0xca7050)
#17 ...                                           (_lbug ... + 0x67f476)
#18 ...                                           (_lbug ... + 0x67f5d8)
#19 lbug::main::Database::~Database()             ← destructor
#20 ...
#21 ...
#22 ...
#23 _PyObject_MakeTpCall                          ← Python finalization
```

Notes on this trace:

- Frames 0–13 and 16–18 are unsymbolized offsets inside `_lbug`; only frames 14, 15, and 19 carry C++ symbols (the Python wheel ships stripped, apart from its exported, demangleable C++ symbols).
- The crash is **not** during the user's mutation calls. It happens in `~Database()`, after `Connection::close()` and `Database::close()` were both called explicitly from Python.
- At the moment of the crash the DB was ~100 MB on disk and process RSS was ~3.85 GB.
- BLAS thread workers in the dump are all idle on `pthread_cond_wait`.
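Once a build with symbols exists, the raw offsets from the stripped frames can be pulled out mechanically and resolved. The helper below is a hypothetical sketch that parses the `(_lbug ... + 0x...)` pattern shown in the trace; it has only been reasoned about against that shape of line.

```python
import re

# Matches frames such as "#5  ... (_lbug ... + 0xc78012)".
FRAME_RE = re.compile(r"^#(\d+)\s+.*\(_lbug\b.*\+\s*(0x[0-9a-fA-F]+)\)\s*$")


def lbug_offsets(trace: str) -> list[tuple[int, int]]:
    """Return (frame_number, module_offset) for each unsymbolized _lbug frame."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), int(m.group(2), 16)))
    return frames
```

Each offset could then be passed to `addr2line -f -C -e <unstripped _lbug .so> 0x5eb3da`, on the assumption that the module offset reported by `coredumpctl` matches the ELF virtual address (typically true for a shared object whose first load segment starts at 0) and that the unstripped build matches the shipped wheel exactly.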

## Hypotheses

1. **Per-Cypher checkpoint cost.** This resembles the heap-retention pattern in kuzudb/kuzu#4797, which was still unfixed when that repository was archived. If 0.16.x carries a per-statement checkpoint or compaction overhead, ~17K UPDATE-shape statements on ~60K nodes could produce the throughput / page-thrash signature we see.
2. **NodeTable::checkpoint page-allocator boundary bug.** The destructor's final flush dereferences a page that was either never allocated or already freed. The innermost symbolized frame (`NodeTable::checkpoint`) points at row-storage flush rather than catalog or rel-table flush.

These two hypotheses are likely linked: if a write-amplifying path leaves the storage in a state where the destructor's flush has to do unbounded work, the page allocator may eventually reach a corrupted state.
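One experiment that might separate hypothesis 1 from a general write-path slowdown is to cut the statement count while keeping the row count fixed. The sketch below batches node upserts into one statement per 1,000 rows. It is untested against Ladybug: the `UNWIND ... MERGE` form, passing a list of dicts as `$rows`, and `conn.execute(query, params)` are all assumptions carried over from Kuzu-style Cypher, and `upsert_nodes` / `batched` are hypothetical helpers.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Assumed Cypher; table and property names follow the schema in the repro.
UPSERT_NODES = (
    "UNWIND $rows AS r "
    "MERGE (n:Node {iri: r.iri}) "
    "SET n.label = r.label, n.payload = r.payload"
)


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def upsert_nodes(conn: Any, rows: Iterable[dict], batch_size: int = 1000) -> int:
    """Issue one statement per batch; return the number of statements sent."""
    statements = 0
    for batch in batched(rows, batch_size):
        conn.execute(UPSERT_NODES, {"rows": batch})
        statements += 1
    return statements
```

If ~17K upserts sent as ~17 statements finish quickly while the per-row form stalls, that would point at per-statement overhead; if both stall, the cost is more likely per-row.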

## What we need to make progress

- Whether 0.16.x has a known per-Cypher-statement checkpoint cost (and a `PRAGMA`/option to disable mid-run autocheckpoints).
- A debug-symbols build of the Python wheel, or a build-with-symbols recipe, so we can resolve the unsymbolized frames inside `_lbug.so`.
- Whether `CHECKPOINT;` issued explicitly before `db.close()` would turn the destructor flush into a no-op (we confirmed the command exists; we have not yet confirmed that it short-circuits the destructor path).
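The shutdown ordering we intend to test for the last point can be sketched as follows. `shutdown` is a hypothetical helper; the only Ladybug-specific pieces it relies on are the `CHECKPOINT;` statement and the `close()` methods already used in the repro, and whether this avoids the destructor flush is exactly the open question.

```python
from typing import Any


def shutdown(conn: Any, db: Any, checkpoint: bool = True) -> list[str]:
    """Checkpoint explicitly, then close the connection and the database.

    Returns the steps that completed, so a crash log shows how far we got.
    """
    steps = []
    if checkpoint:
        conn.execute("CHECKPOINT;")
        steps.append("checkpoint")
    conn.close()
    steps.append("conn.close")
    db.close()
    steps.append("db.close")
    return steps
```

Running the same workload with `checkpoint=True` and `checkpoint=False` and comparing whether `~Database()` still faults would show whether the explicit checkpoint changes the destructor's behavior.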

We have the full coredump (570 MB compressed) and an NDJSON corpus that reproduces the loaded state. Happy to share a minimal reproducer once we know what shape would be most useful.

