TL;DR
On a graph of ~60K nodes / ~63K edges (~100 MB on disk), Ladybug 0.16.x exhibits two failure modes from a Python embedding (ladybug / _lbug.cpython-314-x86_64-linux-gnu.so):
1. Per-write throughput collapse during a CPU-bound mutation phase. ~17K node upserts that complete in ~11 seconds on an in-memory NetworkX backend make no observable progress in >13 minutes on Ladybug. Page-thrash signatures (RSS oscillating 100 MB ↔ 1.5 GB while VSZ holds 18 GB) suggest mmap'd page churn under per-write checkpoint accounting.
2. SIGSEGV inside lbug::storage::NodeTable::checkpoint(...) during lbug::main::Database::~Database(): the destructor's checkpoint flush at process shutdown faults inside the page allocator.
Tested versions: 0.16.0 (initial repro) and 0.16.1 (re-tested after upgrade: same throughput collapse; killed at 16 min before reproducing the segfault).
Environment
OS: Arch Linux, kernel 6.19.11
CPU: 20-core x86_64
RAM: 15 GiB
Python: 3.14.4 (CPython, system-installed)
ladybug: 0.16.0 → 0.16.1
Workload: Graph-RAG construction pipeline; ~60K-node graph from a real ~5.6 GB source corpus
Storage tier: Single embedded ladybug.Database(<dir>) + ladybug.Connection, lifetime = process
Repro
The pipeline is graph-RAG construction with discrete stages. We checkpoint between stages as NDJSON. Repro shape:
    import ladybug

    db = ladybug.Database("./ladybug.db")
    conn = ladybug.Connection(db)

    # Schema (real schema is one Node table with payload JSON + one Edge rel):
    conn.execute("CREATE NODE TABLE Node(iri STRING, label STRING, payload STRING, PRIMARY KEY(iri));")
    conn.execute("CREATE REL TABLE Edge(FROM Node TO Node, label STRING, cost DOUBLE, properties STRING);")

    # Bulk-load a Stage-2 NDJSON checkpoint (~63K nodes + ~63K edges, ~33 MB + ~20 MB on disk):
    # (in our wrapper this goes through buffered COPY FROM, ~1 minute)

    # Then run an in-process mutation phase:
    #   - ~17K node upserts (UPDATE/MERGE-style)
    #   - ~150K edge upserts
    #   - all small, all interleaved, single connection
    #
    # Equivalent code on a pure-Python NetworkX backend: ~11 seconds.
    # On Ladybug 0.16.0 / 0.16.1: did not observably progress in >13 minutes.

    # Process exit:
    conn.close()
    db.close()  # ← still in the explicit close path
    # Process exit triggers Python GC → ~Database() in C++ → SEGV (when reachable)
We have an NDJSON dump that reproduces the loaded state at the start of the mutation phase. Happy to share if useful.
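For anyone trying to reproduce the mutation phase without our wrapper, the following is a minimal timing harness, not our actual pipeline code. The MERGE/SET statement, the `$`-style parameter binding, and the names `UPSERT_NODE` and `run_upserts` are all assumptions made for illustration; the only call taken from the repro above is `conn.execute`.

```python
import time
from typing import Any, Callable, Iterable, Mapping

# Hypothetical upsert statement. The MERGE/SET shape and $-parameters are
# assumptions for illustration, not the query our pipeline issues.
UPSERT_NODE = (
    "MERGE (n:Node {iri: $iri}) "
    "SET n.label = $label, n.payload = $payload"
)


def run_upserts(
    execute: Callable[[str, Mapping[str, Any]], Any],
    rows: Iterable[Mapping[str, Any]],
    report_every: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> list[tuple[int, float]]:
    """Issue one upsert per row; return (rows_done, elapsed_seconds) marks."""
    marks: list[tuple[int, float]] = []
    start = clock()
    done = 0
    for row in rows:
        execute(UPSERT_NODE, row)
        done += 1
        if done % report_every == 0:
            marks.append((done, clock() - start))
    # Record a final mark if the last batch was partial.
    if done % report_every != 0:
        marks.append((done, clock() - start))
    return marks
```

Passing the backend's execute callable (for example `conn.execute`, assuming it accepts a parameter mapping as its second argument) keeps the harness backend-agnostic, so the same loop can be timed against NetworkX-style and Ladybug-style adapters. If no mark is ever recorded, that matches the "first progress event never fired" outcome reported below.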
Symptoms
1. Throughput collapse
Same workload, two backends:
| Backend | Stage-3 mutation phase wall-clock | Outcome |
| --- | --- | --- |
| NetworkX in-memory | 11.1 seconds | clean completion |
| Ladybug 0.16.0 | ran 22 minutes, then SIGSEGV in destructor | crash |
| Ladybug 0.16.1 | killed at 16 minutes after first progress event never fired | indeterminate |
Resource trace during the 16-minute Ladybug 0.16.1 run (3-second sampling):
VSZ steady at ~17 GB; RSS oscillates wildly between ~100 MB and ~1.5 GB; system available memory drops to ~1 GB. Symptom is consistent with mmap'd page eviction under heavy checkpoint accounting, not with steady forward progress.
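A trace like this can be collected with a small sampler that reads `/proc/<pid>/status` on Linux. The sketch below is one way to do it, not the script we used; the function names `parse_vm_status` and `sample_process` are hypothetical, and the 3-second default only mirrors the sampling interval mentioned above.

```python
import re
import time
from pathlib import Path

# Matches the VmSize (VSZ) and VmRSS (RSS) lines of /proc/<pid>/status.
_FIELD = re.compile(r"^(VmSize|VmRSS):\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vm_status(text: str) -> dict[str, int]:
    """Extract VmSize and VmRSS, in kB, from /proc/<pid>/status text."""
    return {name: int(kb) for name, kb in _FIELD.findall(text)}


def sample_process(pid: int, interval_s: float = 3.0, samples: int = 5):
    """Yield (elapsed_seconds, vsz_kb, rss_kb) for a running Linux process."""
    status = Path(f"/proc/{pid}/status")
    start = time.monotonic()
    for i in range(samples):
        fields = parse_vm_status(status.read_text())
        yield (
            time.monotonic() - start,
            fields.get("VmSize", 0),
            fields.get("VmRSS", 0),
        )
        if i + 1 < samples:
            time.sleep(interval_s)
```

Running the sampler from a separate process against the pipeline's PID avoids perturbing the workload being measured.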
2. Destructor segfault (0.16.0 only — couldn't reproduce on 0.16.1 because we never got past throughput collapse)
Notes on the coredumpctl info trace for the SIGSEGV process:
Frames 0–13 are inside the same shared object; only frames 14, 15, and 19 carry symbols (the Python wheel ships stripped, apart from the demangleable public C++ ABI).
The crash is not during the user's mutation calls; it occurs during ~Database(), after Connection::close() and Database::close() were both called explicitly from Python.
The DB on disk was 100 MB at the moment of the crash; RSS was ~3.85 GB.
BLAS thread workers in the dump are all idle on pthread_cond_wait.
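Until a symbolized build is available, one cheap addition to the repro is Python's standard `faulthandler` module, which prints the Python-level stack of every thread when the process receives SIGSEGV. It does not resolve the stripped C++ frames; it would only confirm from the Python side whether the fault fires during interpreter teardown. The helper name `enable_crash_traces` is hypothetical.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None) -> None:
    """Dump Python tracebacks for all threads on SIGSEGV, SIGFPE, SIGABRT, SIGBUS.

    `stream` must be a real file object with a file descriptor and must stay
    open for as long as the handler is enabled.
    """
    faulthandler.enable(
        file=stream if stream is not None else sys.stderr,
        all_threads=True,
    )
```

Calling this at the top of the pipeline entry point, before the database is opened, is enough; the same effect is available without code changes by running with `python -X faulthandler`.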
Hypotheses
NodeTable::checkpoint page-allocator boundary bug. The destructor's final flush dereferences a page that was either never allocated or already freed. The deterministic crash site (NodeTable::checkpoint) points at row-storage flush rather than catalog or rel-table flush.
The throughput collapse and the destructor crash are likely linked: if a write-amplifying path leaves the storage in a state where the destructor's flush has to do unbounded work, the page allocator may eventually reach a corrupted state.
What we need to make progress
Whether 0.16.x has a known per-Cypher-statement checkpoint cost (and a PRAGMA/option to disable mid-run autocheckpoints).
A debug-symbols build of the Python wheel, or a build-with-symbols recipe, so we can resolve the unsymbolized frames inside _lbug.so.
Whether CHECKPOINT; issued explicitly before db.close() would no-op the destructor flush (we tested the command exists; haven't yet confirmed it short-circuits the destructor path).
We have the full coredump (570 MB compressed) and an NDJSON corpus that reproduces the loaded state. Happy to share a minimal reproducer once we know what shape would be most useful.
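The explicit-checkpoint experiment we plan to run looks roughly like the sketch below. It assumes only what the repro already uses (`conn.execute`, `conn.close`, `db.close`) plus the `CHECKPOINT;` statement we confirmed exists; the helper name `checkpoint_and_close` is hypothetical, and whether the checkpoint actually shortens or skips the destructor flush is exactly what remains unconfirmed.

```python
import time


def checkpoint_and_close(conn, db, clock=time.perf_counter) -> dict[str, float]:
    """Run CHECKPOINT, then close connection and database; time each step.

    `conn` and `db` are duck-typed: anything exposing execute()/close() and
    close() respectively.
    """
    steps = (
        ("checkpoint", lambda: conn.execute("CHECKPOINT;")),
        ("conn.close", conn.close),
        ("db.close", db.close),
    )
    timings: dict[str, float] = {}
    for name, step in steps:
        t0 = clock()
        step()
        timings[name] = clock() - t0
    return timings
```

If the explicit checkpoint absorbs most of the wall-clock time and the subsequent close and process exit return quickly, that would suggest the destructor flush has become a no-op; a crash that moves into the checkpoint step would point at the same code path being reachable outside the destructor.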