Crash diagnostics#13
Conversation
Adds a modular loom::diag subsystem (runtime-only; SDK stays clean):
- breadcrumb.h: per-thread {module,class,phase} set via BreadcrumbScope (stable pointers, signal-safe to read).
- guard.h: diag::guard() wraps module entry points in a breadcrumb + try/catch; a thrown exception is reported and contained instead of terminating the worker thread.
- crash_handler.{h,cpp}: process-global fatal-signal (POSIX sigaction+sigaltstack) / SetUnhandledExceptionFilter (Windows) / std::set_terminate, writing a text crash report (faulting module/phase + signal + build id + raw stack addresses) to <dataDir>/crash before exit. Covers module AND runtime crashes.
- scheduler.cpp: guards preCyclic/cyclic/postCyclic/longRunning + isolated cyclic; on fault sets TaskState::faulted + ModuleState::Error (wiring the previously-dead infra) so the runtime keeps running and skips the faulted module. init() gets a breadcrumb.
- bus.h: dependency-free try/catch around service dispatch (returns a CallResult error; no spdlog/diag dep).
- version.h.in + sdk/CMakeLists.txt: optional git SHA stamp, degrades to "unknown" when git/.git absent (build never fails).
- run.cpp installs the handler; runtime CMake stamps LOOM_BUILD_TYPE.
Unit tests for breadcrumb scope + guard exception path. loom_tests: 112 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Config-selectable fault (throw|segfault|fpe|abort|loop) at init or on the Nth cyclic tick. Used to exercise the guard (exception isolation) and crash handler (signal/SEH report) end to end. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2 of crash diagnostics. The Windows unhandled-exception filter and
terminate handler now resolve raw return addresses to function + file:line
at crash time, so a dev build's crash report is readable with no offline
step and no tools installed.
- runtime/diag/symbolizer.{h,cpp}: ISymbolizer-style free function backed by
cpptrace, isolated to one TU. cpptrace reads PDB/DWARF directly, so it works
despite modules' hidden visibility and needs no -rdynamic. Never called from
the POSIX signal path (it allocates) - only the Windows UEF, terminate
handler, and (later) the offline --symbolize tool.
- crash_handler.cpp (Windows): symbolize captured frames in the UEF and write
"#i 0x.. symbol / at file:line" per frame.
- cpptrace/0.8.3 wired runtime-only (root + runtime conanfile, find_package,
PRIVATE link). SDK stays glaze-only.
Verified: crasher segfault report pins the fault to doFault @ crasher.cpp:30
<- Crasher::cyclic() @ crasher.cpp:47 through the scheduler guard, fully
symbolized in-process.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Task 4 of crash diagnostics. Faults are now first-class structured records,
not just log lines, and are queryable over HTTP and the bus.
New loom::diag components (modular; diag stays free of DataEngine/Bus deps):
- fault_report.{h,cpp}: FaultReport model + hand-rolled JSON (sections embed as
real nested JSON, build identity + breadcrumb + symbolized frames).
- fault_sink.h: IFaultSink interface the scheduler depends on (injected).
- fault_store.{h,cpp}: thread-safe in-memory ring + <dataDir>/crash/*.json
persistence; scans the dir on construction so prior-run (signal-path) crashes
surface too.
- runtime_fault_sink.{h,cpp}: concrete sink (runtime layer) — on a module-call
exception it captures the module's live data sections, persists a report, and
publishes it on the loom/faults bus topic.
Wiring:
- scheduler: single recordModuleFault() helper routes all four guarded fault
sites (init/cyclic/postCyclic/longRunning/isolated), stamps TaskState
last-fault fields, and notifies the injected sink. setFaultSink() hook.
- RuntimeCore owns the FaultStore + sink and attaches them to the scheduler.
- server: GET /api/faults (list, newest-first) + GET /api/faults/<id> (full
report); module info now exposes faulted + lastFault{Ms,Phase,Msg}.
- crash_handler (Windows): the UEF/terminate path now writes a structured,
symbolized JSON report (off-signal-constrained) so SEH/segfault crashes also
appear in /api/faults on the next start. POSIX keeps the raw text report.
Verified: throw in cyclic -> module Error + runtime survives + report with live
sections (runtime.cycle captured) at /api/faults; segfault -> symbolized JSON
report, picked up by the store on next start. Unit tests for toJson round-trip
and FaultStore record/list/detail/persistence. 114 tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a cross-cutting crash diagnostics subsystem to Loom’s runtime: thread-local execution breadcrumbs, exception guarding/quarantining for module entrypoints, persistent fault report storage, and HTTP APIs to surface faults to the UI. It also stamps build identity (SDK version + git SHA + build type) into reports and introduces a dedicated “crasher” module plus unit tests to exercise the diagnostics pipeline.
Changes:
- Introduce
loom::diagcomponents: breadcrumb tracking, guarded calls + fault sink, persistentFaultStore, structuredFaultReport, and a process-globalCrashHandler. - Surface fault status and reports via runtime server JSON (
/api/modulesstats additions, plus/api/faultsand/api/faults/:id). - Add build/runtime plumbing: git SHA stamping in SDK,
cpptracedependency for symbolization, and amodules/crashertest module + new unit tests.
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_diag.cpp | New unit tests for breadcrumbs, guard(), FaultReport JSON, and FaultStore persistence. |
| tests/CMakeLists.txt | Adds diag test and links required runtime diag sources into the test binary. |
| sdk/include/loom/version.h.in | Adds kGitSha build identifier for crash reports. |
| sdk/include/loom/bus.h | Prevents exceptions from service handlers unwinding into callers by catching and returning an error. |
| sdk/CMakeLists.txt | Stamps git SHA (and “dirty” state) into generated SDK version header without failing builds when git metadata is absent. |
| runtime/src/server.cpp | Adds module last-fault fields to module JSON and introduces /api/faults endpoints. |
| runtime/src/scheduler.cpp | Wraps module lifecycle entrypoints with diag guard and records/report faults (quarantine + sink notification). |
| runtime/src/runtime_core.cpp | Instantiates FaultStore and wires a RuntimeFaultSink into the scheduler. |
| runtime/src/run.cpp | Installs process-global crash handler writing reports to <dataDir>/crash. |
| runtime/src/diag/symbolizer.cpp | Implements address→symbol resolution via cpptrace. |
| runtime/src/diag/runtime_fault_sink.cpp | Builds/persists/publishes structured fault reports when a module call throws. |
| runtime/src/diag/fault_store.cpp | Implements persistent fault report registry loaded from <dataDir>/crash. |
| runtime/src/diag/fault_report.cpp | Implements structured report JSON serialization with embedded JSON sections. |
| runtime/src/diag/crash_handler.cpp | Implements POSIX/Windows crash capture for fatal signals/SEH and unhandled exceptions. |
| runtime/src/diag/breadcrumb.cpp | Defines the thread_local breadcrumb storage. |
| runtime/include/loom/scheduler.h | Extends task state with last-fault diagnostics fields and adds fault sink plumbing. |
| runtime/include/loom/runtime_core.h | Exposes faultStore() and stores FaultStore + injected sink. |
| runtime/include/loom/diag/symbolizer.h | Public symbolizer API + SymFrame definition (cpptrace kept out of headers). |
| runtime/include/loom/diag/runtime_fault_sink.h | Declares runtime-layer implementation of IFaultSink. |
| runtime/include/loom/diag/guard.h | Adds exception guard wrapper for module calls with breadcrumb attribution. |
| runtime/include/loom/diag/fault_store.h | Declares persistent fault report store API used by the server. |
| runtime/include/loom/diag/fault_sink.h | Declares the scheduler→runtime fault sink interface and event payload. |
| runtime/include/loom/diag/fault_report.h | Declares structured fault report model and JSON serialization API. |
| runtime/include/loom/diag/crash_handler.h | Declares process-global crash handler install API. |
| runtime/include/loom/diag/breadcrumb.h | Declares breadcrumb model, phases, and RAII scope helper. |
| runtime/conanfile.py | Adds cpptrace dependency to runtime package. |
| runtime/CMakeLists.txt | Adds diag sources, defines build-type macro, and links cpptrace privately. |
| modules/crasher/crasher.cpp | Adds a deliberately-faulting module to exercise crash diagnostics paths. |
| modules/crasher/CMakeLists.txt | Builds and installs the new crasher module plugin. |
| modules/CMakeLists.txt | Includes the new crasher module in the build. |
| conanfile.py | Adds cpptrace to top-level Conan deps (runtime-only usage). |
| CMakeLists.txt | Adds find_package(cpptrace REQUIRED) at the project level. |
| .gitignore | Ignores _crashtest/ output directory. |
| #include <csignal> | ||
| #include <cstdlib> | ||
| #include <execinfo.h> | ||
| #include <unistd.h> | ||
|
|
| // async-signal-safe helpers -------------------------------------------------- | ||
| void sWrite(int fd, const char* s) { ssize_t r = ::write(fd, s, std::strlen(s)); (void)r; } | ||
| void sWriteHex(int fd, uintptr_t v) { | ||
| char buf[2 + sizeof(uintptr_t) * 2]; buf[0] = '0'; buf[1] = 'x'; |
| #include "loom/diag/fault_report.h" | ||
|
|
||
| #include <cstdio> | ||
| #include <string> | ||
|
|
| // Last-fault diagnostics (faulted modules are skipped by the scheduler). | ||
| json += ",\"faulted\":" + std::string(ts->faulted.load() ? "true" : "false"); | ||
| json += ",\"lastFaultMs\":" + std::to_string(ts->lastFaultMs.load()); | ||
| json += ",\"lastFaultPhase\":\"" + | ||
| std::string(diag::phaseName(static_cast<diag::Phase>(ts->lastFaultPhase.load()))) + "\""; | ||
| json += ",\"lastFaultMsg\":\"" + jsonEscapeString(ts->lastFaultMsg) + "\""; |
| void Scheduler::recordModuleFault(TaskState& state, LoadedModule& mod, | ||
| diag::Phase phase, std::string_view message) { | ||
| state.faulted.store(true); | ||
| mod.state = ModuleState::Error; | ||
|
|
||
| const int64_t nowMs = static_cast<int64_t>( | ||
| std::chrono::duration_cast<std::chrono::milliseconds>( | ||
| std::chrono::system_clock::now().time_since_epoch()).count()); | ||
| state.lastFaultMs.store(nowMs); | ||
| state.lastFaultPhase.store(static_cast<uint8_t>(phase)); | ||
| { | ||
| const std::size_t n = std::min(message.size(), sizeof(state.lastFaultMsg) - 1); | ||
| std::memcpy(state.lastFaultMsg, message.data(), n); | ||
| state.lastFaultMsg[n] = '\0'; | ||
| } |
| // Last-fault diagnostics (set when a guarded call throws; surfaced by the | ||
| // server). lastFaultMsg is written under the class thread and read by the | ||
| // server thread — a benign race for a diagnostic string. | ||
| std::atomic<int64_t> lastFaultMs{0}; ///< system_clock ms of last fault (0 = none) | ||
| std::atomic<uint8_t> lastFaultPhase{0}; ///< Phase value at fault | ||
| char lastFaultMsg[256] = {}; |
…t review) Fixes the macOS/Linux CI build failures on PR #13 and the Copilot review. Build breaks: - crash_handler.cpp: include <fcntl.h> for open()/O_* — not transitively available on macOS (clang/libc++), which broke the darwin build. - crash_handler.cpp: size the signal alt-stack with a fixed 64 KiB constant instead of SIGSTKSZ. Since glibc 2.34, SIGSTKSZ expands to a sysconf() call, not a constant, so it can't size a static array (the linux-x64 build error). Correctness (Copilot review): - crash_handler.cpp: sWrite() computed length with std::strlen, which is not async-signal-safe; use a plain loop, restoring the handler's signal-safety. - fault_report.cpp / fault_store.cpp: add <string_view> / <map> (were relying on transitive includes). - lastFaultMsg data race: recordModuleFault() now writes the last-fault fields BEFORE storing `faulted` with memory_order_release; the server acquire-loads `faulted` and reads the fields only when set. A module faults at most once, so the buffer is effectively write-once and safe under this release/acquire pair. Updated the scheduler.h comment to document the protocol (not a "benign race"). Windows build + diag tests green; POSIX paths covered by CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| FaultReport r; | ||
| r.id = ev.moduleId + "-" + std::to_string(nowMs) + "-" + | ||
| std::to_string(seq_.fetch_add(1, std::memory_order_relaxed)); |
| std::lock_guard lock(mx_); | ||
| auto path = crashDir_ / (report.id + ".json"); | ||
| std::ofstream f(path, std::ios::binary | std::ios::trunc); | ||
| if (f) f << json; | ||
| entries_.push_back({summarize(report.id, json), std::move(json)}); |
| statePtr->longRunningThread = std::thread([this, &mod, statePtr]() { | ||
| spdlog::info("Long-running task started for '{}'", mod.id); | ||
| while (statePtr->running.load()) { | ||
| mod.instance->longRunning(); | ||
| bool ok = diag::guard(diag::Phase::LongRunning, mod.id.c_str(), mod.className.c_str(), | ||
| [&]{ mod.instance->longRunning(); }, | ||
| [&](const diag::FaultInfo& f) { | ||
| recordModuleFault(*statePtr, mod, f.phase, f.message); | ||
| }); |
| auto t0 = std::chrono::steady_clock::now(); | ||
| mod.instance->cyclicGuarded(); | ||
| if (!state.faulted.load()) { | ||
| diag::guard(diag::Phase::Cyclic, mod.id.c_str(), mod.className.c_str(), | ||
| [&]{ mod.instance->cyclicGuarded(); }, | ||
| [&](const diag::FaultInfo& f){ |
| void handler(int sig, siginfo_t*, void*) { | ||
| if (g_reporting.test_and_set()) { signal(sig, SIG_DFL); raise(sig); _exit(128 + sig); } | ||
| int fd = ::open(g_reportPath, O_WRONLY | O_CREAT | O_TRUNC, 0644); | ||
| if (fd >= 0) { | ||
| const Breadcrumb& b = tlsBreadcrumb; | ||
| sWrite(fd, "=== Loom crash report ===\nsignal: "); | ||
| sWriteHex(fd, (uintptr_t)sig); | ||
| sWrite(fd, "\nmodule: "); sWrite(fd, b.moduleId ? b.moduleId : "(none/runtime)"); | ||
| sWrite(fd, " class: "); sWrite(fd, b.className ? b.className : "(none)"); | ||
| sWrite(fd, " phase: "); sWrite(fd, phaseName(b.phase)); | ||
| sWrite(fd, "\n"); sWrite(fd, g_buildId); sWrite(fd, "\n"); | ||
| sWrite(fd, "frames (raw addresses — symbolize offline):\n"); | ||
| void* frames[64]; | ||
| int n = backtrace(frames, 64); | ||
| for (int i = 0; i < n; ++i) { sWrite(fd, " "); sWriteHex(fd, (uintptr_t)frames[i]); sWrite(fd, "\n"); } | ||
| ::close(fd); | ||
| } | ||
| signal(sig, SIG_DFL); | ||
| raise(sig); // re-raise for a core dump with default disposition | ||
| _exit(128 + sig); | ||
| } |
… traversal) Five issues from Copilot's second pass: - scheduler stop(): the cyclic/long-running threads hold a raw TaskState*, but stop() erased the TaskState from tasks_ BEFORE joining the threads (join is outside the lock) → use-after-free, made more reachable now the long-running thread calls recordModuleFault on a throw. Extract the owning unique_ptr into a local that outlives the join so the TaskState stays valid until the threads exit. - isolatedLoop(): a quarantined (faulted) module still ran I/O mappings and oscilloscope sampling each tick. Gate those behind !faulted, matching classLoop which skips faulted members entirely. - crash_handler POSIX: the fatal-signal handler called signal()+raise(), and signal() is not async-signal-safe. SA_RESETHAND already resets the disposition to SIG_DFL on entry, so re-raise with kill(getpid(), sig) (async-signal-safe) and drop the signal() call (both the report path and the re-entrant path). - runtime_fault_sink: the report id (→ <crashDir>/<id>.json) embedded the module id, which comes from the HTTP instantiate body — a '/' or '..' could escape the crash dir. Sanitize the module portion to a filename-safe set. - FaultStore::record(): reported success even when the JSON file failed to open. Detect the write failure and log a warning; keep the in-memory entry so a real fault still shows in /api/faults this run (just not persisted across restart). 114 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks Copilot — addressed all 5 findings from the re-review in a45748b:
The earlier 7 comments (fcntl/SIGSTKSZ build breaks, strlen signal-safety, missing includes, lastFaultMsg race) were fixed in f995360; macOS + Linux + Windows builds are green there. 114 tests pass. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 33 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
runtime/src/scheduler.cpp:852
- In the class loop, the return value of diag::guard() is ignored. If cyclicGuarded() throws, recordModuleFault() quarantines the module, but this tick will still run oscilloscope_->sampleModule and update timing/cycle counters for the faulting module. That contradicts the intent of skipping faulted modules and may touch module state after an exception.
// Execute (cyclicGuarded acquires module's runtimeMutex_ so
// server/watch threads can't race on runtime_ reads)
diag::guard(diag::Phase::Cyclic, member.moduleId.c_str(), member.mod->className.c_str(),
[&]{ member.mod->instance->cyclicGuarded(); },
[&](const diag::FaultInfo& f){ faultMember(member, f); });
// Lightweight sampling: use oscilloscope fast-path.
// A member in runner.members is guaranteed alive — removeMember() pauses
// the class and waits for ack BEFORE erasing, so no moduleMutex_ needed.
// We try_lock the module's runtimeMutex_ so we never block the caller;
// missing one sample tick is acceptable.
if (oscilloscope_ && dataEngine_) {
int64_t nowMs = static_cast<int64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch()).count());
oscilloscope_->sampleModule(member.moduleId, *dataEngine_,
*member.mod->instance, nowMs);
}
runtime/src/scheduler.cpp:1017
- In isolatedLoop(), after a module is quarantined (state.faulted becomes true), the thread continues ticking forever and still records lastCycleTimeUs/cycleCount/history samples even though no module code runs. This is inconsistent with the class-based loop (which stops incrementing per-module counters once faulted) and can mislead the UI into thinking the module is still executing.
auto t0 = std::chrono::steady_clock::now();
if (!state.faulted.load()) {
diag::guard(diag::Phase::Cyclic, mod.id.c_str(), mod.className.c_str(),
[&]{ mod.instance->cyclicGuarded(); },
[&](const diag::FaultInfo& f){
recordModuleFault(state, mod, f.phase, f.message);
});
}
auto t1 = std::chrono::steady_clock::now();
// Once quarantined, skip I/O mappings + sampling too (the class loop skips
// faulted members entirely) — don't keep touching a module in an error state.
if (!state.faulted.load()) {
// Execute I/O mappings for this isolated module
if (ioMapper_) {
ioMapper_->executeForModule(mod.id);
}
// Sample this isolated module using oscilloscope fast-path.
// The isolated thread owns this module exclusively; state.running guards lifetime.
if (oscilloscope_ && dataEngine_) {
int64_t nowMs = static_cast<int64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch()).count());
oscilloscope_->sampleModule(mod.id, *dataEngine_, *mod.instance, nowMs);
}
}
auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
state.lastCycleTimeUs.store(elapsed.count());
state.cycleCount.fetch_add(1);
| const std::size_t n = std::min(message.size(), sizeof(state.lastFaultMsg) - 1); | ||
| std::memcpy(state.lastFaultMsg, message.data(), n); | ||
| state.lastFaultMsg[n] = '\0'; | ||
| } | ||
| mod.state = ModuleState::Error; | ||
| state.faulted.store(true, std::memory_order_release); // publish last |
| if (static_cast<unsigned char>(c) < 0x20) { | ||
| char buf[8]; | ||
| std::snprintf(buf, sizeof buf, "\\u%04x", c); | ||
| out += buf; | ||
| } else { |
| sWrite(fd, "frames (raw addresses — symbolize offline):\n"); | ||
| void* frames[64]; | ||
| int n = backtrace(frames, 64); | ||
| for (int i = 0; i < n; ++i) { sWrite(fd, " "); sWriteHex(fd, (uintptr_t)frames[i]); sWrite(fd, "\n"); } | ||
| ::close(fd); |
Add crash diagnostics. Add unhandled exception handlers and create stack and data dumps.