Skip to content

Crash diagnostics#13

Merged
Joshpolansky merged 7 commits into
developfrom
feat/crash-diagnostics
Jun 19, 2026
Merged

Crash diagnostics#13
Joshpolansky merged 7 commits into
developfrom
feat/crash-diagnostics

Conversation

@Joshpolansky

Copy link
Copy Markdown
Owner

Add crash diagnostics. Add unhandled exception handlers and create stack and data dumps.

Joshpolansky and others added 5 commits June 19, 2026 12:32
Adds a modular loom::diag subsystem (runtime-only; SDK stays clean):
- breadcrumb.h: per-thread {module,class,phase} set via BreadcrumbScope (stable pointers, signal-safe to read).
- guard.h: diag::guard() wraps module entry points in a breadcrumb + try/catch; a thrown exception is reported and contained instead of terminating the worker thread.
- crash_handler.{h,cpp}: process-global fatal-signal (POSIX sigaction+sigaltstack) / SetUnhandledExceptionFilter (Windows) / std::set_terminate, writing a text crash report (faulting module/phase + signal + build id + raw stack addresses) to <dataDir>/crash before exit. Covers module AND runtime crashes.
- scheduler.cpp: guards preCyclic/cyclic/postCyclic/longRunning + isolated cyclic; on fault sets TaskState::faulted + ModuleState::Error (wiring the previously-dead infra) so the runtime keeps running and skips the faulted module. init() gets a breadcrumb.
- bus.h: dependency-free try/catch around service dispatch (returns a CallResult error; no spdlog/diag dep).
- version.h.in + sdk/CMakeLists.txt: optional git SHA stamp, degrades to "unknown" when git/.git absent (build never fails).
- run.cpp installs the handler; runtime CMake stamps LOOM_BUILD_TYPE.
Unit tests for breadcrumb scope + guard exception path. loom_tests: 112 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Config-selectable fault (throw|segfault|fpe|abort|loop) at init or on the Nth cyclic tick. Used to exercise the guard (exception isolation) and crash handler (signal/SEH report) end to end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2 of crash diagnostics. The Windows unhandled-exception filter and
terminate handler now resolve raw return addresses to function + file:line
at crash time, so a dev build's crash report is readable with no offline
step and no tools installed.

- runtime/diag/symbolizer.{h,cpp}: ISymbolizer-style free function backed by
  cpptrace, isolated to one TU. cpptrace reads PDB/DWARF directly, so it works
  despite modules' hidden visibility and needs no -rdynamic. Never called from
  the POSIX signal path (it allocates) - only the Windows UEF, terminate
  handler, and (later) the offline --symbolize tool.
- crash_handler.cpp (Windows): symbolize captured frames in the UEF and write
  "#i 0x.. symbol / at file:line" per frame.
- cpptrace/0.8.3 wired runtime-only (root + runtime conanfile, find_package,
  PRIVATE link). SDK stays glaze-only.

Verified: crasher segfault report pins the fault to doFault @ crasher.cpp:30
<- Crasher::cyclic() @ crasher.cpp:47 through the scheduler guard, fully
symbolized in-process.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Task 4 of crash diagnostics. Faults are now first-class structured records,
not just log lines, and are queryable over HTTP and the bus.

New loom::diag components (modular; diag stays free of DataEngine/Bus deps):
- fault_report.{h,cpp}: FaultReport model + hand-rolled JSON (sections embed as
  real nested JSON, build identity + breadcrumb + symbolized frames).
- fault_sink.h: IFaultSink interface the scheduler depends on (injected).
- fault_store.{h,cpp}: thread-safe in-memory ring + <dataDir>/crash/*.json
  persistence; scans the dir on construction so prior-run (signal-path) crashes
  surface too.
- runtime_fault_sink.{h,cpp}: concrete sink (runtime layer) — on a module-call
  exception it captures the module's live data sections, persists a report, and
  publishes it on the loom/faults bus topic.

Wiring:
- scheduler: single recordModuleFault() helper routes all four guarded fault
  sites (init/cyclic/postCyclic/longRunning/isolated), stamps TaskState
  last-fault fields, and notifies the injected sink. setFaultSink() hook.
- RuntimeCore owns the FaultStore + sink and attaches them to the scheduler.
- server: GET /api/faults (list, newest-first) + GET /api/faults/<id> (full
  report); module info now exposes faulted + lastFault{Ms,Phase,Msg}.
- crash_handler (Windows): the UEF/terminate path now writes a structured,
  symbolized JSON report (off-signal-constrained) so SEH/segfault crashes also
  appear in /api/faults on the next start. POSIX keeps the raw text report.

Verified: throw in cyclic -> module Error + runtime survives + report with live
sections (runtime.cycle captured) at /api/faults; segfault -> symbolized JSON
report, picked up by the store on next start. Unit tests for toJson round-trip
and FaultStore record/list/detail/persistence. 114 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 19, 2026 20:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a cross-cutting crash diagnostics subsystem to Loom’s runtime: thread-local execution breadcrumbs, exception guarding/quarantining for module entrypoints, persistent fault report storage, and HTTP APIs to surface faults to the UI. It also stamps build identity (SDK version + git SHA + build type) into reports and introduces a dedicated “crasher” module plus unit tests to exercise the diagnostics pipeline.

Changes:

  • Introduce loom::diag components: breadcrumb tracking, guarded calls + fault sink, persistent FaultStore, structured FaultReport, and a process-global CrashHandler.
  • Surface fault status and reports via runtime server JSON (/api/modules stats additions, plus /api/faults and /api/faults/:id).
  • Add build/runtime plumbing: git SHA stamping in SDK, cpptrace dependency for symbolization, and a modules/crasher test module + new unit tests.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/test_diag.cpp New unit tests for breadcrumbs, guard(), FaultReport JSON, and FaultStore persistence.
tests/CMakeLists.txt Adds diag test and links required runtime diag sources into the test binary.
sdk/include/loom/version.h.in Adds kGitSha build identifier for crash reports.
sdk/include/loom/bus.h Prevents exceptions from service handlers unwinding into callers by catching and returning an error.
sdk/CMakeLists.txt Stamps git SHA (and “dirty” state) into generated SDK version header without failing builds when git metadata is absent.
runtime/src/server.cpp Adds module last-fault fields to module JSON and introduces /api/faults endpoints.
runtime/src/scheduler.cpp Wraps module lifecycle entrypoints with diag guard and records/report faults (quarantine + sink notification).
runtime/src/runtime_core.cpp Instantiates FaultStore and wires a RuntimeFaultSink into the scheduler.
runtime/src/run.cpp Installs process-global crash handler writing reports to <dataDir>/crash.
runtime/src/diag/symbolizer.cpp Implements address→symbol resolution via cpptrace.
runtime/src/diag/runtime_fault_sink.cpp Builds/persists/publishes structured fault reports when a module call throws.
runtime/src/diag/fault_store.cpp Implements persistent fault report registry loaded from <dataDir>/crash.
runtime/src/diag/fault_report.cpp Implements structured report JSON serialization with embedded JSON sections.
runtime/src/diag/crash_handler.cpp Implements POSIX/Windows crash capture for fatal signals/SEH and unhandled exceptions.
runtime/src/diag/breadcrumb.cpp Defines the thread_local breadcrumb storage.
runtime/include/loom/scheduler.h Extends task state with last-fault diagnostics fields and adds fault sink plumbing.
runtime/include/loom/runtime_core.h Exposes faultStore() and stores FaultStore + injected sink.
runtime/include/loom/diag/symbolizer.h Public symbolizer API + SymFrame definition (cpptrace kept out of headers).
runtime/include/loom/diag/runtime_fault_sink.h Declares runtime-layer implementation of IFaultSink.
runtime/include/loom/diag/guard.h Adds exception guard wrapper for module calls with breadcrumb attribution.
runtime/include/loom/diag/fault_store.h Declares persistent fault report store API used by the server.
runtime/include/loom/diag/fault_sink.h Declares the scheduler→runtime fault sink interface and event payload.
runtime/include/loom/diag/fault_report.h Declares structured fault report model and JSON serialization API.
runtime/include/loom/diag/crash_handler.h Declares process-global crash handler install API.
runtime/include/loom/diag/breadcrumb.h Declares breadcrumb model, phases, and RAII scope helper.
runtime/conanfile.py Adds cpptrace dependency to runtime package.
runtime/CMakeLists.txt Adds diag sources, defines build-type macro, and links cpptrace privately.
modules/crasher/crasher.cpp Adds a deliberately-faulting module to exercise crash diagnostics paths.
modules/crasher/CMakeLists.txt Builds and installs the new crasher module plugin.
modules/CMakeLists.txt Includes the new crasher module in the build.
conanfile.py Adds cpptrace to top-level Conan deps (runtime-only usage).
CMakeLists.txt Adds find_package(cpptrace REQUIRED) at the project level.
.gitignore Ignores _crashtest/ output directory.

Comment on lines +120 to +124
#include <csignal>
#include <cstdlib>
#include <execinfo.h>
#include <unistd.h>

Comment on lines +128 to +131
// async-signal-safe helpers --------------------------------------------------
void sWrite(int fd, const char* s) { ssize_t r = ::write(fd, s, std::strlen(s)); (void)r; }
void sWriteHex(int fd, uintptr_t v) {
char buf[2 + sizeof(uintptr_t) * 2]; buf[0] = '0'; buf[1] = 'x';
Comment on lines +1 to +5
#include "loom/diag/fault_report.h"

#include <cstdio>
#include <string>

Comment thread runtime/src/diag/fault_store.cpp
Comment thread runtime/src/server.cpp Outdated
Comment on lines +278 to +283
// Last-fault diagnostics (faulted modules are skipped by the scheduler).
json += ",\"faulted\":" + std::string(ts->faulted.load() ? "true" : "false");
json += ",\"lastFaultMs\":" + std::to_string(ts->lastFaultMs.load());
json += ",\"lastFaultPhase\":\"" +
std::string(diag::phaseName(static_cast<diag::Phase>(ts->lastFaultPhase.load()))) + "\"";
json += ",\"lastFaultMsg\":\"" + jsonEscapeString(ts->lastFaultMsg) + "\"";
Comment thread runtime/src/scheduler.cpp
Comment on lines +171 to +185
void Scheduler::recordModuleFault(TaskState& state, LoadedModule& mod,
diag::Phase phase, std::string_view message) {
state.faulted.store(true);
mod.state = ModuleState::Error;

const int64_t nowMs = static_cast<int64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch()).count());
state.lastFaultMs.store(nowMs);
state.lastFaultPhase.store(static_cast<uint8_t>(phase));
{
const std::size_t n = std::min(message.size(), sizeof(state.lastFaultMsg) - 1);
std::memcpy(state.lastFaultMsg, message.data(), n);
state.lastFaultMsg[n] = '\0';
}
Comment thread runtime/include/loom/scheduler.h Outdated
Comment on lines +80 to +85
// Last-fault diagnostics (set when a guarded call throws; surfaced by the
// server). lastFaultMsg is written under the class thread and read by the
// server thread — a benign race for a diagnostic string.
std::atomic<int64_t> lastFaultMs{0}; ///< system_clock ms of last fault (0 = none)
std::atomic<uint8_t> lastFaultPhase{0}; ///< Phase value at fault
char lastFaultMsg[256] = {};
…t review)

Fixes the macOS/Linux CI build failures on PR #13 and the Copilot review.

Build breaks:
- crash_handler.cpp: include <fcntl.h> for open()/O_* — not transitively
  available on macOS (clang/libc++), which broke the darwin build.
- crash_handler.cpp: size the signal alt-stack with a fixed 64 KiB constant
  instead of SIGSTKSZ. Since glibc 2.34, SIGSTKSZ expands to a sysconf() call,
  not a constant, so it can't size a static array (the linux-x64 build error).

Correctness (Copilot review):
- crash_handler.cpp: sWrite() computed length with std::strlen, which is not
  async-signal-safe; use a plain loop, restoring the handler's signal-safety.
- fault_report.cpp / fault_store.cpp: add <string_view> / <map> (were relying
  on transitive includes).
- lastFaultMsg data race: recordModuleFault() now writes the last-fault fields
  BEFORE storing `faulted` with memory_order_release; the server acquire-loads
  `faulted` and reads the fields only when set. A module faults at most once, so
  the buffer is effectively write-once and safe under this release/acquire pair.
  Updated the scheduler.h comment to document the protocol (not a "benign race").

Windows build + diag tests green; POSIX paths covered by CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 5 comments.

Comment on lines +39 to +41
FaultReport r;
r.id = ev.moduleId + "-" + std::to_string(nowMs) + "-" +
std::to_string(seq_.fetch_add(1, std::memory_order_relaxed));
Comment on lines +103 to +107
std::lock_guard lock(mx_);
auto path = crashDir_ / (report.id + ".json");
std::ofstream f(path, std::ios::binary | std::ios::trunc);
if (f) f << json;
entries_.push_back({summarize(report.id, json), std::move(json)});
Comment thread runtime/src/scheduler.cpp
Comment on lines +283 to +290
statePtr->longRunningThread = std::thread([this, &mod, statePtr]() {
spdlog::info("Long-running task started for '{}'", mod.id);
while (statePtr->running.load()) {
mod.instance->longRunning();
bool ok = diag::guard(diag::Phase::LongRunning, mod.id.c_str(), mod.className.c_str(),
[&]{ mod.instance->longRunning(); },
[&](const diag::FaultInfo& f) {
recordModuleFault(*statePtr, mod, f.phase, f.message);
});
Comment thread runtime/src/scheduler.cpp
Comment on lines 976 to +980
auto t0 = std::chrono::steady_clock::now();
mod.instance->cyclicGuarded();
if (!state.faulted.load()) {
diag::guard(diag::Phase::Cyclic, mod.id.c_str(), mod.className.c_str(),
[&]{ mod.instance->cyclicGuarded(); },
[&](const diag::FaultInfo& f){
Comment on lines +150 to +170
void handler(int sig, siginfo_t*, void*) {
if (g_reporting.test_and_set()) { signal(sig, SIG_DFL); raise(sig); _exit(128 + sig); }
int fd = ::open(g_reportPath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd >= 0) {
const Breadcrumb& b = tlsBreadcrumb;
sWrite(fd, "=== Loom crash report ===\nsignal: ");
sWriteHex(fd, (uintptr_t)sig);
sWrite(fd, "\nmodule: "); sWrite(fd, b.moduleId ? b.moduleId : "(none/runtime)");
sWrite(fd, " class: "); sWrite(fd, b.className ? b.className : "(none)");
sWrite(fd, " phase: "); sWrite(fd, phaseName(b.phase));
sWrite(fd, "\n"); sWrite(fd, g_buildId); sWrite(fd, "\n");
sWrite(fd, "frames (raw addresses — symbolize offline):\n");
void* frames[64];
int n = backtrace(frames, 64);
for (int i = 0; i < n; ++i) { sWrite(fd, " "); sWriteHex(fd, (uintptr_t)frames[i]); sWrite(fd, "\n"); }
::close(fd);
}
signal(sig, SIG_DFL);
raise(sig); // re-raise for a core dump with default disposition
_exit(128 + sig);
}
… traversal)

Five issues from Copilot's second pass:

- scheduler stop(): the cyclic/long-running threads hold a raw TaskState*, but
  stop() erased the TaskState from tasks_ BEFORE joining the threads (join is
  outside the lock) → use-after-free, made more reachable now the long-running
  thread calls recordModuleFault on a throw. Extract the owning unique_ptr into a
  local that outlives the join so the TaskState stays valid until the threads exit.
- isolatedLoop(): a quarantined (faulted) module still ran I/O mappings and
  oscilloscope sampling each tick. Gate those behind !faulted, matching classLoop
  which skips faulted members entirely.
- crash_handler POSIX: the fatal-signal handler called signal()+raise(), and
  signal() is not async-signal-safe. SA_RESETHAND already resets the disposition
  to SIG_DFL on entry, so re-raise with kill(getpid(), sig) (async-signal-safe)
  and drop the signal() call (both the report path and the re-entrant path).
- runtime_fault_sink: the report id (→ <crashDir>/<id>.json) embedded the module
  id, which comes from the HTTP instantiate body — a '/' or '..' could escape the
  crash dir. Sanitize the module portion to a filename-safe set.
- FaultStore::record(): reported success even when the JSON file failed to open.
  Detect the write failure and log a warning; keep the in-memory entry so a real
  fault still shows in /api/faults this run (just not persisted across restart).

114 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Joshpolansky

Copy link
Copy Markdown
Owner Author

Thanks Copilot — addressed all 5 findings from the re-review in a45748b:

  1. scheduler.cpp long-running thread UAFstop() erased the TaskState before joining the threads (join is outside the lock). Now the owning unique_ptr is moved into a local that outlives the join, so the TaskState stays valid until the threads have exited. (stopAll() already joined before clearing.)
  2. scheduler.cpp isolatedLoop quarantine — I/O mappings + oscilloscope sampling now skip a faulted module too (gated behind !faulted), matching classLoop.
  3. crash_handler.cpp async-signal-safety — dropped the non-async-signal-safe signal() call (SA_RESETHAND already resets the disposition) and re-raise via kill(getpid(), sig) on both the report path and the re-entrant path.
  4. runtime_fault_sink.cpp path traversal — the report id (→ <crashDir>/<id>.json) now sanitizes the module portion to a filename-safe set, so a / or .. from the instantiate body can't escape the crash dir.
  5. fault_store.cpp record() silent persist failure — a failed file open is now detected and logged. I deliberately keep the in-memory entry so a fault that actually happened still appears in /api/faults for the session (the warning makes the non-persistence explicit) rather than dropping it entirely.

The earlier 7 comments (fcntl/SIGSTKSZ build breaks, strlen signal-safety, missing includes, lastFaultMsg race) were fixed in f995360; macOS + Linux + Windows builds are green there. 114 tests pass.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

runtime/src/scheduler.cpp:852

  • In the class loop, the return value of diag::guard() is ignored. If cyclicGuarded() throws, recordModuleFault() quarantines the module, but this tick will still run oscilloscope_->sampleModule and update timing/cycle counters for the faulting module. That contradicts the intent of skipping faulted modules and may touch module state after an exception.
            // Execute (cyclicGuarded acquires module's runtimeMutex_ so
            // server/watch threads can't race on runtime_ reads)
            diag::guard(diag::Phase::Cyclic, member.moduleId.c_str(), member.mod->className.c_str(),
                        [&]{ member.mod->instance->cyclicGuarded(); },
                        [&](const diag::FaultInfo& f){ faultMember(member, f); });

            // Lightweight sampling: use oscilloscope fast-path.
            // A member in runner.members is guaranteed alive — removeMember() pauses
            // the class and waits for ack BEFORE erasing, so no moduleMutex_ needed.
            // We try_lock the module's runtimeMutex_ so we never block the caller;
            // missing one sample tick is acceptable.
            if (oscilloscope_ && dataEngine_) {
                int64_t nowMs = static_cast<int64_t>(
                    std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count());
                oscilloscope_->sampleModule(member.moduleId, *dataEngine_,
                                            *member.mod->instance, nowMs);
            }

runtime/src/scheduler.cpp:1017

  • In isolatedLoop(), after a module is quarantined (state.faulted becomes true), the thread continues ticking forever and still records lastCycleTimeUs/cycleCount/history samples even though no module code runs. This is inconsistent with the class-based loop (which stops incrementing per-module counters once faulted) and can mislead the UI into thinking the module is still executing.
        auto t0 = std::chrono::steady_clock::now();
        if (!state.faulted.load()) {
            diag::guard(diag::Phase::Cyclic, mod.id.c_str(), mod.className.c_str(),
                        [&]{ mod.instance->cyclicGuarded(); },
                        [&](const diag::FaultInfo& f){
                            recordModuleFault(state, mod, f.phase, f.message);
                        });
        }
        auto t1 = std::chrono::steady_clock::now();

        // Once quarantined, skip I/O mappings + sampling too (the class loop skips
        // faulted members entirely) — don't keep touching a module in an error state.
        if (!state.faulted.load()) {
            // Execute I/O mappings for this isolated module
            if (ioMapper_) {
                ioMapper_->executeForModule(mod.id);
            }

            // Sample this isolated module using oscilloscope fast-path.
            // The isolated thread owns this module exclusively; state.running guards lifetime.
            if (oscilloscope_ && dataEngine_) {
                int64_t nowMs = static_cast<int64_t>(
                    std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count());
                oscilloscope_->sampleModule(mod.id, *dataEngine_, *mod.instance, nowMs);
            }
        }

        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
        state.lastCycleTimeUs.store(elapsed.count());
        state.cycleCount.fetch_add(1);

Comment thread runtime/src/scheduler.cpp
Comment on lines +183 to +188
const std::size_t n = std::min(message.size(), sizeof(state.lastFaultMsg) - 1);
std::memcpy(state.lastFaultMsg, message.data(), n);
state.lastFaultMsg[n] = '\0';
}
mod.state = ModuleState::Error;
state.faulted.store(true, std::memory_order_release); // publish last
Comment on lines +31 to +35
if (static_cast<unsigned char>(c) < 0x20) {
char buf[8];
std::snprintf(buf, sizeof buf, "\\u%04x", c);
out += buf;
} else {
Comment on lines +164 to +168
sWrite(fd, "frames (raw addresses — symbolize offline):\n");
void* frames[64];
int n = backtrace(frames, 64);
for (int i = 0; i < n; ++i) { sWrite(fd, " "); sWriteHex(fd, (uintptr_t)frames[i]); sWrite(fd, "\n"); }
::close(fd);
@Joshpolansky Joshpolansky merged commit 53c6f77 into develop Jun 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants