⚗️ [RUM-15520] prototype child process monitoring #96
Draft
bcaudan wants to merge 37 commits into
Research documents for evaluating how the Electron SDK could monitor child processes. Includes priority matrix across 6 mechanisms and stub for prototype findings.
All tests pass. render-process-gone works for abnormal terminations only (no noise on normal close). child-process-gone does not cover Node.js child_process. getAppMetrics provides rich data with zero instrumentation overhead.
Key findings: __importStar blocks patching (must use require()), exec→execFile internal chain causes dedup issues, diagnostics_channel not available for child_process in Node 22. Documents open questions for bundler testing and production patching approach.
All scenarios work. Fork wrapper, MessagePort telemetry channel, parentPort error forwarding, and child-process-gone are all validated. Key finding: a crash and exit(1) produce the same child-process-gone reason, so parentPort error forwarding is the only way to distinguish them.
High-level overview of how each instrumentation works:
- spawn: monkey-patch via Object.defineProperty + require()
- lifecycle: pure event listeners + getAppMetrics polling
- utilityProcess: fork wrapper + 3 data flow paths (child-process-gone, parentPort __dd messages, dedicated MessagePort channel)
parentPort __dd messages leak into customer message handlers — suitable for one-shot crash-time error forwarding but not periodic telemetry. Dedicated MessagePort channel recommended for ongoing telemetry.
Map prototype datapoints to RUM events:
- Processes as views (utility, renderer, GPU)
- spawn/exec as native resources
- Lifecycle events as errors/actions
- Container hierarchy for renderer views
- Schema fit assessment and product brief alignment

- Playwright fixtures (intake, electronApp, window) adapted from e2e
- Smoke test: app launches, view event arrives at mock intake
- flushTransport IPC for test assertions
- Hidden window in test mode (DD_TEST_MODE)

- ChildProcessCollection monkey-patches spawn, exec, execFile (+ sync variants)
- Resources emitted with type:native, url:child_process://{command}, duration, status_code
- Error cases (ENOENT, timeout) captured with status_code:-1 and error context
- Filters SDK self-instrumentation (sw_vers)
- Uses require() instead of import to bypass Rollup namespace wrapper
- Playground buttons + scenarios for spawn-ls, exec-echo, spawn-fail, exec-timeout
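The patching approach in this commit can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: `patchMethod`, `NativeResource`, and `fakeChildProcess` are hypothetical names, and the stand-in object replaces the module that the prototype obtains with `require()`. Only the sync path is shown, with `status_code: -1` on failure as described above.

```typescript
// Hypothetical shape of a raw resource; field names follow the commit notes.
interface NativeResource {
  type: 'native';
  url: string;
  duration: number;
  status_code: number;
}

type AnyFn = (...args: any[]) => any;

// Replace target[name] with a wrapper via Object.defineProperty and
// return a function that restores the original.
function patchMethod(
  target: Record<string, unknown>,
  name: string,
  wrap: (original: AnyFn) => AnyFn
): () => void {
  const original = target[name];
  if (typeof original !== 'function') {
    throw new TypeError(`${name} is not a function`);
  }
  const define = (value: unknown): void => {
    Object.defineProperty(target, name, {
      value,
      configurable: true,
      writable: true,
      enumerable: true,
    });
  };
  define(wrap(original as AnyFn));
  return () => define(original);
}

// Stand-in for the real module object (assumption: the real target would be
// the object returned by require('child_process')).
const fakeChildProcess: Record<string, unknown> = {
  execSync: (command: string): string => {
    if (command === 'missing') throw new Error('ENOENT');
    return `ran ${command}`;
  },
};

const resources: NativeResource[] = [];

const restore = patchMethod(fakeChildProcess, 'execSync', (original) =>
  function (this: unknown, command: string, ...rest: unknown[]) {
    const start = Date.now();
    let statusCode = 0;
    try {
      return original.call(this, command, ...rest);
    } catch (error) {
      statusCode = -1; // error cases are reported with status_code -1
      throw error;
    } finally {
      resources.push({
        type: 'native',
        url: `child_process://${command}`,
        duration: Date.now() - start,
        status_code: statusCode,
      });
    }
  }
);
```

Calling `restore()` puts the original function back, which is convenient in tests that need an unpatched module.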
- UtilityProcessCollection monkey-patches utilityProcess.fork()
- Fork starts a new view ("Utility: {serviceName}")
- Clean exit (code=0) emits action, abnormal exit emits error
- Listens to app 'child-process-gone' for crash enrichment
- New RawRumAction type for process lifecycle actions
- Demo worker + playground buttons (fork, send-message, crash)
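As a rough sketch of the exit handling listed above, the mapping from an exit code to an action or an error could look like this. The event shapes and the names `classifyExit` and `utilityViewName` are assumptions for illustration; the optional `goneReason` stands for the `reason` reported by the app's 'child-process-gone' event.

```typescript
// Hypothetical raw event shapes; the prototype's RawRumAction may differ.
type ProcessLifecycleEvent =
  | { kind: 'action'; viewName: string; name: string }
  | { kind: 'error'; viewName: string; message: string; exitCode: number };

// View name convention from the commit notes: "Utility: {serviceName}".
function utilityViewName(serviceName: string): string {
  return `Utility: ${serviceName}`;
}

// Clean exit (code 0) becomes an action, anything else becomes an error.
function classifyExit(
  serviceName: string,
  exitCode: number,
  goneReason?: string
): ProcessLifecycleEvent {
  const viewName = utilityViewName(serviceName);
  if (exitCode === 0) {
    return { kind: 'action', viewName, name: 'process_exit' };
  }
  const suffix = goneReason ? ` (${goneReason})` : '';
  return {
    kind: 'error',
    viewName,
    message: `Utility process exited with code ${exitCode}${suffix}`,
    exitCode,
  };
}
```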
- Poll app.getAppMetrics() every 2s, match by pid
- Attach memory_average/memory_max to view context
- Playground test asserts memory fields appear after poll

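The aggregation behind memory_average and memory_max can be sketched as a small per-pid accumulator. This is an illustrative sketch only: `MemoryTracker` and `ProcessSample` are hypothetical names, and the sample shape assumes each poll yields a pid plus a working set size in kilobytes, as Electron's process metrics report it.

```typescript
// Hypothetical sample shape: one entry per process per poll.
interface ProcessSample {
  pid: number;
  workingSetSizeKb: number;
}

interface MemoryStats {
  memory_average: number;
  memory_max: number;
}

class MemoryTracker {
  private readonly stats = new Map<number, { sum: number; count: number; max: number }>();

  // Fold one poll result into the running totals, matching by pid.
  record(samples: ProcessSample[]): void {
    for (const sample of samples) {
      const entry = this.stats.get(sample.pid) ?? { sum: 0, count: 0, max: 0 };
      entry.sum += sample.workingSetSizeKb;
      entry.count += 1;
      entry.max = Math.max(entry.max, sample.workingSetSizeKb);
      this.stats.set(sample.pid, entry);
    }
  }

  // Returns undefined until at least one sample was seen for the pid.
  get(pid: number): MemoryStats | undefined {
    const entry = this.stats.get(pid);
    if (!entry) return undefined;
    return { memory_average: entry.sum / entry.count, memory_max: entry.max };
  }
}
```

In the app, `record` would presumably be fed from a 2-second interval that maps `app.getAppMetrics()` entries to samples; that wiring is left out so the sketch runs outside Electron.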
- RendererProcessCollection detects renderers via webContents polling
- Creates views per renderer process with memory metrics
- render-process-gone emits error on renderer view
- webContents.id→pid mapping survives process death
- Assembly sets container.view.id to renderer process view via senderPid
- BridgeHandler extracts sender pid from IPC events

- Reentrant flag prevents exec→execFile→spawn chain from emitting multiple times
- One-shot guard on spawn prevents both close and error from emitting
- Tests now assert exactly 1 resource per command
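The two guards above can be illustrated with a short sketch. It assumes the exec→execFile→spawn chain runs synchronously inside the outermost call, which is what makes a simple module-level flag sufficient; `instrument`, `once`, and the `fake*` functions are hypothetical stand-ins, not the prototype's code.

```typescript
// True while an instrumented call is on the stack (synchronous chain only).
let insideInstrumentedCall = false;

// Wrap fn so that only the outermost instrumented call emits.
function instrument<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
  emit: (name: string) => void
): (...args: A) => R {
  return (...args: A): R => {
    if (insideInstrumentedCall) {
      return fn(...args); // nested call: run it, but stay silent
    }
    insideInstrumentedCall = true;
    try {
      return fn(...args);
    } finally {
      insideInstrumentedCall = false;
      emit(name);
    }
  };
}

// One-shot guard: the wrapped callback runs at most once, so a process that
// fires both 'error' and 'close' produces a single resource.
function once<A extends unknown[]>(fn: (...args: A) => void): (...args: A) => void {
  let called = false;
  return (...args: A): void => {
    if (called) return;
    called = true;
    fn(...args);
  };
}

// Stand-ins that mimic the internal delegation chain.
const emitted: string[] = [];
const record = (name: string): void => {
  emitted.push(name);
};
const fakeSpawn = instrument('spawn', (command: string) => command, record);
const fakeExecFile = instrument('execFile', (command: string) => fakeSpawn(command), record);
const fakeExec = instrument('exec', (command: string) => fakeExecFile(command), record);
```

With this setup, `fakeExec('echo')` records only `'exec'`, while a direct `fakeSpawn('ls')` still records `'spawn'`.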
commonContext hook set date to Date.now() at assembly time. Since ViewCollection did not set date on the raw event, every view update got the assembly timestamp instead of the view creation time.
- Use did-start-navigation instead of did-finish-load for early detection
- Lazy getOrCreateRendererViewId backdates view to first event date
- Guarantees renderer view exists before first browser-rum event
- Test asserts renderer view date ≤ browser-rum view date
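The lazy creation and backdating described above can be sketched as a pid-keyed registry. This is a simplified illustration under assumptions: `RendererViews` and the generated ids are hypothetical, and the real getOrCreateRendererViewId may take different inputs.

```typescript
interface RendererView {
  id: string;
  pid: number;
  startTime: number; // epoch milliseconds
}

class RendererViews {
  private readonly views = new Map<number, RendererView>();
  private nextId = 1;

  // Return the view for this pid, creating it on first use. If an event
  // older than the current start time shows up, move the start time back so
  // the view never starts after an event attached to it.
  getOrCreate(pid: number, eventDate: number): RendererView {
    let view = this.views.get(pid);
    if (!view) {
      view = { id: `renderer-${this.nextId++}`, pid, startTime: eventDate };
      this.views.set(pid, view);
    } else if (eventDate < view.startTime) {
      view.startTime = eventDate;
    }
    return view;
  }
}
```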
- Distinct serviceName per utility button (dd-demo-fork, dd-demo-message, dd-demo-crash-worker)
- Fix send-message handler using once instead of on for multi-message flow
- Add copy session ID button
- Add crash renderer button and section

- Use view.memory_average and view.memory_max instead of context
- These are existing RUM view fields (used by mobile SDKs)
- Document CPU metrics gap: Electron provides seconds, schema expects ticks

- Replace app.getAppPath() with [APP_PATH] in serialized events
- Resource URLs now use spawn://, exec://, execFile:// etc. instead of child_process://
- Document path sanitization approaches in findings

Keeps "Renderer: {title}" prefix for consistency with "Utility: {name}".
Pid remains in context only, allowing views to group by page title.
- Listen to page-title-updated to update view name from "unknown" to actual page title
- Backdate renderer view startTime when earlier bridge events arrive
- Test asserts view name contains page title and URL is sanitized

[APP_PATH] broke URL display in Datadog UI. Now strips the path entirely so file:///full/path/to/dist/index.html becomes file:///index.html. Removes debug logging from bridge and batch producer.
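A minimal sketch of this stripping strategy, assuming the app path is known as a plain string (in Electron it would presumably come from `app.getAppPath()`). `stripAppPath` is a hypothetical helper, and Windows-style paths, where the URL form differs from the filesystem form, are not handled.

```typescript
// Remove every occurrence of the app path from a serialized value.
function stripAppPath(value: string, appPath: string): string {
  // Drop trailing separators so "/a/b/" and "/a/b" behave the same.
  const normalized = appPath.replace(/[\\/]+$/, '');
  if (normalized === '') {
    return value; // nothing sensible to strip
  }
  return value.split(normalized).join('');
}
```

For example, with an app path of `/full/path/to/dist`, the value `file:///full/path/to/dist/index.html` is reduced to `file:///index.html`.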
- [APP_PATH] placeholder breaks Datadog UI URL display
- app.getAppPath() includes dist/ directory
- Different strategies needed per field type

- Move webContents pid mapping to renderer tracking section
- Clarify memory/CPU field display limitations
- Clarify path sanitization UI behavior uncertainty
- Align polling model with mobile SDK approach
- Remove leftover console.log from renderer

Synthesizes findings from docs 00-04 into actionable work items organized by topic with value/complexity ratings, sequencing, and dd-trace integration evaluation as a decision gate.

- include view.name in utility process action/error events so assembly hooks don't override it with "main process"
- filter ViewCollection counters by view.id to avoid counting utility process events in the main view

- parentPort piggyback + child.emit interception for error forwarding
- customer imports '@datadog/electron-sdk/utility' in worker entry
- rollup entry with strip-electron-require plugin
- playground demo + E2E test
- detailed findings doc (execArgv, NODE_OPTIONS, parentPort buffer, emit vs on/once)
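The emit interception idea can be sketched as follows: wrap the child's `emit` so that SDK-tagged messages are diverted before any customer 'message' listener sees them. This is a guess at the general shape, not the prototype's code. `TinyEmitter` is a stand-in for the real child handle, and the `{ __dd: ... }` envelope is an assumption based on the `__dd` messages mentioned earlier.

```typescript
type Listener = (...args: unknown[]) => void;

// Minimal stand-in for an EventEmitter-like child process handle.
class TinyEmitter {
  private readonly listeners = new Map<string, Listener[]>();

  on(event: string, listener: Listener): this {
    const registered = this.listeners.get(event) ?? [];
    registered.push(listener);
    this.listeners.set(event, registered);
    return this;
  }

  emit(event: string, ...args: unknown[]): boolean {
    const registered = this.listeners.get(event) ?? [];
    for (const listener of registered) listener(...args);
    return registered.length > 0;
  }
}

// Assumed envelope for SDK traffic piggybacked on the message channel.
function isSdkMessage(value: unknown): value is { __dd: unknown } {
  return typeof value === 'object' && value !== null && '__dd' in value;
}

// Patch emit (not on/once) so listeners added at any time are covered.
function interceptSdkMessages(
  child: { emit(event: string, ...args: unknown[]): boolean },
  onSdkMessage: (payload: unknown) => void
): void {
  const originalEmit = child.emit.bind(child);
  child.emit = (event: string, ...args: unknown[]): boolean => {
    const first = args[0];
    if (event === 'message' && isSdkMessage(first)) {
      onSdkMessage(first.__dd); // divert to the SDK, hide from the customer
      return true;
    }
    return originalEmit(event, ...args);
  };
}

const child = new TinyEmitter();
const customerMessages: unknown[] = [];
const sdkPayloads: unknown[] = [];
child.on('message', (message) => customerMessages.push(message));
interceptSdkMessages(child, (payload) => sdkPayloads.push(payload));
```

Patching `emit` rather than wrapping `on`/`once` means listeners registered before or after the interception are treated the same way.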
Motivation
Electron apps are multi-process by design, but the SDK only monitors the main process. Crashes, errors, and performance issues in child processes (utility, renderer, spawned commands) are invisible. This prototype explores how to bring full process visibility to RUM.
Not intended to be merged — this branch serves as a reference for the production implementation.
Changes
- Utility: {serviceName}) with memory metrics, clean exit actions, and crash errors
- child_process spawn/exec/execFile (sync + async) as RUM resources with duration and exit codes
Test instructions
This is a prototype branch, not for production merge. To explore:
yarn install && yarn build
cd playground && yarn start
Automated:
yarn test (306 tests), cd playground && yarn test:e2e (12 Playwright scenarios)
Checklist