⚗️ [RUM-15520] prototype child process monitoring #96

Draft
bcaudan wants to merge 37 commits into main from bcaudan/prototype-child-processes

Conversation

Collaborator

@bcaudan bcaudan commented Apr 16, 2026

Motivation

Electron apps are multi-process by design, but the SDK only monitors the main process. Crashes, errors, and performance issues in child processes (utility, renderer, spawned commands) are invisible. This prototype explores how to bring full process visibility to RUM.

Not intended to be merged — this branch serves as a reference for the production implementation.

Changes

  • Model utility processes as RUM views (Utility: {serviceName}) with memory metrics, clean exit actions, and crash errors
  • Model renderer processes as RUM views with container hierarchy linking browser-rum page views to their host process
  • Instrument child_process spawn/exec/execFile (sync + async) as RUM resources with duration and exit codes
  • Add a Playwright + mock intake test harness (12 scenarios) for autonomous validation
  • Research docs (landscape survey, RUM mapping, findings, conclusion) and demo materials

[Image: Screenshot 2026-04-16 at 17 37 17]

Test instructions

This is a prototype branch — not for production merge. To explore:

  1. yarn install && yarn build
  2. cd playground && yarn start
  3. Use the demo buttons (Fork Utility, Send Message, Crash Utility, Spawn ls, etc.)
  4. Filter on the session ID in RUM Explorer to see process views and events

Automated: yarn test (306 tests), cd playground && yarn test:e2e (12 Playwright scenarios)

Checklist

  • Tested locally (playground)
  • Added unit tests for this change.
  • Added e2e/integration tests for this change.
  • Updated related documentation.

bcaudan added 30 commits April 14, 2026 11:28
Research documents for evaluating how the Electron SDK could monitor
child processes. Includes priority matrix across 6 mechanisms and
stub for prototype findings.
All tests pass. render-process-gone works for abnormal terminations
only (no noise on normal close). child-process-gone does not cover
Node.js child_process. getAppMetrics provides rich data with zero
instrumentation overhead.
Key findings: __importStar blocks patching (must use require()),
exec→execFile internal chain causes dedup issues, diagnostics_channel
not available for child_process in Node 22. Documents open questions
for bundler testing and production patching approach.
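
The `__importStar` finding above can be illustrated with a simplified model. The sketch assumes the namespace wrapper exposes each export through a getter-only property, which is an assumption about why patching was blocked rather than something confirmed here. `makeNamespace` and `fakeChildProcess` are hypothetical stand-ins, not the real helper or the real module.

```typescript
type Spawn = (cmd: string) => string

// Simplified model of a namespace wrapper: every export becomes a
// getter-only property that reads through to the underlying module object.
function makeNamespace<T extends object>(mod: T): Readonly<T> {
  const wrapper = {} as T
  for (const key of Object.keys(mod) as Array<keyof T>) {
    Object.defineProperty(wrapper, key, { enumerable: true, get: () => mod[key] })
  }
  return wrapper
}

// Hypothetical stand-in for the object returned by require('child_process').
const fakeChildProcess: { spawn: Spawn } = { spawn: (cmd) => `ran ${cmd}` }
const ns = makeNamespace(fakeChildProcess)

// Attempt 1: patch through the namespace. The property has no setter, so the
// write is rejected (TypeError in strict mode, silently ignored otherwise).
try {
  const writableView = ns as { spawn: Spawn }
  writableView.spawn = () => 'patched via namespace'
} catch {
  // Nothing to do: the point is that the patch did not take effect.
}
const namespacePatchIgnored = ns.spawn('ls') === 'ran ls'

// Attempt 2: patch the underlying module object directly, as a require()
// based approach would. The namespace getter now returns the patched function.
const originalSpawn = fakeChildProcess.spawn
Object.defineProperty(fakeChildProcess, 'spawn', {
  configurable: true,
  writable: true,
  value: (cmd: string) => `[instrumented] ${originalSpawn(cmd)}`,
})
const patchedResult = ns.spawn('ls')
```

In this model the patch is only visible when it is applied to the module object itself, which matches the "must use require()" conclusion.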
All scenarios work. Fork wrapper, MessagePort telemetry channel,
parentPort error forwarding, and child-process-gone all validated.
Key finding: a crash and exit(1) produce the same child-process-gone reason;
parentPort error forwarding is the only way to distinguish them.
High-level overview of how each instrumentation works:
- spawn: monkey-patch via Object.defineProperty + require()
- lifecycle: pure event listeners + getAppMetrics polling
- utilityProcess: fork wrapper + 3 data flow paths (child-process-gone,
  parentPort __dd messages, dedicated MessagePort channel)
parentPort __dd messages leak into customer message handlers — suitable
for one-shot crash-time error forwarding but not periodic telemetry.
Dedicated MessagePort channel recommended for ongoing telemetry.
Map prototype datapoints to RUM events:
- Processes as views (utility, renderer, GPU)
- spawn/exec as native resources
- Lifecycle events as errors/actions
- Container hierarchy for renderer views
- Schema fit assessment and product brief alignment
- Playwright fixtures (intake, electronApp, window) adapted from e2e
- Smoke test: app launches, view event arrives at mock intake
- flushTransport IPC for test assertions
- Hidden window in test mode (DD_TEST_MODE)
- ChildProcessCollection monkey-patches spawn, exec, execFile (+ sync variants)
- Resources emitted with type:native, url:child_process://{command}, duration, status_code
- Error cases (ENOENT, timeout) captured with status_code:-1 and error context
- Filters SDK self-instrumentation (sw_vers)
- Uses require() instead of import to bypass Rollup namespace wrapper
- Playground buttons + scenarios for spawn-ls, exec-echo, spawn-fail, exec-timeout
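
The resource shape listed above can be sketched as follows. This is a minimal illustration, not the prototype's `ChildProcessCollection`: it wraps a hypothetical synchronous `run` function on a stand-in object instead of the real `child_process` module, and `instrumentSync`, `NativeResource`, and the injected `now` clock are names invented for the example.

```typescript
interface NativeResource {
  type: 'native'
  url: string
  duration: number // milliseconds
  status_code: number
  error?: string
}

// Hypothetical synchronous runner standing in for execSync-style APIs.
type RunSync = (command: string) => { exitCode: number }

function instrumentSync(
  target: { run: RunSync },
  emit: (resource: NativeResource) => void,
  now: () => number = Date.now
): void {
  const original = target.run
  // Replace the property in place so existing callers hit the wrapper.
  Object.defineProperty(target, 'run', {
    configurable: true,
    writable: true,
    value: (command: string) => {
      const start = now()
      const url = `child_process://${command}`
      let result: { exitCode: number }
      try {
        result = original(command)
      } catch (error) {
        // Failures such as ENOENT or a timeout are reported with status_code -1.
        emit({ type: 'native', url, duration: now() - start, status_code: -1, error: String(error) })
        throw error
      }
      emit({ type: 'native', url, duration: now() - start, status_code: result.exitCode })
      return result
    },
  })
}
```

The wrapper rethrows the original error after emitting, so instrumented callers see the same behavior as before.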
- UtilityProcessCollection monkey-patches utilityProcess.fork()
- Fork starts a new view ("Utility: {serviceName}")
- Clean exit (code=0) emits action, abnormal exit emits error
- Listens to app 'child-process-gone' for crash enrichment
- New RawRumAction type for process lifecycle actions
- Demo worker + playground buttons (fork, send-message, crash)
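
The exit handling described above (clean exit as an action, abnormal exit as an error) might look like the sketch below. The event shapes and the strings `'process exited'` and `'process exited with code N'` are assumptions made for illustration; only the `Utility: {serviceName}` view name comes from the commit notes.

```typescript
type LifecycleEvent =
  | { kind: 'action'; view: { name: string }; name: string }
  | { kind: 'error'; view: { name: string }; message: string }

// Map a utility process exit code to the event the view should receive.
function onUtilityExit(serviceName: string, exitCode: number): LifecycleEvent {
  const view = { name: `Utility: ${serviceName}` }
  if (exitCode === 0) {
    return { kind: 'action', view, name: 'process exited' }
  }
  return { kind: 'error', view, message: `process exited with code ${exitCode}` }
}
```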
- Poll app.getAppMetrics() every 2s, match by pid
- Attach memory_average/memory_max to view context
- Playground test asserts memory fields appear after poll
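
A minimal sketch of the per-pid aggregation behind `memory_average` and `memory_max`, assuming each poll yields samples shaped like the `pid` and `memory.workingSetSize` fields of Electron's `ProcessMetric`. `MemoryTracker` is a hypothetical name, and the polling timer itself is left out so the logic stays testable.

```typescript
interface ProcessSample {
  pid: number
  memory: { workingSetSize: number }
}

interface MemoryStats {
  memory_average: number
  memory_max: number
  samples: number
}

class MemoryTracker {
  private readonly statsByPid: Record<number, MemoryStats> = {}

  // Start tracking a process, for example when its view is created.
  track(pid: number): void {
    this.statsByPid[pid] = { memory_average: 0, memory_max: 0, samples: 0 }
  }

  // Feed the result of one poll; samples for untracked pids are ignored.
  record(metrics: ProcessSample[]): void {
    for (const metric of metrics) {
      const stats = this.statsByPid[metric.pid]
      if (!stats) continue
      const value = metric.memory.workingSetSize
      stats.samples += 1
      // Incremental mean avoids keeping every sample in memory.
      stats.memory_average += (value - stats.memory_average) / stats.samples
      stats.memory_max = Math.max(stats.memory_max, value)
    }
  }

  get(pid: number): MemoryStats | undefined {
    return this.statsByPid[pid]
  }
}
```

In the prototype's setup, `record` would be called with the output of each 2-second `app.getAppMetrics()` poll.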
- RendererProcessCollection detects renderers via webContents polling
- Creates views per renderer process with memory metrics
- render-process-gone emits error on renderer view
- webContents.id→pid mapping survives process death
- Assembly sets container.view.id to renderer process view via senderPid
- BridgeHandler extracts sender pid from IPC events
- Reentrant flag prevents exec→execFile→spawn chain from emitting multiple times
- One-shot guard on spawn prevents both close and error from emitting
- Tests now assert exactly 1 resource per command
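
The two guards above can be sketched in isolation. The chain below is a synchronous stand-in for exec→execFile→spawn; the real functions complete asynchronously, so this only demonstrates the flag logic. `instrument`, `oneShot`, and `emittedNames` are hypothetical names.

```typescript
const emittedNames: string[] = []
let insideInstrumentedCall = false

// Wrap a function so that only the outermost instrumented call emits.
function instrument<A extends unknown[], R>(label: string, fn: (...args: A) => R): (...args: A) => R {
  return (...args: A): R => {
    if (insideInstrumentedCall) {
      // A wrapper higher in the chain already owns this call.
      return fn(...args)
    }
    insideInstrumentedCall = true
    try {
      return fn(...args)
    } finally {
      insideInstrumentedCall = false
      emittedNames.push(label)
    }
  }
}

// One-shot guard: whichever of close/error fires first reports, the other is ignored.
function oneShot(report: () => void): () => void {
  let done = false
  return () => {
    if (done) return
    done = true
    report()
  }
}

const spawn = instrument('spawn', (cmd: string) => `started ${cmd}`)
const execFile = instrument('execFile', (cmd: string) => spawn(cmd))
const exec = instrument('exec', (cmd: string) => execFile(cmd))
```

Calling `exec('echo hi')` records a single `'exec'` entry, while a direct `spawn('ls')` still records `'spawn'`.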
commonContext hook set date to Date.now() at assembly time. Since
ViewCollection did not set date on the raw event, every view update
got the assembly timestamp instead of the view creation time.
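
One way to avoid the timestamp bug described above is for the assembly step to treat `Date.now()` as a default only, so a date set by the collection survives. This is a sketch of that idea under assumed names (`RawEvent`, `assemble`); it does not reproduce the SDK's actual hook.

```typescript
interface RawEvent {
  type: string
  date?: number
}

// Fill in the date only when the raw event did not carry one.
function assemble(raw: RawEvent, now: () => number = Date.now): RawEvent & { date: number } {
  return { ...raw, date: raw.date ?? now() }
}
```

With this shape, a view update that carries its creation time keeps it, and events without a date still receive the assembly time.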
- Use did-start-navigation instead of did-finish-load for early detection
- Lazy getOrCreateRendererViewId backdates view to first event date
- Guarantees renderer view exists before first browser-rum event
- Test asserts renderer view date ≤ browser-rum view date
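
The lazy create-and-backdate behavior above can be sketched as a small registry keyed by pid. `RendererViews` and the `renderer-N` id format are hypothetical; the prototype's `getOrCreateRendererViewId` may differ in detail.

```typescript
interface RendererView {
  id: string
  pid: number
  startTime: number
}

class RendererViews {
  private readonly byPid: Record<number, RendererView> = {}
  private nextId = 1

  // Return the view for a pid, creating it on first use. If an event older
  // than the current start time arrives, move the start time back to it.
  getOrCreate(pid: number, eventDate: number): RendererView {
    const existing = this.byPid[pid]
    if (!existing) {
      const created: RendererView = { id: `renderer-${this.nextId++}`, pid, startTime: eventDate }
      this.byPid[pid] = created
      return created
    }
    if (eventDate < existing.startTime) {
      existing.startTime = eventDate
    }
    return existing
  }
}
```

Because the view is created by whichever event arrives first, the renderer view date can never be later than the first browser-rum event that references it.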
- Distinct serviceName per utility button (dd-demo-fork, dd-demo-message, dd-demo-crash-worker)
- Fix send-message handler, which used once instead of on, so multi-message flows work
- Add copy session ID button
- Add crash renderer button and section
- Use view.memory_average and view.memory_max instead of context
- These are existing RUM view fields (used by mobile SDKs)
- Document CPU metrics gap: Electron provides seconds, schema expects ticks
- Replace app.getAppPath() with [APP_PATH] in serialized events
- Resource URLs now use spawn://, exec://, execFile:// etc. instead of child_process://
- Document path sanitization approaches in findings
Keeps "Renderer: {title}" prefix for consistency with "Utility: {name}".
Pid remains in context only, allowing views to group by page title.
- Listen to page-title-updated to update view name from "unknown" to actual page title
- Backdate renderer view startTime when earlier bridge events arrive
- Test asserts view name contains page title and URL is sanitized
[APP_PATH] broke URL display in Datadog UI. Now strips the path entirely
so file:///full/path/to/dist/index.html becomes file:///index.html.
Removes debug logging from bridge and batch producer.
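
The path stripping described above, where `file:///full/path/to/dist/index.html` becomes `file:///index.html`, could be sketched as a plain string operation. `stripAppPath` is a hypothetical helper, and the separator normalization is an assumption added for the example.

```typescript
// Remove every occurrence of the application path from a URL-like string.
function stripAppPath(url: string, appPath: string): string {
  // Normalize Windows separators and drop trailing slashes (assumption).
  const normalized = appPath.replace(/\\/g, '/').replace(/\/+$/, '')
  if (normalized === '') return url
  return url.split(normalized).join('')
}
```

For example, `stripAppPath('file:///full/path/to/dist/index.html', '/full/path/to/dist')` yields `file:///index.html`.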
- [APP_PATH] placeholder breaks Datadog UI URL display
- app.getAppPath() includes dist/ directory
- Different strategies needed per field type
- Move webContents pid mapping to renderer tracking section
- Clarify memory/CPU field display limitations
- Clarify path sanitization UI behavior uncertainty
- Align polling model with mobile SDK approach
- Remove leftover console.log from renderer
Synthesizes findings from docs 00-04 into actionable work items
organized by topic with value/complexity ratings, sequencing,
and dd-trace integration evaluation as a decision gate.
bcaudan added 4 commits April 16, 2026 13:41
- include view.name in utility process action/error events so assembly
  hooks don't override it with "main process"
- filter ViewCollection counters by view.id to avoid counting utility
  process events in the main view
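
The counter fix above amounts to filtering by `view.id` before counting. A minimal sketch, with `RumEvent` and `countEventsForView` as hypothetical names:

```typescript
interface RumEvent {
  type: 'action' | 'error' | 'resource'
  view: { id: string }
}

// Count only the events that belong to the given view.
function countEventsForView(events: RumEvent[], viewId: string): { action: number; error: number; resource: number } {
  const counters = { action: 0, error: 0, resource: 0 }
  for (const event of events) {
    if (event.view.id !== viewId) continue
    counters[event.type] += 1
  }
  return counters
}
```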
bcaudan added 3 commits April 17, 2026 09:44
- parentPort piggyback + child.emit interception for error forwarding
- customer imports '@datadog/electron-sdk/utility' in worker entry
- rollup entry with strip-electron-require plugin
- playground demo + E2E test
- detailed findings doc (execArgv, NODE_OPTIONS, parentPort buffer, emit vs on/once)
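
The emit interception idea above can be sketched with a minimal emitter. Everything here is an assumption made for illustration: `TinyEmitter` stands in for the utility process object, and the `{ __dd: ... }` envelope shape is guessed from the "__dd messages" wording rather than taken from the prototype.

```typescript
type Listener = (payload: unknown) => void

// Minimal emitter standing in for the utility process object.
class TinyEmitter {
  private listeners: Record<string, Listener[]> = {}

  on(eventName: string, listener: Listener): void {
    const list = this.listeners[eventName] || []
    list.push(listener)
    this.listeners[eventName] = list
  }

  emit(eventName: string, payload: unknown): boolean {
    const list = this.listeners[eventName] || []
    for (const listener of list) listener(payload)
    return list.length > 0
  }
}

// Assumed envelope: SDK messages are objects carrying a __dd key.
function isSdkMessage(payload: unknown): payload is { __dd: unknown } {
  return typeof payload === 'object' && payload !== null && '__dd' in payload
}

// Route SDK messages to the SDK and keep them away from customer handlers.
function interceptSdkMessages(child: TinyEmitter, onSdkMessage: (data: unknown) => void): void {
  const originalEmit = child.emit.bind(child)
  child.emit = (eventName: string, payload: unknown): boolean => {
    if (eventName === 'message' && isSdkMessage(payload)) {
      onSdkMessage(payload.__dd)
      return true // swallowed: customer listeners never see it
    }
    return originalEmit(eventName, payload)
  }
}
```

Intercepting `emit` rather than adding another listener is what keeps SDK traffic out of handlers registered with `on` or `once`.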