- Switch the local repository to `origin/windows-transports-rust-go` without risking unrelated work.
- Open a GitHub pull request for the Windows transport branch against `main`.
- Keep the clean PR head branch in sync with the latest commits on `windows-transports-rust-go`.
- Pull the repository after the PR merge and move the local worktree back to `main`.
- Perform a production-readiness review of the full library across Linux and Windows, for C, Rust, and Go, and decide whether it is ready to merge into Netdata.
- Design the Windows CI wiring needed to validate this library automatically on GitHub Actions before calling it production-ready for Netdata.
- User decision (2026-03-11, after remediation):
  - commit the current fixes and push them
  - use `ssh win11` for real Windows validation before finalizing the commit
  - use `/c/Users/costa/src/plugin-ipc-win.git` on `win11` for validation
  - reset that Windows test clone before applying the current patch and running tests
  - next step is to wire the same Windows validation into CI
- Fact: local repository `/home/costa/src/plugin-ipc.git` is currently clean on branch `main` (`git status --short --branch` showed `## main...origin/main`).
- Fact: remote branch `origin/windows-transports-rust-go` exists and was fetched successfully on 2026-03-11.
- Fact: local branch `windows-transports-rust-go` now tracks `origin/windows-transports-rust-go`.
- Fact: `origin/windows-transports-rust-go` is 2 commits ahead of `origin/main`:
  - `d9246b0` Add Rust/Go Windows transports, fix SHM store-load reordering race
  - `9f050c7` Fix Rust bench: use RDTSC timing instead of QPC per-iteration
- Fact: current diff versus `origin/main` touches 24 files with 4897 insertions and 97 deletions, including Windows transport code for C, Rust, and Go, Windows benchmark drivers/docs, and Windows smoke/live bench scripts.
- Fact: `origin` remote HEAD branch is `main`.
- Fact: GitHub currently has no open or closed PR with head `netdata:windows-transports-rust-go`.
- Fact: both branch commits currently include `Co-Authored-By` trailers, so opening a PR directly from that head would carry disallowed attribution text into the PR commit list.
- Fact: clean branch `windows-transports-rust-go-pr` was created from `origin/main`, replayed with sanitized commit messages, pushed to `origin`, and used to open PR #1.
- Fact: PR URL is https://github.com/netdata/plugin-ipc/pull/1.
- Fact: after `windows-transports-rust-go` advanced with `43ee635` and `d910ab0`, the clean PR branch was synced by replaying those commits with sanitized messages and pushing:
  - `b95b8f5` Update benchmark docs and README with post-RDTSC Rust performance
  - `965879e` Fill Rust rate-limited 100K rps benchmark row
- Fact: the synced `windows-transports-rust-go-pr` branch now matches the tree of `origin/windows-transports-rust-go`.
- Fact: local worktree is now back on `main`, and `main` was fast-forwarded to merged commit `b2bdcd5` Add Windows IPC transports and benchmark coverage (#1).
- Fact: the local `TODO-plugin-ipc.md` note was stashed for the pull and then reapplied cleanly on top of `main`.
- Review findings (2026-03-11):
  - `cmake --build build --target test` fails on Linux because the Rust bench-driver manifest always includes the Windows binary and CMake builds the manifest without selecting a single bin; `netipc-live-uds-interop` and `netipc-uds-negotiation-negative` both fail for this reason.
  - `GOOS=windows GOARCH=amd64 go test ./...` fails because the Go POSIX transport package is not build-tagged out on Windows while its syscall helper definitions are unix-only.
  - POSIX SHM stale-endpoint takeover in both C and Rust only checks `owner_pid` even though the shared region stores an `owner_generation`; PID reuse can therefore make stale endpoints appear live.
  - Windows validation scripts remain workstation-specific (`/c/Users/costa/...`, hard-coded PATH/GOROOT), so the documented Windows validation path is not portable or CI-ready.
  - README still documents current limitations: Go/Rust rely on helper binaries for validated live paths, and Netdata integration wiring is explicitly out of scope in this repository phase.
- Current remediation plan:
  - gate the Rust Windows bench binary behind an explicit feature and make CMake build explicit per-bin targets
  - add proper unix build tags to the Go POSIX transport sources/tests
  - strengthen POSIX SHM ownership metadata so stale-endpoint reclaim is resilient to PID reuse
  - make Windows helper scripts path-configurable instead of Costa-workstation-specific
  - rerun Linux validation and cross-build checks after the fixes
  - run the smoke and live Windows checks on `win11` before committing
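The PID-reuse hazard behind the SHM ownership item can be sketched in C. This is a minimal illustration, assuming the stored "generation" is the owner's process start time; all struct, field, and function names here are hypothetical, not the library's real layout, and the `/proc/<pid>/stat` probe is Linux-only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t  owner_pid;        /* endpoint owner recorded in the SHM region */
    uint64_t owner_generation; /* e.g. start-time ticks of that process */
} shm_owner_meta;              /* hypothetical layout */

/* Field 22 of /proc/<pid>/stat is the process start time; 0 = no process. */
static uint64_t pid_start_time(int32_t pid) {
    char path[64], buf[2048];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');      /* comm may contain spaces/parens */
    if (!p) return 0;
    char *tok = strtok(p + 2, " ");   /* field 3 (state) follows the paren */
    for (int field = 3; tok && field < 22; field++)
        tok = strtok(NULL, " ");
    return tok ? strtoull(tok, NULL, 10) : 0;
}

/* Checking only "does the PID exist?" is fooled by PID reuse; comparing
 * the stored generation against the live process closes that window. */
static bool endpoint_is_stale(const shm_owner_meta *m) {
    if (m->owner_pid <= 0) return true;          /* never owned or released */
    uint64_t gen = pid_start_time(m->owner_pid);
    if (gen == 0) return true;                   /* owner PID is gone */
    return gen != m->owner_generation;           /* PID was recycled */
}
```

A reclaim path built this way treats "PID alive but generation mismatch" as stale, which is exactly the case the review flagged.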
- Current immediate request:
  - switch locally to `origin/windows-transports-rust-go`
  - create a PR from `windows-transports-rust-go` to `main`
  - sync the clean PR branch to the latest `origin/windows-transports-rust-go`
  - pull the repository and check out `main`
  - review the full codebase for production readiness and Netdata merge suitability
  - commit and push the remediation changes after Windows validation on `win11`
- Current execution plan:
  - keep the local repository on tracking branch `windows-transports-rust-go`
  - keep the temporary clean PR branch available on `origin` as the PR head
  - leave the current repository on `windows-transports-rust-go`
  - replay any new commits from `windows-transports-rust-go` onto the clean PR branch with sanitized commit messages
  - verify the current worktree state after the interrupted turn
  - update local refs
  - switch the main worktree to `main`
  - fast-forward local `main` to `origin/main`
  - audit repository structure, build system, and test coverage
  - inspect Linux and Windows transports in C, Rust, and Go
  - run Linux-side validation that is possible in this environment
  - validate the same changes on `win11` with the real Windows runtime
  - if that validation passes, commit the specific files changed in this task and push `main`
  - produce a review with concrete findings, risks, and a readiness recommendation
- Windows validation results on `win11` (2026-03-11):
  - Fact: `/c/Users/costa/src/plugin-ipc-win.git` was reset to `origin/main`, cleaned, and used as the Windows validation clone.
  - Fact: the current local remediation patch was applied there cleanly with `git apply`.
  - Fact: `go test ./...` under `src/go` passed on Windows when using the official Go installation.
  - Fact: `cargo check --manifest-path bench/drivers/rust/Cargo.toml --features windows-driver --bin netipc_live_win_rs` passed on Windows.
  - Fact: the first smoke-script attempt failed because `win11` has two Go installations and CMake selected `/mingw64/bin/go.exe`, which is broken on that host unless `GOROOT` is set and its stdlib/tool versions match.
  - Fact: `tests/smoke-win.sh` and `tests/run-live-win-bench.sh` were then updated to auto-prefer `/c/Program Files/Go/bin/go.exe` when present and to pass that executable into CMake configuration.
  - Fact: after the MSYS/native transition changes and the argument-conversion fix, `bash tests/smoke-win.sh` passed on `win11` with `32 passed, 0 failed`.
  - Fact: the smoke matrix covered the full directed interoperability set across `c-native`, `c-msys`, `rust-native`, and `go-native`, under both Windows transport profiles.
  - Fact: `bash tests/run-live-win-bench.sh` completed successfully on `win11` and produced a full 64-row directed benchmark matrix in `/tmp/bench_results_4577.txt`.
  - Fact: benchmark results confirm that `c-msys` stays in the same performance class as `c-native` on the Windows transport.
  - Fact: benchmark results also expose a real client-side asymmetry:
    - C and Rust clients drive SHM HYBRID at about 3.1M to 3.4M rps against C/Rust servers.
    - Go clients top out closer to about 1.1M to 1.7M rps depending on the server implementation.
  - Fact: a Windows-specific SHM spin sweep was executed on `win11` using `rust-native -> rust-native` first, then a representative transition shortlist.
  - Fact: the broad Rust-only sweep over `4, 8, 16, 32, 64, 128, 256, 512, 1024` showed that Windows is not in the same regime as Linux:
    - throughput stayed in named-pipe-class territory up to `64`
    - `128` and `256` improved but were still far from the top-end SHM behavior
    - `512` was the first value that restored full-rate SHM behavior for both max-rate and `100k/s`
    - `1024` added CPU with essentially no throughput gain over `512`
  - Fact: the refined Rust-only sweep around the knee (`320, 384, 448, 512, 640, 768`) showed high Hyper-V noise but still narrowed the useful window to about `384` to `640`.
  - Fact: representative transition validation was then run for spins `384`, `448`, `512`, and `640` on:
    - `rust-native -> rust-native`
    - `c-msys -> rust-native`
    - `rust-native -> c-msys`
    - `rust-native -> go-native`
  - Fact: the representative results did not reveal a single clean Linux-style knee:
    - lower spins such as `384` often reduced CPU, but produced much worse `100k/s` p99 tails on transition pairs
    - higher spins such as `640` improved several max-rate runs, but also raised CPU and were not consistently better across all pairs
    - `512` emerged as the most defensible current default candidate because it restored full-rate SHM behavior broadly, avoided the obvious low-spin collapse, and did not materially underperform `640`
- Goal: define how to wire Windows validation into CI for this repository.
- Need:
  - inspect existing workflow patterns in this repo
  - verify current official GitHub Actions guidance for Windows + MSYS2 + artifacts/caching
  - recommend a CI structure that matches the validated local/`win11` workflow with low maintenance risk
  - carry forward the full transition matrix requirement rather than collapsing back to same-language-only checks
  - determine the Windows SHM HYBRID spin sweet spot instead of assuming the Linux value applies
  - compare maximum-throughput gain versus CPU/tail-latency cost on `win11`
  - evaluate whether CI can identify and report Linux and Windows SHM spin sweet spots on production-like VMs
  - inspect the existing Netdata agent `ebpf <-> cgroups` plugin-to-plugin communication path as a concrete reference before deciding how much of `plugin-ipc` should mirror established Netdata practice
- User clarification (2026-03-11):
  - the transition requirement is not a POSIX transport on Windows
  - the C library must compile under the MSYS runtime while still using the native Windows IPC transport
  - that MSYS-built C variant must interoperate with native Windows Rust and Go binaries
  - `benchmark-windows.md` must include both `c-native` and `c-msys`
  - benchmark scope should cover all directed implementation pairs, with each timed run capped at 5 seconds
  - Linux already converged on `20` spins as the best throughput-inflection point after testing powers of two and then refining near `16`
  - Windows needs its own sweep because its current SHM HYBRID default is much higher than Linux
  - question to answer now: can CI reliably identify the Linux and Windows sweet spots on real production VMs instead of only local test hosts
  - latest Linux decision: use a safer higher POSIX SHM default of `128` spins even if it increases CPU, because the priority is to avoid under-spinning on production VMs
  - new server-model proposal under discussion:
    - callers should be able to integrate client and server sockets/handles into their own event loops
    - servers must be multi-client; single-client server objects are a blocker
    - the library must also support a managed server mode where the caller registers per-request-type callbacks and asks the library to run a worker pool and own request servicing until explicit shutdown
    - low-level integration surface should expose native wait objects:
      - POSIX: `fd`
      - Windows: `HANDLE`
- Working repository: `/home/costa/src/plugin-ipc.git` (local git repo initialized, branch `main`).
- Port status: source/tests/docs/tooling copied from `/home/costa/src/ipc-test` and validated in-place.
- Validation already run here: `make`, `./tests/run-interop.sh`, `./tests/run-live-uds-interop.sh` (all pass).
- Repository refactor status:
  - approved Netdata-style layout implemented:
    - `src/libnetdata/netipc/`
    - `src/go/pkg/netipc/`
    - `src/crates/netipc/`
  - helper programs moved under:
    - `tests/fixtures/`
    - `bench/drivers/`
  - root `CMakeLists.txt` now orchestrates mixed-language builds.
- Latest full validation after the refactor:
  - `./tests/run-interop.sh`
  - `./tests/run-live-interop.sh`
  - `./tests/run-live-uds-interop.sh`
  - `./tests/run-live-uds-bench.sh`
  - `./tests/run-negotiated-profile-bench.sh`
  - `./tests/run-uds-negotiation-negative.sh`
  - `./tests/run-uds-seqpacket.sh`
  - Result: all pass in the refactored tree.
- Cleanup status after user approval:
  - obsolete legacy `interop/` subtree removed
  - obsolete root `include/` directory removed
- Latest strategic decisions already recorded below:
  - Repo identity: `netdata/plugin-ipc`.
  - Coverage policy: 100% line + 100% branch for library source files.
  - Benchmark CI model: GitHub-hosted cloud VMs (with repetition/noise controls).
  - Windows baseline: Named Pipes, one native Windows library implementation (no separate MSYS2 variant).
  - Windows builds must also work under MSYS2 POSIX emulation during the current Netdata transition.
  - Windows C build mode: compile native Win32 code from MSYS2 `mingw64`/`ucrt64`, not the plain `msys` runtime shell.
- Next starting point for the next session:
  - Continue replacing placeholder Rust/Go library scaffolding with real reusable API implementations.
- Latest Windows probe findings not yet committed:
  - `win11` repo path used for testing is `/c/Users/costa/src/plugin-ipc-win.git`
  - MSYS2/POSIX C configure+build passes there
  - native `MINGW64` C build currently fails on POSIX-only headers (`arpa/inet.h`, `poll.h`, `sys/mman.h`)
  - Rust on `win11` is currently `x86_64-pc-windows-msvc`, and the crate still tries to compile POSIX transport code on that Windows target
  - Go on `win11` still shows a local toolchain inconsistency during `go test` (`compile: version "go1.26.1" does not match go tool version "go1.26.0"`)
- New active task:
  - pull the latest pushed Windows version from GitHub
  - inspect the Windows implementation changes
  - investigate the reported Windows throughput regression (`~16k req/s`)
- Current immediate request:
  - pull the latest pushed code locally for inspection without disrupting in-progress local notes
- Build cross-language IPC libraries for Netdata plugins in C, Rust, and Go.
- Each plugin can be both IPC server and IPC client.
- Cross-language compatibility is required (C <-> Rust <-> Go).
- Target POSIX (Linux/macOS/FreeBSD) and Windows.
- Use the lightest/highest-performance transport possible.
- Start with IPC transport benchmarks before final protocol/API decisions.
I have the following task: netdata plugins are independent processes written in various languages (usually: C, Go, Rust).
plugins are usually authoritative for some kind of information, for example cgroups.plugin knows the names of all containers and cgroups, apps.plugin knows everything about processes, ebpf collects everything from the kernel, network-viewer.plugin knows everything about sockets, netflow.plugin knows everything about ip-to-asn and ip-to-country, etc.
I want to develop a library in C, Rust and Go, enabling all plugins to become IPC servers and clients and therefore talk to each other, asking the right plugin to provide information about its domain. This should be cross-language compatible, so that a C plugin can ask a Rust plugin, or a Rust plugin can ask a Go plugin.
I am thinking of a setup where plugins expose a unix domain socket (on posix systems) at the netdata RUN_DIR. So the rust netflow.plugin exposes `ip-asn-geo.sock`. Any other plugin knows that it needs `ip-asn-geo` and just opens the socket and starts communicating without knowing anything about which plugin is the destination, in which language it is written, etc. So, the RUN_DIR of netdata will end up having socket files from various plugins, exposing data enrichment features in a dynamic way.
Ideally, I want 3x libraries: C, Rust, Go, with 2 implementations each: Posix (linux, macos, freebsd), Windows.
Authorization will be done by a shared SALT. So, Netdata already exposes a session id, a UUID which we can use to authorize requests. This means that the plugins will be able to use these services only when spawned by the same netdata.
Another very important requirement is performance. The transport layer of this communication should be the lightest possible. If there is a way for this communication to have a smaller CPU impact and lower latency, we must use it.
Regarding the internal client/server API, ideally I want these to be strongly typed, so these libraries should accept and return structured data. This may mean that in order to support a new type of message we may have to recompile the libraries. This is acceptable.
If we manage to have a transport layer that supports millions of requests/responses per second, we don't need a batch API. If, however, the transport is slow, we must provide a way for clients to batch multiple requests per message. For this reason I suggest first implementing a test comparing the various forms of IPC, so that we can choose the most performant one.
- Fact (2026-03-12): Netdata already has an ad-hoc plugin-to-plugin communication path between `cgroups.plugin` and `ebpf.plugin`.
- Evidence:
  - Shared-memory layout and names are defined in `/home/costa/src/netdata/netdata/src/collectors/cgroups.plugin/sys_fs_cgroup.h`.
  - Producer initializes and populates the shared memory in `/home/costa/src/netdata/netdata/src/collectors/cgroups.plugin/cgroup-discovery.c`.
  - Consumer opens, maps, and reads it in `/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_cgroup.c`.
- Fact: this existing path is not a generic RPC protocol.
- Fact: it is Linux/POSIX-only shared memory plus a named semaphore, with a fixed shared struct layout and direct in-process parsing by the consumer.
- Fact: `ebpf.plugin` keeps a local cache from this shared state and uses it widely across multiple modules, so this is effectively a periodically refreshed snapshot feed, not a request-per-lookup interface.
- Evidence:
  - the integration thread periodically maps or parses the shared memory in `/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_cgroup.c`.
  - downstream eBPF modules consume the cached data via `ebpf_cgroup_pids`, for example in `/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_process.c`.
- Implication: `plugin-ipc` could replace it functionally, but not as a naive 1:1 transport swap. The natural `plugin-ipc` replacement would be a snapshot/batch service with client-side caching, or a future shared-memory backend, not a per-item synchronous lookup in the hot path.
- Implication: it proves the use case and some lifecycle patterns, but it is not a reusable cross-language/cross-OS service bus.
- Fact: Repository path `/home/costa/src/ipc-test` is currently empty (no source files, no tests, no existing TODO file).
- Fact: There is no pre-existing IPC implementation in this repo to extend.
- Fact: There is no existing build layout yet (no CMake/Cargo/go.mod).
- Fact: Host for first benchmark pass is Linux x86_64 (Manjaro, kernel 6.18.12).
- Fact: Toolchain availability confirmed: gcc 15.2.1, rustc 1.91.1, go 1.25.7.
- Fact: External performance tools available: `pidstat`, `mpstat`, `perf`.
- Implication: We should treat this as a greenfield design + benchmark project.
- Risk: Without early benchmark data, choosing protocol/serialization prematurely may lock in suboptimal latency/CPU costs.
- No active user-facing design decisions are currently blocking the protocol/API rewrite.
- The authoritative design is now captured in:
  - Derived Protocol Sketch
  - Derived Public API Sketch
  - Derived Method Example: ip_to_asn
  - Derived Helper Naming / Builder Surface
  - Current Phase Plan: Protocol/API Rewrite
- Pre-Netdata integration is now explicitly blocked on a hardening pass that addresses the robustness findings from Decision #90.
- This TODO file itself now requires cleanup before further major work:
  - keep only the active requirements, analysis, made decisions, final design, hardening gate, and execution plan
  - remove or archive the historical session narrative/progress transcript that no longer helps implementation
- 1-7. Earlier Windows CI / MSYS validation / spin-tuning questions are deferred until after the protocol/API rewrite lands on the baseline transports.
  - Reason:
    - the active rewrite changes the wire envelope, public API shape, method codecs, and batch model
    - wiring CI or final performance policy before that would be premature
- 8-14. These are no longer open user questions.
  - They are resolved by Decisions #44-#56 and by the derived protocol/API sections below.
- Transport strategy for v1 benchmark candidate set: Option C (benchmark both stream and message-oriented candidates).
  - Source: user decision "1c".
- Serialization format for strongly typed cross-language messages: Option C (custom binary format, C struct baseline).
  - Source: user decision "2c".
  - User rationale: C struct is wire baseline; Rust and Go map to/from that binary format and native structures.
- Service discovery model in `RUN_DIR`: Option A (convention-based socket naming only).
  - Source: user decision "3a".
  - User note: plugins are assumed same version; version mismatches can hard-fail on connect/handshake.
- Authorization handshake: Option A (shared SALT/session UUID in initial auth message).
  - Source: user decision "4a".
  - User rationale: low-risk internal communication; socket OS permissions provide access control.
- Batching policy: Option C (add only if benchmarks fail the performance target).
  - Source: user decision "5c".
  - Acceptance target: transport should reach 1M+ req/s on target high-end workstation with minimal CPU overhead.
- Wire binary layout strategy: Option B (explicit field-by-field encode/decode with fixed endianness, preserving C-struct schema baseline).
  - Source: user decision "1B".
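The field-by-field fixed-endianness decision can be illustrated with a small C sketch. The message type below is purely hypothetical; the point is that each field is emitted byte-by-byte in little-endian order, so the wire bytes are identical on any host and independent of struct padding.

```c
#include <stdint.h>

typedef struct {
    uint32_t request_id;
    uint16_t method;
    uint16_t flags;
} example_header;   /* illustrative message, not the real schema */

static void put_u32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}
static void put_u16le(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)v; p[1] = (uint8_t)(v >> 8);
}
static uint32_t get_u32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static uint16_t get_u16le(const uint8_t *p) {
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* 8 bytes on the wire: no struct memcpy, no host-order leakage */
static void encode_example(uint8_t out[8], const example_header *h) {
    put_u32le(out + 0, h->request_id);
    put_u16le(out + 4, h->method);
    put_u16le(out + 6, h->flags);
}
static void decode_example(example_header *h, const uint8_t in[8]) {
    h->request_id = get_u32le(in + 0);
    h->method     = get_u16le(in + 4);
    h->flags      = get_u16le(in + 6);
}
```

Rust and Go peers would implement the same per-field byte layout against their native structures, which is what makes the C-struct schema the cross-language baseline.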
- POSIX transport benchmark breadth: Option C and broader (benchmark multiple POSIX methodologies, including `AF_UNIX` variants and shared memory + spinlock candidates).
  - Source: user decision "2C and even more... what about shared memory + spinlock?".
  - User intent: identify best possible POSIX methodology by measurement, not assumption.
- Connection lifecycle model for benchmarks: persistent sessions.
  - Source: user clarification "not connect->request->response->disconnect per request; connect once ... disconnect on agent shutdown".
  - Implication: benchmark steady-state request/response over long-lived connections/channels.
- CPU threshold policy for first pass: no hard threshold upfront.
  - Source: user clarification "I don't know the threshold... want most efficient latency, throughput, cpu utilization".
  - Implication: first report is comparative and multi-metric; batching decision remains data-driven.
- Shared-memory synchronization benchmark strategy: Option C (benchmark both spinlock-only and blocking/hybrid synchronization).
  - Source: user agreement to proposed options.
- Benchmark mode scope: Option C (strict ping-pong baseline plus secondary pipelined mode).
  - Source: user agreement to proposed options.
- CPU measurement method: Option C (collect both external sampling and in-process accounting).
  - Source: user agreement to proposed options.
- Add and benchmark a shared-memory hybrid synchronization transport.
  - Source: user decision "implement in the test the hybrid you recommend and measure it".
  - Intent: measure if hybrid preserves low latency/high throughput while reducing unnecessary spin behavior.
- Prioritize single-threaded client ping-pong as the primary optimization target.
  - Source: user clarification about apps.plugin-like enrichment loops.
  - Intent: optimize for one client thread doing many sequential enrichments, then verify scaling with more clients.
- Add a rate-limited benchmark mode for hypothesis testing at fixed request rate.
  - Source: user request to validate CPU behavior at 100k req/s.
  - Intent: compare `shm-spin` vs `shm-hybrid` CPU utilization under the same fixed throughput target.
- Tune hybrid spin window from 256 to 64 and re-measure at 100k req/s.
  - Source: user request "spin 64, not 256".
  - Intent: test whether shorter spin window increases blocking and reduces CPU at fixed high request rate.
- Run a sweep over multiple hybrid spin values to map impact on:
  - CPU utilization at fixed 100k req/s.
  - Maximum throughput at unlimited rate.
- Optimize using "request-rate increase per spin" as the primary tuning unit for hybrid spin window.
  - Source: user guidance that sweet spot is likely around 8 or 16.
  - Intent: compute marginal throughput gain per added spin and identify diminishing-return point.
- Set the default shared-memory hybrid spin window to 20 tries.
  - Source: user decision "yes, put 20".
  - Intent: make the benchmark default align with the selected spin/throughput CPU tradeoff.
- Prefer `shm-hybrid` with spin window 20 as the default method for plugin IPC.
  - Source: user decision "I think 1 is our preferred method."
  - Intent: optimize for lower CPU at target enrichment rates while keeping good throughput and acceptable latency.
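The spin-then-block idea behind `shm-hybrid` can be sketched with C11 atomics. This is a minimal illustration: the sleep fallback stands in for the transport's real futex/event wakeup, and the names are assumptions, not the library implementation.

```c
#include <stdatomic.h>
#include <time.h>

#define NETIPC_HYBRID_SPIN 20   /* the default spin window chosen above */

/* Spin up to `spins` polls on the sequence word; if the peer has not
 * published `expected` by then, fall back to a blocking-style wait so
 * CPU stops burning when the peer is slow or idle. */
static void hybrid_wait(atomic_uint *seq, unsigned expected, int spins) {
    for (int i = 0; i < spins; i++)
        if (atomic_load_explicit(seq, memory_order_acquire) == expected)
            return;                          /* fast path: no syscall at all */
    struct timespec ts = { 0, 1000 };        /* start at 1 microsecond */
    while (atomic_load_explicit(seq, memory_order_acquire) != expected) {
        nanosleep(&ts, NULL);                /* slow path: yield the CPU */
        if (ts.tv_nsec < 100000) ts.tv_nsec *= 2;
    }
}
```

The sweep above is essentially tuning the `spins` parameter: too small and the transport falls into the blocking path on every request (named-pipe-class latency); too large and CPU is wasted with no extra throughput.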
- Execute the next implementation phase now:
  - Freeze POSIX v1 baseline to `shm-hybrid(20)`.
  - Implement first typed request/response schema and C API.
  - Add C/Rust/Go interoperability tests for the typed schema.
  - Source: user decision "yes, do it".
- Proceed with stale-endpoint recovery and live cross-language transport interoperability validation.
  - Source: user decision "proceed".
  - Phase implementation choice: use Rust/Go FFI shims to the C transport API for live `shm-hybrid` session tests in this iteration.
- Proceed to native Rust/Go live transport implementations (no dependency on `libnetipc.a`) while preserving the same live interop matrix.
  - Source: user decision "ok, proceed".
  - Intent: validate cross-language `shm-hybrid` behavior with independent Rust/Go transport code paths.
- Windows CI scope is not ready to proceed with native-only validation.
  - Source: user decision "The MSYS2 (CYGWIN) path must be validated before continuing".
  - Intent: require validation of the POSIX-emulation path on Windows before finalizing CI coverage.
- The C implementation must also build for the MSYS2 POSIX runtime target and be included in the CI matrix.
  - Source: user decision "the C version must be built for MSYS2 (CYGWIN) target too and it must be added to the matrix."
  - Intent: validate interoperability expectations across native Rust/Go and the POSIX-emulated C build on Windows.
- Transitional Windows support model:
  - Source: user clarification on 2026-03-11.
  - Fact pattern to validate:
    - the C library must compile as native Windows code under the MSYS runtime, linked with `msys.dll` / POSIX emulation, because this matches how Netdata currently runs on Windows
    - the Rust and Go implementations must remain native Windows binaries without POSIX emulation
    - during the transition to fully native Netdata on Windows, the MSYS-built C implementation must interoperate with the native Windows Rust and Go implementations
  - Implication: the required Windows transition matrix is not "MSYS C talking to POSIX transports"; it is "MSYS-built C using the Windows transport, interoperating with native Rust/Go over the same Windows IPC mechanisms"
- Windows benchmark reporting scope:
  - Source: user decision on 2026-03-11.
  - Requirement: `benchmark-windows.md` must include results for both `c-native` and `c-msys`.
  - Implication: benchmark tooling and documentation must distinguish the two C runtime environments explicitly instead of reporting a single generic `c` row.
- Execution workflow for the transition work:
  - Source: user decision on 2026-03-11.
  - Requirement: implement locally in this repo, push the branch, then use `ssh win11` to pull, compile, and run the Windows validation there.
- Typed-library requirement for all languages:
  - Source: user clarification on 2026-03-11.
  - Requirement:
    - clients and servers must use strongly typed request/response structures in C, Rust, and Go
    - the library, not the caller, must translate between typed structures and the wire frame
    - every new message type therefore requires explicit request and response converters in all supported language implementations
  - Implication: adding a new RPC is a schema + codec change in C, Rust, and Go, not just a transport change.
- Cross-platform discovery identity:
  - Source: user decision on 2026-03-12.
  - Decision:
    - use `service_namespace` as the public cross-platform namespace term
    - clients resolve services by `service_namespace + service_name`
    - POSIX maps this to `/run/netdata/<service>.sock`
    - Windows maps this to a derived named-pipe name scoped by the namespace
  - Implication:
    - discovery is service-oriented, not plugin-oriented
    - one plugin process may expose multiple services without clients knowing plugin identity
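The identity-to-endpoint mapping can be sketched in C. Only the POSIX `RUN_DIR/<service>.sock` convention is fixed in the notes; the Windows pipe-name encoding below is an assumption for illustration, as is every function name.

```c
#include <stdio.h>

/* POSIX convention from the notes: RUN_DIR/<service>.sock */
static int posix_endpoint(char *out, size_t cap,
                          const char *run_dir, const char *service) {
    return snprintf(out, cap, "%s/%s.sock", run_dir, service);
}

/* Hypothetical Windows encoding: scope the pipe name by the namespace so
 * two Netdata instances cannot collide on the global pipe namespace. */
static int windows_endpoint(char *out, size_t cap,
                            const char *service_namespace,
                            const char *service) {
    return snprintf(out, cap, "\\\\.\\pipe\\%s.%s",
                    service_namespace, service);
}
```

Either way, a client only ever supplies `service_namespace + service_name`; which plugin process answers on that endpoint stays invisible to it.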
- Service export model:
  - Source: user decision on 2026-03-12.
  - Decision:
    - use one endpoint per service
    - no separate registry daemon/file; `/run/netdata` is the registry on POSIX
    - Windows uses the same logical identity, but encoded into named-pipe names
  - Implication:
    - service names are the stable public contract
    - plugin/process identity remains an internal deployment detail
- Server API layering:
  - Source: user decisions on 2026-03-12.
  - Decision:
    - expose both low-level and high-level server APIs
    - the high-level managed server must be a thin wrapper over the low-level transport/session engine
    - low-level server object hierarchy is: `server_host` -> `service_listener` -> `session`
    - managed mode is a wrapper over `server_host`
  - Implication:
    - one transport/session implementation is the source of truth
    - high-level server helpers must not re-implement transport, framing, or negotiation logic
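The object hierarchy above can be sketched as C data structures. Only the three object names come from the notes; the fields and helper functions are illustrative assumptions.

```c
#include <stddef.h>

typedef struct server_host      server_host;      /* owns all listeners */
typedef struct service_listener service_listener; /* one exported service */
typedef struct session          session;          /* one connected client */

struct session          { session *next; int peer_fd; };
struct service_listener { service_listener *next; const char *service_name;
                          int listen_fd; session *sessions; };
struct server_host      { service_listener *listeners; };

/* hypothetical helpers: a managed-mode wrapper would call these same
 * primitives rather than re-implementing listening or sessions */
static void host_add_listener(server_host *h, service_listener *l) {
    l->next = h->listeners;
    h->listeners = l;
}
static size_t host_listener_count(const server_host *h) {
    size_t n = 0;
    for (const service_listener *l = h->listeners; l; l = l->next) n++;
    return n;
}
```

One `server_host` exporting several `service_listener` objects is what lets a single plugin process publish multiple services behind one worker pool.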
- Service registration lifecycle:
  - Source: user decision on 2026-03-12.
  - Decision:
    - the exported service set is fixed after server startup for v1
  - Implication:
    - adding or removing services requires process restart
    - service lifecycle stays simpler and avoids dynamic registration races in v1
- Handler contract:
  - Source: user decision on 2026-03-12.
  - Decision:
    - handler shape is `ctx + typed_request -> typed_response`
  - Implication:
    - business code receives typed payloads
    - request/session metadata remains available via the context object
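The `ctx + typed_request -> typed_response` shape can be sketched in C. All type and function names here are hypothetical illustrations; only the handler shape itself comes from the notes.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct netipc_ctx netipc_ctx;   /* opaque per-request context */

typedef struct { uint32_t ip; } ip_to_asn_request;
typedef struct { uint32_t asn; bool found; } ip_to_asn_response;

/* one handler per request type; the library decodes the wire frame into
 * the typed request and encodes the typed response back, so business
 * code never touches bytes */
typedef bool (*ip_to_asn_handler)(netipc_ctx *ctx,
                                  const ip_to_asn_request *req,
                                  ip_to_asn_response *resp);

static bool handle_ip_to_asn(netipc_ctx *ctx,
                             const ip_to_asn_request *req,
                             ip_to_asn_response *resp) {
    (void)ctx;   /* session metadata would be queried through ctx */
    /* toy lookup standing in for the plugin's real data source */
    resp->asn   = (req->ip == 0x01020304u) ? 64500u : 0u;
    resp->found = resp->asn != 0;
    return true;
}
```

Registering such a handler with the managed server is what the per-request-type callback model in the server proposal refers to.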
- High-level client model:
  - Source: user decisions on 2026-03-12.
  - Decision:
    - one persistent high-level client context per service, for example `ctx_ip_to_asn`
    - `initialize()` creates the context but does not require the provider to be up
    - caller owns reconnect cadence by calling `refresh(ctx)` from its normal loop
    - `ready(ctx)` must be a cheap cached predicate
    - `status(ctx)` must expose detailed operational state/counters
  - Implication:
    - no hidden client background thread in v1
    - startup order can remain random
    - plugin hot paths can stay cheap while recovery happens in the outer loop
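The caller-driven recovery contract can be sketched in C with stub state transitions. Everything here is an assumption built from the semantics above: `ready` only reads cached state (no syscalls), while `refresh` is the only place reconnection work happens.

```c
#include <stdbool.h>

typedef enum { ST_DISCONNECTED, ST_CONNECTING, ST_READY } client_state;

typedef struct {
    client_state state;   /* cached; updated only by refresh() */
    unsigned reconnects;  /* operational counter exposed via status() */
} client_ctx;             /* hypothetical, not the library type */

/* cheap cached predicate: safe on the hot path */
static bool client_ready(const client_ctx *c) { return c->state == ST_READY; }

/* caller-driven recovery step; this stub "connects" on the second call,
 * standing in for real connect/handshake progress */
static bool client_refresh(client_ctx *c) {
    if (c->state == ST_READY) return false;              /* nothing changed */
    if (c->state == ST_DISCONNECTED) {
        c->state = ST_CONNECTING;
        c->reconnects++;
    } else {
        c->state = ST_READY;                             /* connect completed */
    }
    return true;                                         /* state changed */
}
```

A plugin's main loop would call `client_refresh` once per iteration and branch on `client_ready` before touching the enrichment hot path, which is how recovery stays out of the hot path without a background thread.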
- High-level client call semantics:
  - Source: user decisions on 2026-03-12.
  - Decision:
    - if previous state was not `READY`, `call_xxx(ctx, ...)` must fail fast
    - if previous state was `READY`, the call may reconnect once and resend once after failure
    - duplicate requests are acceptable by contract for the high-level API
  - Implication:
    - high-level API is intentionally at-least-once, not exactly-once
    - low-level API remains the exact transport truth without automatic replay
- High-level convenience return shape:
- Source: user decisions on 2026-03-12.
- Decision:
- C convenience calls use
bool + out response - convenience APIs may return "no response" when unavailable instead of forcing a separate readiness branch at every call site
- keep both convenience APIs and strict checked APIs
- C convenience calls use
- Implication:
- hot-path call sites stay small
- C avoids heap-allocated response ownership traps
- detailed diagnostics remain available through the checked API and `status(ctx)`
- Client status model:
- Source: user decisions on 2026-03-12.
- Decision:
- expose a detailed public status snapshot, not only a boolean
- `refresh(ctx)` should return `changed` plus an optional updated snapshot
- public state model should include practical states such as: `DISCONNECTED`, `CONNECTING`, `READY`, `NOT_FOUND`, `AUTH_FAILED`, `INCOMPATIBLE`, `BROKEN`
- status should include reconnect counts and related operational counters
- Implication:
- plugins can adapt behavior cheaply via `ready(ctx)` and inspect richer state via `status(ctx)`
- transition callbacks/logging can later be layered over the same state model
- Advanced low-level throughput goal:
- Source: user clarification on 2026-03-12.
- Decision:
- one hot client must be able to drive more than one server worker
- the low-level API must support advanced modes beyond simple ping-pong, including event-loop integration via native wait objects (`fd` on POSIX, `HANDLE` on Windows)
- Implication:
- request-level parallelism matters, not just multi-client concurrency
- a session-sticky "one worker per client" managed-server model is insufficient for the advanced path
- Protocol direction change:
- Source: user clarification on 2026-03-12.
- Decision:
- the final library protocol cannot remain fixed at 64 bytes
- requests and responses must support variable payload sizes, including strings
- move to a fixed header plus variable payload design
- Implication:
- the current 64-byte `INCREMENT` frame is only a benchmark scaffold
- all transport helpers must evolve from exact fixed-size frame I/O to header+payload I/O
- Advanced-throughput v1 strategy:
- Source: user decision on 2026-03-12.
- Decision:
- prioritize ordered batch request/response support for v1
- defer general pipelining to a future release if ordered batch is sufficient
- negotiated connection limits should include both `max_batch_items` and `max_payload_bytes`
- Implication:
- v1 can target the main "one client needs more throughput" problem with lower complexity than full general pipeline semantics
- batch limits become part of handshake/compatibility policy
- Batch execution model:
- Source: user decision on 2026-03-12.
- Decision:
- the server may split one incoming batch into contiguous parts and hand each part to a worker
- responses are reassembled in the original request order before sending the batch response
- preference is for simple deterministic splitting over per-item worker pull/work-stealing for v1
- User rationale:
- request handling is expected to be mostly uniform memory-lookup work
- simple split also degenerates cleanly to the single-worker case, where one worker handles the full batch
- Implication:
- ordered batch responses remain deterministic
- more sophisticated per-item dynamic balancing can be deferred unless measurements show meaningful skew
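
The contiguous-split rule can be sketched as follows; `splitContiguous` is a hypothetical helper name, and note how `workers == 1` degenerates to one part covering the whole batch, matching the single-worker case above:

```go
package main

import "fmt"

// splitContiguous divides n batch items into at most workers contiguous
// parts. Responses are later reassembled by index, so the original
// request order is preserved deterministically.
func splitContiguous(n, workers int) [][2]int {
	if workers > n {
		workers = n
	}
	parts := make([][2]int, 0, workers)
	base, extra := n/workers, n%workers
	start := 0
	for w := 0; w < workers; w++ {
		size := base
		if w < extra {
			size++ // spread the remainder over the first parts
		}
		parts = append(parts, [2]int{start, start + size}) // [start, end)
		start += size
	}
	return parts
}

func main() {
	// 10 items over 3 workers -> three contiguous half-open ranges.
	fmt.Println(splitContiguous(10, 3)) // prints [[0 4] [4 7] [7 10]]
}
```

This simple deterministic split avoids per-item work-stealing while keeping worker loads within one item of each other for uniform lookup work.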
- Batch response semantics:
- Source: user decision on 2026-03-12.
- Decision:
- batch responses must carry per-item status/result, not only whole-batch success/failure
- Implication:
- one bad item in a batch does not poison the whole batch
- batch response envelopes must preserve item order and include status for each item
- Variable-length envelope and ordered homogeneous batches:
- Source: user decision on 2026-03-12.
- Decision:
- use one universal fixed header plus variable payload envelope
- one batch must contain requests of a single method only
- ordered homogeneous batch payloads correlate request/response items by array position
- Implication:
- single-request and batch messages share one outer protocol
- batch codecs stay strongly typed and service-specific
- Outer-envelope status semantics:
- Source: user clarification on 2026-03-12.
- Decision:
- the outer envelope status is transport/protocol status only
- it means things like "this message was received/validated/responded at the envelope level"
- it must not be used to represent per-item business outcomes such as "item not found"
- Implication:
- per-item or per-method outcomes stay inside the typed response content / batch item response entries
- outer status stays small, generic, and transport-oriented
- Batch body layout direction:
- Source: user clarification on 2026-03-12.
- Decision:
- batch items must be variable-size
- fixed-size negotiated slots are not acceptable because requests/responses may carry strings of very different lengths
- the active v1 direction is:
- outer envelope with `item_count`
- followed by an item directory of offsets and lengths
- followed by the packed aligned item payloads
- Implication:
- batch parsing is a two-step process: directory first, then payload slices
- item correlation remains by array position, while offsets/lengths provide flexible sizing
- Self-contained zero-allocation payload contract:
- Source: user clarification on 2026-03-12.
- Decision:
- each request/response payload must be self-contained
- the outer envelope/directory only identifies the payload byte range for each item
- payload decoders must be able to interpret the payload in place without heap allocation
- variable-length fields such as strings should be addressable through offsets inside the payload
- string data inside payloads should be NUL-terminated so decoders can expose direct pointers/slices to the underlying bytes
- Implication:
- typed decode helpers should behave like views over payload bytes, not deep-copy constructors
- each method payload may need its own internal header/offset table for variable-length members
- View lifetime contract:
- Source: user decision on 2026-03-12.
- Decision:
- decoded request/response views are highly ephemeral
- they are valid only within the current library call / current callback invocation
- callers must either use the view immediately or copy the needed data before the function/callback returns
- callers must assume the view becomes invalid at the next library call, and in practice treat it as invalid once the current function/callback returns
- Implication:
- the library is free to reuse internal buffers aggressively without preserving old decoded views
- high-level APIs must document this lifetime rule explicitly in C, Rust, and Go
- Per-method payload layout discipline:
- Source: user decision on 2026-03-12 (`15.A`).
- Decision:
- every method payload uses a small fixed method header for scalar fields plus a method-local offset/length directory for its variable-length members
- Implication:
- the outer envelope stays transport-oriented
- each method payload remains self-contained and decodable in place
- String field representation inside method payloads:
- Source: user decision on 2026-03-12 (`16.A`).
- Decision:
  - each string field is represented by `offset + length`
  - the pointed bytes must also end with `\0`
- Implication:
- C gets cheap direct pointer compatibility
- Rust and Go get O(1) slicing without scanning for the terminator
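
A minimal Go sketch of the `offset + length` plus trailing-`\0` contract; `strView` is a hypothetical helper name used for illustration:

```go
package main

import "fmt"

// strView resolves one offset + length string field inside a payload.
// The terminating NUL sits immediately after the length bytes, so C
// callers can take the pointer directly while Rust/Go slice in O(1)
// without scanning for the terminator.
func strView(payload []byte, off, length uint32) ([]byte, bool) {
	end := uint64(off) + uint64(length)
	if end >= uint64(len(payload)) { // need one extra byte for the NUL
		return nil, false
	}
	if payload[end] != 0 {
		return nil, false // contract: string bytes must end with \0
	}
	return payload[off:end], true
}

func main() {
	payload := []byte("xxhello\x00")
	if v, ok := strView(payload, 2, 5); ok {
		fmt.Println(string(v)) // prints "hello"
	}
}
```

The returned slice aliases the payload, so it inherits the ephemeral view lifetime described later in these notes.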
- Decoded payload/view lifetime model:
- Source: user decision on 2026-03-12 (`17.A`), later tightened by Decision #48.
- Decision:
- decoded request/response objects are non-owning views over the underlying message bytes
- Implication:
- zero-allocation decoding is the default contract
- ownership is caller-managed and requires explicit copying outside the library call/callback
- Ephemeral-view naming and documentation discipline:
- Source: user decision on 2026-03-12.
- Decision:
- public decoded request/response types must be named and documented so their ephemeral lifetime is impossible to miss
- type names, field names where appropriate, and comments must clearly signal that these are borrowed/view types, not durable owned objects
- API documentation must state that the data is valid only within the current library call / callback and must be copied immediately if it needs to outlive that scope
- Implication:
- naming should prefer explicit view semantics such as `...View`
- comments and docs must aggressively warn against retaining pointers/slices beyond the allowed lifetime
- Derived consequence of self-contained response payloads:
- Source: derived from Decisions #43, #45, #47, #49, #50, and #51 on 2026-03-12.
- Derived design:
- batch item descriptors should carry only `offset + length`
- per-item business/method outcome status belongs inside each response payload's self-contained method-local layout
- the outer envelope status remains transport/protocol-only
- Implication:
- batch request and batch response directories can use the same descriptor shape
- decoders only need the descriptor to locate each self-contained payload view
- Status: derived from the agreed design decisions above; not a new decision by itself.
- Outer message header (`v1`) should be a single fixed 32-byte envelope:
  - `magic: u32`
  - `version: u16`
  - `header_len: u16`
  - `kind: u16` (`control`, `request`, `response`)
  - `flags: u16` (at least `batch`)
  - `code: u16`
    - request/response: method id
    - control: control opcode
  - `transport_status: u16`
    - envelope-level / transport-level / protocol-level only
    - never business/item-level result
  - `payload_len: u32`: bytes after the outer header
  - `item_count: u32`: `1` for non-batch messages, `N` for batch messages
  - `message_id: u64`: request/response correlation id
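
The envelope above can be sketched as a round-trippable Go struct. The magic value and little-endian byte order are assumptions for illustration; only the field set and the 32-byte total come from the agreed design:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// outerHeader mirrors the fixed 32-byte v1 envelope described above.
type outerHeader struct {
	Magic           uint32
	Version         uint16
	HeaderLen       uint16
	Kind            uint16
	Flags           uint16
	Code            uint16
	TransportStatus uint16
	PayloadLen      uint32
	ItemCount       uint32
	MessageID       uint64
}

const outerHeaderSize = 32

func (h *outerHeader) encode(dst []byte) {
	le := binary.LittleEndian // byte order is an assumption, not decided here
	le.PutUint32(dst[0:], h.Magic)
	le.PutUint16(dst[4:], h.Version)
	le.PutUint16(dst[6:], h.HeaderLen)
	le.PutUint16(dst[8:], h.Kind)
	le.PutUint16(dst[10:], h.Flags)
	le.PutUint16(dst[12:], h.Code)
	le.PutUint16(dst[14:], h.TransportStatus)
	le.PutUint32(dst[16:], h.PayloadLen)
	le.PutUint32(dst[20:], h.ItemCount)
	le.PutUint64(dst[24:], h.MessageID)
}

func decodeOuterHeader(src []byte) outerHeader {
	le := binary.LittleEndian
	return outerHeader{
		Magic:           le.Uint32(src[0:]),
		Version:         le.Uint16(src[4:]),
		HeaderLen:       le.Uint16(src[6:]),
		Kind:            le.Uint16(src[8:]),
		Flags:           le.Uint16(src[10:]),
		Code:            le.Uint16(src[12:]),
		TransportStatus: le.Uint16(src[14:]),
		PayloadLen:      le.Uint32(src[16:]),
		ItemCount:       le.Uint32(src[20:]),
		MessageID:       le.Uint64(src[24:]),
	}
}

func main() {
	var buf [outerHeaderSize]byte
	h := outerHeader{Magic: 0x4E495043, Version: 1, HeaderLen: outerHeaderSize, ItemCount: 1, MessageID: 42}
	h.encode(buf[:])
	fmt.Printf("%+v\n", decodeOuterHeader(buf[:]))
}
```

The field sizes sum to exactly 32 bytes (4+2+2+2+2+2+2+4+4+8), so there is no implicit padding to negotiate between languages.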
- Batch item directory:
  - one fixed descriptor shape for both request and response batches: `offset: u32`, `length: u32`
  - offsets are relative to the start of the packed item payload area
  - request/response item correlation is by array position
- Batch message payload layout:
  - `[ item_ref[0] ... item_ref[N-1] ][ aligned item payload 0 ][ aligned item payload 1 ] ...`
  - request batch and response batch use the same outer batch layout
  - each item payload is self-contained and method-specific
- Single message payload layout:
- no directory table
- one self-contained method/control payload follows the outer header
- Method payload discipline:
  - each method payload has:
    - a small fixed method-local header
    - method-local scalar fields
    - method-local `offset + length` pairs for variable-length members
    - packed variable data area
  - outer envelope never knows inner method field layout
- String fields inside method payloads:
  - represented by `offset + length`
  - pointed bytes must also terminate with `\0`
  - decoders should expose borrowed views/pointers/slices, not owned copies
- Lifetime model:
- decoded request/response objects are ephemeral views only
- they are valid only within the current library call / callback
- Status: derived from Decisions #35-#38, #47-#56; intended as the concrete direction for the next implementation phase.
- High-level API split:
  - low-level:
    - transport/session primitives
    - raw envelope send/receive
    - method payload encode/decode to ephemeral `...View` types
  - high-level:
    - fixed per-service client context
    - fixed per-service managed server registration
    - zero-copy request/response handling via callbacks
    - optional explicit copy/materialize helpers for callers that need ownership
- High-level client context model (all languages):
  - one context per service, for example `ip_to_asn`
  - created once
  - refreshed periodically by the caller
  - cheap `ready()` predicate
  - detailed `status()` snapshot
  - zero-copy call path is callback-based
- C shape (derived):
  - owned request input struct, for example: `struct netipc_ip_to_asn_req { const char *ip_text; uint32_t ip_text_len; };`
  - ephemeral response view type: `struct netipc_ip_to_asn_resp_view { ... netipc_str_view as_name_view; ... };`
  - client context: `netipc_ip_to_asn_client_t *`
  - zero-copy call: `bool netipc_ip_to_asn_call_view(..., netipc_ip_to_asn_resp_view_cb cb, void *user);`
  - optional explicit copy helper:
    - copy a response view into a caller-owned output struct
- Rust shape (derived):
  - owned/borrowed request input struct appropriate for encoding, for example: `IpToAsnRequest<'a> { ip_text: &'a str, ... }`
  - ephemeral response view type: `IpToAsnResponseView<'a> { ... as_name_view: StrView<'a>, ... }`
  - client context: `IpToAsnClient`
  - zero-copy call: `call_ip_to_asn_view(&mut self, req, timeout, |view| { ... })`
  - optional explicit materialize helper:
    - convert/copy a view into an owned response struct
- Go shape (derived):
  - owned request input struct for encoding, for example: `type IPToASNRequest struct { IPText string }`
  - ephemeral response view type: `type IPToASNResponseView struct { ... ASNameView CStringView ... }`
  - client context: `type IPToASNClient struct { ... }`
  - zero-copy call: `CallIPToASNView(req, timeout, func(view IPToASNResponseView) bool)`
  - optional explicit materialize helper:
    - copy a view into an owned Go response struct
- Managed server callback shape (derived):
  - request side:
    - callback receives an ephemeral decoded request view
    - request view is valid only during the callback
  - response side:
    - callback writes the response through a method-specific response builder backed by library-managed scratch storage
    - this avoids heap allocation while still allowing variable-length response fields
  - C example shape: `bool netipc_ip_to_asn_handler(void *user, const netipc_ip_to_asn_req_view *req_view, netipc_ip_to_asn_resp_builder_t *resp_builder);`
  - Rust example shape: `FnMut(&IpToAsnRequestView<'_>, &mut IpToAsnResponseBuilder<'_>) -> bool`
  - Go example shape: `func(reqView IPToASNRequestView, resp *IPToASNResponseBuilder) bool`
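
The callback/builder interaction can be sketched in Go; all types, method names, and result codes below are hypothetical stand-ins for the generated per-method types, not the final API:

```go
package main

import "fmt"

// ipToAsnRequestView is a borrowed, callback-scoped stand-in for the
// decoded request view; it aliases library-owned bytes.
type ipToAsnRequestView struct{ ipText []byte }

// ipToAsnResponseBuilder is an illustrative builder backed by a
// library-managed scratch buffer that is reused across calls.
type ipToAsnResponseBuilder struct {
	resultCode uint16
	asn        uint32
	scratch    []byte
}

func (b *ipToAsnResponseBuilder) setResultCode(c uint16) { b.resultCode = c }
func (b *ipToAsnResponseBuilder) setASN(a uint32)        { b.asn = a }

// setASName appends the string bytes plus the required trailing NUL into
// scratch; offset/length bookkeeping stays inside the builder.
func (b *ipToAsnResponseBuilder) setASName(s string) {
	b.scratch = append(b.scratch, s...)
	b.scratch = append(b.scratch, 0)
}

// handler matches the agreed shape: read the ephemeral view, write the
// response through the builder, never retain the view past the call.
func handler(req ipToAsnRequestView, resp *ipToAsnResponseBuilder) bool {
	if len(req.ipText) == 0 {
		resp.setResultCode(2) // hypothetical "invalid input" code
		return true
	}
	resp.setResultCode(0) // hypothetical "found" code
	resp.setASN(64496)
	resp.setASName("EXAMPLE-AS")
	return true
}

func main() {
	var b ipToAsnResponseBuilder
	handler(ipToAsnRequestView{ipText: []byte("192.0.2.1")}, &b)
	fmt.Println(b.resultCode, b.asn) // prints 0 64496
}
```

The handler touches only semantic fields; wire-layout details (offsets, NUL terminators, padding) stay inside the builder, matching the builder discipline below.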
- Naming/documentation rule:
  - decoded borrowed types should always contain `View` in the public type name
  - docs/comments must explicitly state:
    - borrowed
    - ephemeral
    - valid only during current library call / callback
    - copy immediately if needed later
- Status: derived example only; intended to prove the current design is usable end-to-end.
- Service identity:
  - `service_namespace = /run/netdata` on POSIX
  - `service_name = ip-to-asn`
- Semantic contract:
  - request:
    - caller asks for ASN enrichment for one textual IP address
  - response:
    - returns method/business result code
    - if found, returns ASN and zero-copy string views for metadata
- Method id:
  - fixed service/method-specific code in the outer envelope `code` field
- Request payload example (`ip_to_asn`):
  - method-local fixed header:
    - `layout_version: u16`
    - `flags: u16`
    - `ip_text_off: u32`
    - `ip_text_len: u32`
  - packed variable data: `ip_text` bytes followed by `\0`
  - total request payload is self-contained and decodable in place
- Response payload example (`ip_to_asn`):
  - method-local fixed header:
    - `layout_version: u16`
    - `result_code: u16`
      - method/business result, for example: found, not found, invalid input
    - `asn: u32`
    - `as_name_off: u32`
    - `as_name_len: u32`
    - `cc_off: u32`
    - `cc_len: u32`
  - packed variable data: `as_name` bytes followed by `\0`, then `cc` bytes followed by `\0`
  - response payload remains self-contained and decodable in place
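
A hedged Go sketch of decoding this response payload into a non-owning view; names, the 24-byte header size derived from the fields above, and little-endian order are illustrative assumptions:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ipToAsnRespView is a non-owning view: both byte slices alias the
// payload and are valid only for the current call, per the lifetime rules.
type ipToAsnRespView struct {
	ResultCode uint16
	ASN        uint32
	ASName     []byte
	CC         []byte
}

// decodeIPToASNRespView interprets the payload in place, with no copies
// of the string data, applying the bounds and trailing-NUL checks.
func decodeIPToASNRespView(p []byte) (ipToAsnRespView, bool) {
	const hdr = 24
	if len(p) < hdr {
		return ipToAsnRespView{}, false // shorter than the fixed header
	}
	le := binary.LittleEndian
	view := ipToAsnRespView{
		ResultCode: le.Uint16(p[2:]),
		ASN:        le.Uint32(p[4:]),
	}
	slice := func(off, ln uint32) ([]byte, bool) {
		end := uint64(off) + uint64(ln)
		if end >= uint64(len(p)) || p[end] != 0 {
			return nil, false // out of bounds or missing trailing \0
		}
		return p[off:end], true
	}
	var ok bool
	if view.ASName, ok = slice(le.Uint32(p[8:]), le.Uint32(p[12:])); !ok {
		return ipToAsnRespView{}, false
	}
	if view.CC, ok = slice(le.Uint32(p[16:]), le.Uint32(p[20:])); !ok {
		return ipToAsnRespView{}, false
	}
	return view, true
}

// buildSample assembles one well-formed response payload for the demo.
func buildSample() []byte {
	p := make([]byte, 24+10+1+2+1)
	le := binary.LittleEndian
	le.PutUint16(p[0:], 1)     // layout_version
	le.PutUint16(p[2:], 0)     // result_code: found (hypothetical code)
	le.PutUint32(p[4:], 64496) // asn
	le.PutUint32(p[8:], 24)    // as_name_off
	le.PutUint32(p[12:], 10)   // as_name_len
	le.PutUint32(p[16:], 35)   // cc_off
	le.PutUint32(p[20:], 2)    // cc_len
	copy(p[24:], "EXAMPLE-AS") // trailing \0 already zeroed by make
	copy(p[35:], "GR")
	return p
}

func main() {
	v, ok := decodeIPToASNRespView(buildSample())
	fmt.Println(ok, v.ASN, string(v.ASName), string(v.CC)) // prints true 64496 EXAMPLE-AS GR
}
```

The decoder behaves as a view over the payload bytes, not a deep-copy constructor, so the caller must copy `ASName`/`CC` before the view's scope ends.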
- C shape (derived example):
  - request input: `struct netipc_ip_to_asn_req { const char *ip_text; uint32_t ip_text_len; };`
  - string view: `struct netipc_str_view { const char *ptr; uint32_t len; };`
  - response view: `struct netipc_ip_to_asn_resp_view { uint16_t result_code; uint32_t asn; struct netipc_str_view as_name_view; struct netipc_str_view cc_view; };`
  - zero-copy client call: `bool netipc_ip_to_asn_call_view(..., netipc_ip_to_asn_resp_view_cb cb, void *user);`
  - managed server callback: `bool netipc_ip_to_asn_handler(void *user, const struct netipc_ip_to_asn_req_view *req_view, netipc_ip_to_asn_resp_builder_t *resp_builder);`
  - copy helper: explicit helper copies a response view into a caller-owned durable output struct
- Rust shape (derived example):
  - request input: `pub struct IpToAsnRequest<'a> { pub ip_text: &'a str }`
  - borrowed string view: `pub struct StrView<'a> { pub bytes: &'a [u8] }`
  - response view: `pub struct IpToAsnResponseView<'a> { pub result_code: u16, pub asn: u32, pub as_name_view: StrView<'a>, pub cc_view: StrView<'a> }`
  - zero-copy client call: `call_ip_to_asn_view(&mut self, req, timeout, |view| { ... })`
  - managed server callback: `FnMut(&IpToAsnRequestView<'_>, &mut IpToAsnResponseBuilder<'_>) -> bool`
  - explicit materialize helper: copy a borrowed response view into an owned Rust response type
- Go shape (derived example):
  - request input: `type IPToASNRequest struct { IPText string }`
  - borrowed string wrapper: `type CStringView struct { ... }`
  - response view: `type IPToASNResponseView struct { ResultCode uint16; ASN uint32; ASNameView CStringView; CCView CStringView }`
  - zero-copy client call: `CallIPToASNView(req, timeout, func(view IPToASNResponseView) bool)`
  - managed server callback: `func(reqView IPToASNRequestView, resp *IPToASNResponseBuilder) bool`
  - explicit materialize helper: copy a borrowed response view into an owned Go response struct
- Builder discipline (all languages):
  - response builders should write into library-managed scratch/output buffers
  - handlers should set scalars and append string fields through builder methods
  - builder must guarantee:
    - offset/length bookkeeping
    - trailing `\0` for string fields
    - alignment/padding of packed variable data
- View lifetime reminder:
- request and response views in this example are valid only during the current callback / library call
- callers must not retain pointers/slices/borrowed wrappers after that scope ends
- Status: derived from the agreed API and lifetime rules; intended to keep naming and helper behavior consistent across C, Rust, and Go.
- Naming discipline:
  - owned encode input types do not use `View`
  - decoded borrowed request/response types always use `View`
  - response builders always use `Builder`
  - explicit copy helpers should use verbs like `copy`, `materialize`, `to_owned`
- C codec/helper naming (derived):
  - encode owned request input into payload bytes: `netipc_ip_to_asn_req_encode(...)`
  - decode request payload into ephemeral request view: `netipc_ip_to_asn_req_decode_view(...)`
  - decode response payload into ephemeral response view: `netipc_ip_to_asn_resp_decode_view(...)`
  - initialize/reset response builder: `netipc_ip_to_asn_resp_builder_init(...)`, `netipc_ip_to_asn_resp_builder_reset(...)`
  - set scalar response fields: `netipc_ip_to_asn_resp_builder_set_result_code(...)`, `netipc_ip_to_asn_resp_builder_set_asn(...)`
  - append string response fields: `netipc_ip_to_asn_resp_builder_set_as_name(...)`, `netipc_ip_to_asn_resp_builder_set_cc(...)`
  - finalize builder into payload bytes: `netipc_ip_to_asn_resp_builder_finish(...)`
  - explicit copy/materialize helper: `netipc_ip_to_asn_resp_view_copy(...)`
- Rust codec/helper naming (derived):
  - encode owned/borrowed request input into payload bytes: `encode_ip_to_asn_request(...)`
  - decode request payload into ephemeral request view: `decode_ip_to_asn_request_view(...)`
  - decode response payload into ephemeral response view: `decode_ip_to_asn_response_view(...)`
  - response builder type and methods: `IpToAsnResponseBuilder` with `set_result_code(...)`, `set_asn(...)`, `set_as_name(...)`, `set_cc(...)`, `finish(...)`
  - explicit materialize helper: `IpToAsnResponseView::to_owned()`
- Go codec/helper naming (derived):
  - encode owned request input into payload bytes: `EncodeIPToASNRequest(...)`
  - decode request payload into ephemeral request view: `DecodeIPToASNRequestView(...)`
  - decode response payload into ephemeral response view: `DecodeIPToASNResponseView(...)`
  - response builder type and methods: `IPToASNResponseBuilder` with `SetResultCode(...)`, `SetASN(...)`, `SetASName(...)`, `SetCC(...)`, `Finish(...)`
  - explicit materialize helper: `func (v IPToASNResponseView) ToOwned() IPToASNResponse`
- Builder behavior rules:
  - builder methods must never expose raw offset bookkeeping to handlers
  - builder owns:
    - packed variable-data placement
    - offset/length assignment
    - trailing `\0` insertion for strings
    - alignment/padding
  - `finish(...)` returns one self-contained method payload
  - handlers should only set semantic fields, not wire-layout details
- Validation rules for decode helpers:
  - reject out-of-bounds offsets/lengths
  - reject overlapping/invalid field regions when the method schema forbids them
  - reject string fields missing the required trailing `\0`
  - reject payloads shorter than the fixed method-local header
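
The four rules can be sketched as one generic checker; the `fieldRegion` schema representation and `validateRegions` name are hypothetical illustrations:

```go
package main

import "fmt"

// fieldRegion describes one variable-length member of a method payload.
type fieldRegion struct {
	Off, Len uint32
	IsString bool // string fields must carry a trailing \0 after Len bytes
}

// validateRegions applies the decode-helper rules listed above:
// minimum fixed-header size, in-bounds offsets/lengths, the required
// trailing NUL for strings, and (when the schema forbids it) overlap.
func validateRegions(payload []byte, fixedHeader uint32, fields []fieldRegion, forbidOverlap bool) bool {
	if uint32(len(payload)) < fixedHeader {
		return false // shorter than the fixed method-local header
	}
	for i, f := range fields {
		end := uint64(f.Off) + uint64(f.Len)
		if f.IsString {
			end++ // account for the trailing \0
		}
		if end > uint64(len(payload)) {
			return false // out-of-bounds offset/length
		}
		if f.IsString && payload[f.Off+f.Len] != 0 {
			return false // missing required trailing \0
		}
		if forbidOverlap {
			for _, g := range fields[:i] {
				if f.Off < g.Off+g.Len && g.Off < f.Off+f.Len {
					return false // overlapping field regions
				}
			}
		}
	}
	return true
}

func main() {
	p := []byte{0, 0, 0, 0, 'a', 'b', 0}
	fmt.Println(validateRegions(p, 4, []fieldRegion{{Off: 4, Len: 2, IsString: true}}, true)) // prints true
}
```

Running these checks once at decode time lets the `...View` accessors stay bounds-check-free on the hot path.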
- Goal:
- replace the current fixed 64-byte benchmark protocol and typed-return transport APIs with the agreed variable-length, self-contained, zero-allocation view model
- freeze the directional handshake contract before more implementation continues
- deliver the project in strict phases, with each phase tested and documented before the next begins
- validate the first snapshot/cache-backed service with a fake producer and dummy data inside this repository before touching the Netdata repository
- Critical implementation constraint (fact-based):
- the current SHM transports are single-slot and fixed-frame:
- POSIX SHM region still stores one request frame and one response frame
- Windows SHM region still stores one request frame and one response frame
- implication:
- the current SHM design cannot carry the new variable-length ordered-batch protocol without a separate redesign
- therefore:
  - phase 1 of this protocol/API rewrite should target the multiplexable baseline transports first: POSIX `UDS_SEQPACKET` and Windows `Named Pipe`
  - SHM redesign should follow as a separate phase after the new envelope/method API is stable
- Phase 1: directional handshake contract rewrite (all languages)
  - Replace the current symmetric hello / hello-ack contract with the approved directional model:
    - request direction: `max_request_payload_bytes`, `max_request_batch_items`
    - response direction: `max_response_payload_bytes`, `max_response_batch_items`
    - batch bytes derived per direction
  - Update:
    - C: `src/libnetdata/netipc/include/netipc/netipc_schema.h`, `src/libnetdata/netipc/src/protocol/netipc_schema.c`
    - Rust: `src/crates/netipc/src/protocol.rs`, `src/crates/netipc/src/lib.rs`
    - Go: `src/go/pkg/netipc/protocol/frame.go`
  - Add protocol tests for:
    - encode/decode roundtrips
    - negotiation success/failure
    - limit mismatch rejection
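
One plausible shape for the directional limits and a mismatch-rejecting negotiation, sketched in Go; the element-wise-minimum policy and all names here are illustrations, not the finalized contract:

```go
package main

import "fmt"

// directionalLimits carries the per-direction ceilings advertised in the
// directional hello / hello-ack exchange.
type directionalLimits struct {
	MaxRequestPayloadBytes  uint32
	MaxRequestBatchItems    uint32
	MaxResponsePayloadBytes uint32
	MaxResponseBatchItems   uint32
}

func min32(a, b uint32) uint32 { if a < b { return a }; return b }

// negotiate takes the element-wise minimum of both sides' advertised
// limits; a zero in any resulting field is treated as a limit mismatch
// and rejected. This policy is one reasonable reading of "limit
// mismatch rejection", not the agreed rule.
func negotiate(client, server directionalLimits) (directionalLimits, bool) {
	out := directionalLimits{
		MaxRequestPayloadBytes:  min32(client.MaxRequestPayloadBytes, server.MaxRequestPayloadBytes),
		MaxRequestBatchItems:    min32(client.MaxRequestBatchItems, server.MaxRequestBatchItems),
		MaxResponsePayloadBytes: min32(client.MaxResponsePayloadBytes, server.MaxResponsePayloadBytes),
		MaxResponseBatchItems:   min32(client.MaxResponseBatchItems, server.MaxResponseBatchItems),
	}
	ok := out.MaxRequestPayloadBytes > 0 && out.MaxResponsePayloadBytes > 0 &&
		out.MaxRequestBatchItems > 0 && out.MaxResponseBatchItems > 0
	return out, ok
}

func main() {
	got, ok := negotiate(
		directionalLimits{1024, 64, 4096, 64},
		directionalLimits{2048, 32, 2048, 32},
	)
	fmt.Println(ok, got) // prints true {1024 32 2048 32}
}
```

Keeping the two directions independent lets a small-request/large-response service (like a snapshot provider) advertise asymmetric ceilings instead of one symmetric size.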
- Phase 2: baseline transport migration to the new protocol core
  - Migrate baseline transports first: POSIX `UDS_SEQPACKET` and Windows `Named Pipe`
  - Replace fixed-frame / `increment`-shaped paths with generic variable-length message handling
  - Execution detail for this phase:
    - first add generic message send/receive/call primitives on the baseline transports
    - keep the existing fixed-frame `increment` helpers as compatibility wrappers on top during migration
    - do not force SHM onto the new generic message path in this phase; SHM stays on the fixed-frame compatibility path until its dedicated redesign phase
  - Preserve:
    - control handshake
    - single-message request/response
    - ordered homogeneous batch request/response
  - Target files:
    - POSIX C: `src/libnetdata/netipc/include/netipc/netipc_uds_seqpacket.h`, `src/libnetdata/netipc/src/transport/posix/netipc_uds_seqpacket.c`
    - Windows C: `src/libnetdata/netipc/include/netipc/netipc_named_pipe.h`, `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
    - Rust: `src/crates/netipc/src/transport/posix.rs`, `src/crates/netipc/src/transport/windows.rs`
    - Go: `src/go/pkg/netipc/transport/posix/seqpacket.go`, `src/go/pkg/netipc/transport/windows/pipe.go`
- Phase 3: first real method family and helper foundation
  - Implement the first real method family as the approved fake `cgroups`-style snapshot service
  - Introduce:
    - owned request encode input
    - ephemeral request/response view decode helpers
    - response/snapshot builders
    - explicit copy/materialize helpers
    - Go borrowed string wrapper type
  - Keep the service contract cleaned, not a direct copy of current Netdata SHM structs
- Phase 4: fake producer + cache-backed helper layer
  - Build a fake producer with dummy cgroup/container records inside this repository
  - Build the higher-level cache-backed helper layer strictly on top of the generic client/service core
  - Validate flows like `refresh_cgroups()` and `lookup_cgroup()`
  - No Netdata-repo integration in this phase
- Phase 5: full tests and interop for the fake snapshot service
  - Protocol tests:
    - handshake
    - envelope validation
    - directional limit enforcement
  - Method tests:
    - snapshot payload decode
    - builder correctness
    - lifetime/documentation-sensitive behavior
  - Interop tests:
    - C <-> Rust <-> Go for the fake snapshot service
  - Reliability tests:
    - reconnect after previously-READY failure
    - refresh/cache rebuild correctness
    - malformed offsets/lengths rejection
- Phase 6: performance coverage on baseline transports
  - Benchmark:
    - ping-pong baseline
    - ordered-batch request/response
    - snapshot refresh path
    - local cache lookup hot path
  - Validate:
    - throughput
    - p50/p95/p99
    - correctness
    - CPU impact
- Phase 7: SHM redesign for the real protocol
- Redesign POSIX and Windows SHM paths to support the real variable-length protocol and snapshot publication model
- Then run the same fake snapshot/cache service and performance coverage on SHM
- Phase 8: real Netdata integration
  - Only after the fake producer/service is fully validated in this repository
  - Add adaptation from current Netdata producer/consumer data shapes to the cleaned `plugin-ipc` service contract
- Phase 9: documentation rewrite
  - Update:
    - architecture doc
    - protocol spec
    - client/server developer guide
    - cache-backed helper usage
    - lifetime warnings for all `...View` types
  - Explicitly document:
    - views are ephemeral
    - copy now or lose data
    - transport status vs method/business result
    - directional request/response handshake limits
    - negotiated payload/batch limits
- Migration order recommendation:
- first make the protocol/method layer independent of transports
- then migrate baseline transports
- then layer high-level client/server APIs on top
- then add ordered-batch benchmarks
- then revisit SHM redesign
- Expected non-goals for this phase:
- no attempt to preserve the old fixed 64-byte wire format as the long-term protocol
- no attempt to force the current single-slot SHM transports to carry the new protocol unchanged
- no hidden allocation-based fallback masquerading as the default zero-copy API
- Current status:
- Phase 1 is now implemented for the protocol core and all current live handshake paths
- the protocol core no longer uses the old symmetric hello / hello-ack sizing contract
- the live POSIX UDS and Windows named-pipe transport handshakes now embed the approved directional hello / hello-ack payloads instead of the old ad-hoc symmetric negotiation fields
- legacy fixed-frame `increment` encode/decode APIs are still kept in place for compatibility during migration
- Landed in Phase 1:
- protocol-core support for the new outer message envelope in C, Rust, and Go
- support for:
  - fixed 32-byte outer header
  - `offset + length` batch item refs
  - directional hello / hello-ack payloads
  - directional request/response payload and batch-item ceilings
  - negotiated single-payload default constant of `1024`
- POSIX UDS live handshake rewritten to reuse the directional protocol payloads
- Windows named-pipe live handshake rewritten to reuse the same directional protocol payloads in C, Rust, and Go
- Go raw-hello negative-test driver updated to speak the same directional handshake
- Files changed in this phase so far:
- C:
  - `src/libnetdata/netipc/include/netipc/netipc_schema.h`
  - `src/libnetdata/netipc/src/protocol/netipc_schema.c`
  - `src/libnetdata/netipc/src/transport/posix/netipc_uds_seqpacket.c`
  - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
- Rust:
  - `src/crates/netipc/src/protocol.rs`
  - `src/crates/netipc/src/lib.rs`
  - `src/crates/netipc/src/transport/posix.rs`
  - `src/crates/netipc/src/transport/windows.rs`
- Go:
  - `src/go/pkg/netipc/protocol/frame.go`
  - `src/go/pkg/netipc/protocol/frame_test.go`
  - `src/go/pkg/netipc/transport/posix/seqpacket.go`
  - `src/go/pkg/netipc/transport/posix/seqpacket_unix.go`
  - `src/go/pkg/netipc/transport/windows/pipe.go`
  - `bench/drivers/go/main.go`
- Validation completed after the directional handshake rewrite:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - `go test ./...` from `src/go`
  - `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
  - `cmake --build build`
  - `/usr/bin/ctest --test-dir build --output-on-failure`
  - on `win11`:
    - `PATH=/c/Users/costa/.cargo/bin:$PATH cargo test --manifest-path src/crates/netipc/Cargo.toml`
    - `go test ./...` from `src/go`
    - `bash tests/smoke-win.sh` with result `32 passed, 0 failed`
- Important remaining gap after Phase 1:
- baseline transports still expose the old fixed-frame / `increment`-specific public API
- Phase 2/3 migration is still needed before the public library surface matches the new `...View` / callback-based design
- Current blocker status:
- there is no longer an unresolved blocker for continuing autonomously into Phase 2
- the first real service family is already approved as the fake cleaned `cgroups` snapshot service, so the next implementation work can proceed according to the phase plan
- Current status:
- Phase 2 now has the first end-to-end baseline-transport slice implemented and validated
- the protocol core has one authoritative derived-size calculation for:
- aligned item payload bytes
- maximum ordered-batch payload bytes
- maximum total message bytes
- baseline transports now also carry negotiated directional message ceilings and expose generic variable-message primitives on the baseline paths
- Landed in this Phase 2 slice:
- C protocol helpers added in:
  - `src/libnetdata/netipc/include/netipc/netipc_schema.h`
  - `src/libnetdata/netipc/src/protocol/netipc_schema.c`
- C baseline transport generic message APIs added in:
  - `src/libnetdata/netipc/include/netipc/netipc_uds_seqpacket.h`
  - `src/libnetdata/netipc/include/netipc/netipc_named_pipe.h`
  - `src/libnetdata/netipc/src/transport/posix/netipc_uds_seqpacket.c`
  - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
- Rust protocol helpers and tests added in:
  - `src/crates/netipc/src/protocol.rs`
- Rust baseline transport generic message APIs added in:
  - `src/crates/netipc/src/transport/posix.rs`
  - `src/crates/netipc/src/transport/windows.rs`
- Go protocol helpers and tests added in:
  - `src/go/pkg/netipc/protocol/frame.go`
  - `src/go/pkg/netipc/protocol/frame_test.go`
- Go baseline transport generic message APIs added in:
  - `src/go/pkg/netipc/transport/posix/seqpacket.go`
  - `src/go/pkg/netipc/transport/windows/pipe.go`
- Why this landed first:
- baseline transports need one shared formula for directional receive/send buffer ceilings
- adding it first avoided C/Rust/Go drift while the generic variable-message transport primitives were wired in
- the baseline transports needed negotiated directional message ceilings stored in transport contexts before any real method migration could start
- Validation completed for this slice:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - `go test ./...` from `src/go`
  - `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
  - `cmake --build build`
  - `/usr/bin/ctest --test-dir build --output-on-failure`
- Remaining Phase 2 work after this slice:
- generic message APIs exist, but no real variable-length service/method is using them yet
- the legacy fixed-frame `increment` helpers are intentionally still raw fixed-frame compatibility paths on the baseline transports
- the next practical Phase 2/3 step is to implement the first real method family on top of these generic message APIs
- SHM still remains on the old fixed-frame compatibility path until its dedicated redesign phase
- Important limitation of this slice:
- generic variable-message transport support is now present on the baseline transports, but the existing `increment` fixtures and interop tests still exercise the legacy fixed-frame path
- this is intentional for compatibility during migration; the first real validation of the new generic message path will happen with the approved fake cleaned `cgroups` snapshot service
- Current status:
- the first real variable-length method family is now implemented at the shared protocol/codec layer in C, Rust, and Go
- this first real method family is the approved fake cleaned `cgroups` snapshot service, using the real semantic fields the current `ebpf` consumer depends on:
  - `hash`
  - `name`
  - `options`
  - `enabled`
  - `path`
  - snapshot-level `systemd_enabled`
- the generic message path is now exercised by the file-based cross-language interop fixtures, not only by per-language unit tests
- Landed in this Phase 3 slice:
- method id and wire layout constants for the fake cleaned `cgroups` snapshot service added in:
  - `src/libnetdata/netipc/include/netipc/netipc_schema.h`
  - `src/libnetdata/netipc/src/protocol/netipc_schema.c`
  - `src/crates/netipc/src/protocol.rs`
  - `src/go/pkg/netipc/protocol/frame.go`
- zero-allocation decode/view support added for:
  - snapshot request views
  - snapshot response views
  - per-item `name`/`path` string views
- response builders added for the ordered snapshot response in:
  - `src/libnetdata/netipc/src/protocol/netipc_schema.c`
  - `src/crates/netipc/src/protocol.rs`
  - `src/go/pkg/netipc/protocol/frame.go`
- Rust root exports updated in:
  - `src/crates/netipc/src/lib.rs`
- file-based codec fixtures upgraded to speak the new generic message path in:
  - `tests/fixtures/c/netipc_codec_tool.c`
  - `tests/fixtures/rust/src/bin/netipc_codec_rs.rs`
  - `tests/fixtures/go/main.go`
- cross-language schema interop upgraded in:
  - `tests/run-interop.sh`
- Validation completed for this slice:
- `cargo test --manifest-path src/crates/netipc/Cargo.toml`
- `go test ./...` from `src/go`
- `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
- `cargo test --manifest-path tests/fixtures/rust/Cargo.toml --bin netipc-codec-rs`
- `cmake --build build`
- `/usr/bin/ctest --test-dir build --output-on-failure`
- What this slice proves:
- the first real generic-message method family now round-trips in all three languages
- the first real generic-message method family now interoperates across C, Rust, and Go through the file-based fixtures
- the fake cleaned `cgroups` snapshot service contract is no longer just a TODO design; it now exists in executable codec form
- Remaining work after this slice:
- baseline transport wrappers still do not expose high-level `cgroups` snapshot helpers
- there is not yet a fake producer process or cache-backed client helper
- snapshot refresh / lookup flow is still not exercised end-to-end over the baseline transports
- Current status:
- the baseline transport generic-message APIs now carry the negotiated directional request/response ceilings all the way from config to the live client/server contexts
- this removed the last hardcoded single-item/default-size assumption from the baseline handshake/config layer before the first fake snapshot service uses multi-item responses
- Landed in this slice:
- directional request/response payload and batch-item limits added to:
  - `src/libnetdata/netipc/include/netipc/netipc_uds_seqpacket.h`
  - `src/libnetdata/netipc/include/netipc/netipc_named_pipe.h`
  - `src/crates/netipc/src/transport/posix.rs`
  - `src/crates/netipc/src/transport/windows.rs`
  - `src/go/pkg/netipc/transport/posix/seqpacket.go`
  - `src/go/pkg/netipc/transport/windows/pipe.go`
- live baseline handshakes now negotiate against those configured directional ceilings instead of old hardcoded defaults in:
  - `src/libnetdata/netipc/src/transport/posix/netipc_uds_seqpacket.c`
  - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
  - `src/crates/netipc/src/transport/posix.rs`
  - `src/crates/netipc/src/transport/windows.rs`
  - `src/go/pkg/netipc/transport/posix/seqpacket.go`
  - `src/go/pkg/netipc/transport/windows/pipe.go`
- Validation completed for this slice:
- `cargo test --manifest-path src/crates/netipc/Cargo.toml`
- `go test ./...` from `src/go`
- `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
- `cmake --build build`
- Important constraint exposed by this slice:
- the transport clients are still eager-connect objects
- the approved `initialize()`/`refresh()` lifecycle for cache-backed services must therefore live in the higher-level helper layer, not in the baseline transport client types themselves
- Next implementation target after this slice:
- add the first cache-backed helper on top of the generic baseline message APIs
- add the first fake producer process that serves the cleaned `cgroups` snapshot service end-to-end over the baseline transports
- SHM still remains on the legacy compatibility path until its dedicated redesign phase
- Current status:
- the first transport-backed fake `cgroups` snapshot/cache slice now exists in C on the baseline POSIX and Windows transport paths
- this is the first end-to-end refresh/cache flow that uses:
  - the real generic message transport API
  - the cleaned fake `cgroups` snapshot method family
  - a higher-level cache-backed helper layered above the baseline transport client
- Landed in this slice:
- high-level cache-backed C helper added in:
  - `src/libnetdata/netipc/include/netipc/netipc_cgroups_snapshot.h`
  - `src/libnetdata/netipc/src/service/netipc_cgroups_snapshot.c`
- matching cache-backed Rust helper added in:
  - `src/crates/netipc/src/service/mod.rs`
  - `src/crates/netipc/src/service/cgroups_snapshot.rs`
- matching cache-backed Go helper added in:
  - `src/go/pkg/netipc/service/cgroupssnapshot/client.go`
  - `src/go/pkg/netipc/service/cgroupssnapshot/client_unix.go`
  - `src/go/pkg/netipc/service/cgroupssnapshot/client_windows.go`
- fake live producer/consumer fixture added in:
  - `tests/fixtures/c/netipc_cgroups_live.c`
- baseline live smoke workflows added in:
  - `tests/run-live-cgroups-baseline.sh`
  - `tests/run-live-cgroups-win.sh`
- CMake wiring added for:
  - fixture target `netipc-cgroups-live-c`
  - POSIX `ctest` workflow registration for `netipc-live-cgroups-baseline`
  - Windows `ctest` workflow registration for `netipc-live-cgroups-win`
- What this slice validates:
- lazy high-level helper lifecycle on top of the eager baseline transport client
- full-snapshot refresh using the generic message path
- local cache rebuild and lookup by `hash + name`
- fake producer behavior suitable for later rehearsal against the real `cgroups -> ebpf` replacement path
- the same refresh/cache helper methodology now exists in all three library languages:
- C
- Rust
- Go
- Important bug exposed and fixed during this slice:
- the fake live server initially allocated its request buffer only for the actual 36-byte single request
- the generic baseline transport API requires receive buffers to be sized for the full negotiated maximum message length
- the fixture now allocates request/response buffers from the negotiated service limits instead of hard-coded message sizes
- the Windows named-pipe transport still referenced removed fixed-frame helpers (`read_pipe_frame()`/`write_pipe_frame()`) after the generic-message migration
- compatibility wrappers were restored on top of `read_pipe_message()`/`write_pipe_message()` in `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
- the first Windows smoke attempt also exposed two workflow-only issues in `tests/run-live-cgroups-win.sh`:
  - native Windows output arrived with `\r\n`, so the captured client output needed `\r` normalization before assertions
  - the loop refresh path prints `REFRESHES` before `CGROUPS_CACHE`, so the smoke assertions now validate exact required lines by presence instead of assuming fixed line order
- Validation completed for this slice:
- `cmake --build build --target netipc-cgroups-live-c`
- `bash tests/run-live-cgroups-baseline.sh`
- `cargo test --manifest-path src/crates/netipc/Cargo.toml`
- `cargo check --manifest-path src/crates/netipc/Cargo.toml --target x86_64-pc-windows-gnu`
- `go test ./...` from `src/go`
- `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
- `/usr/bin/ctest --test-dir build --output-on-failure`
- on `win11`: `bash tests/run-live-cgroups-win.sh`
  - result: `8 passed, 0 failed`
  - matrix covered:
    - `c-native` server/client
    - `c-native` server + `c-msys` client
    - `c-msys` server + `c-native` client
    - `c-msys` server/client
    - each in both `refresh once` and `refresh loop` modes
- Remaining work after this slice:
- the live fake producer/consumer fixture currently exists only in C
- Rust/Go now have transport-backed helper coverage in library tests, but they do not yet have matching live snapshot fixtures/CLI tools
- Rust helper has Windows-target compile coverage, but not a real Windows runtime smoke yet
- Go helper has Windows-target compile coverage, but not a real Windows runtime smoke yet
- the fake `cgroups` snapshot schema is still Linux-specific even though the refresh/cache methodology is now validated on Windows baseline too
- Current status:
- the fake `cgroups` snapshot service now has real live client coverage in all three languages on Linux baseline transports
- the live producer is still C-only, but the client-side refresh/cache helper path is now exercised end-to-end through:
  - C client helper
  - Go client helper
  - Rust client helper
- Landed in this slice:
- Go fixture CLI extended with live cache-helper commands in:
  - `tests/fixtures/go/main.go`
- Rust fixture CLI extended with live cache-helper commands in:
  - `tests/fixtures/rust/src/bin/netipc_codec_rs.rs`
- baseline live smoke expanded from C-only to a mixed-language client matrix in:
  - `tests/run-live-cgroups-baseline.sh`
- POSIX `ctest` workflow dependencies expanded so the live fake-service test always builds the Go and Rust fixture binaries too:
  - `CMakeLists.txt`
- What this slice validates:
- the same fake C snapshot producer can now be consumed live by:
- C client helper
- Go client helper
- Rust client helper
- output/lookup semantics stay identical across the three client implementations for:
  - `client-refresh-once`
  - `client-refresh-loop`
- the cache-backed helper layer is no longer validated only through unit/library tests in Go and Rust; it is now exercised through real baseline transport sessions too
- Validation completed for this slice:
- `cmake --build build --target netipc-cgroups-live-c netipc-codec-go netipc-codec-rs`
- `bash tests/run-live-cgroups-baseline.sh`
- `cargo test --manifest-path src/crates/netipc/Cargo.toml`
- `go test ./...` from `src/go`
- `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
- `GOOS=windows GOARCH=amd64 go build -o /tmp/netipc-codec-go.exe .` from `tests/fixtures/go`
- `cargo check --manifest-path tests/fixtures/rust/Cargo.toml --bin netipc-codec-rs --target x86_64-pc-windows-gnu`
- `/usr/bin/ctest --test-dir build --output-on-failure`
- Additional Windows runtime validation completed after this slice:
- the dedicated `win11` validation clone was updated with the current task files by overwrite-only sync (no reset/clean)
- `bash tests/run-live-cgroups-win.sh` now also builds and exercises:
  - native Go fake-service helper CLI
  - native Rust fake-service helper CLI
- result on `win11`: `16 passed, 0 failed`
- covered:
  - existing C matrix:
    - `c-native` server/client
    - `c-native` server + `c-msys` client
    - `c-msys` server + `c-native` client
    - `c-msys` server/client
  - new helper-client matrix:
    - `c-native` server + `go-native` client
    - `c-native` server + `rust-native` client
    - `c-msys` server + `go-native` client
    - `c-msys` server + `rust-native` client
  - each in both `refresh once` and `refresh loop` modes
- Important fact exposed by this slice:
- the new Rust live fake-service client path is Windows-target clean when checking the actual fixture binary used for fake-service work (`netipc-codec-rs`)
- the full Rust fixtures manifest is still not Windows-target clean, but that is due to the older POSIX-only `netipc_live_rs` benchmark/live fixture binary, not the new `cgroups` helper path
- Important workflow fix exposed by the Windows helper run:
- the first `win11` helper attempt failed because the Go helper output path in `tests/run-live-cgroups-win.sh` was relative to `tests/fixtures/go`, while later checks expected the binary under the repo-root build directory
- the script now uses repo-root absolute helper binary paths for both:
  - `netipc-codec-go.exe`
  - `netipc-codec-rs.exe`
- Remaining work after this slice:
- live fake-service producer coverage is still C-only; there is still no Rust or Go fake snapshot producer fixture
- mixed-language live fake-service coverage currently means:
- C producer -> C/Go/Rust clients
- not yet Go producer or Rust producer -> other-language clients
- the fake `cgroups` snapshot schema remains Linux-specific even though the refresh/cache methodology is now cross-platform
- Current status:
- the fake `cgroups` snapshot service now has real live producer/client coverage in all three languages on Linux baseline transports
- the same methodology now also has real runtime coverage on Windows baseline transports for all current producer/client implementations:
  - `c-native`
  - `c-msys`
  - `go-native`
  - `rust-native`
- Landed in this slice:
- Go fixture CLI extended with live fake-service producer commands in:
  - `tests/fixtures/go/main.go`
  - `tests/fixtures/go/cgroups_server_common.go`
  - `tests/fixtures/go/cgroups_server_unix.go`
  - `tests/fixtures/go/cgroups_server_windows.go`
- Rust fixture CLI extended with live fake-service producer commands in:
  - `tests/fixtures/rust/src/bin/netipc_codec_rs.rs`
- baseline live Linux smoke expanded from `C producer -> C/Go/Rust clients` to a full `C/Go/Rust` producer/client matrix in:
  - `tests/run-live-cgroups-baseline.sh`
- Windows live smoke expanded from the helper-client-only matrix to a full `c-native`/`c-msys`/`go-native`/`rust-native` producer/client matrix in:
  - `tests/run-live-cgroups-win.sh`
- What this slice validates:
- the fake `cgroups` snapshot service is now exercised live in both directions across the current implementation set instead of only as `C producer -> other-language clients`
- the fake producer behavior and the cache-backed client helper semantics stay identical across:
  - C
  - Go
  - Rust
- the cross-platform snapshot/cache methodology now has real runtime proof on Windows too, not only compile-target sanity
- Validation completed for this slice:
- Linux:
  - `cmake --build build --target netipc-cgroups-live-c netipc-codec-go netipc-codec-rs`
  - `bash tests/run-live-cgroups-baseline.sh`
  - result:
    - full `C/Go/Rust` producer/client matrix
    - each in both `refresh once` and `refresh loop` modes
- additional local sanity:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - `go test ./...` from `src/go`
  - `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
  - `cargo check --manifest-path tests/fixtures/rust/Cargo.toml --bin netipc-codec-rs --target x86_64-pc-windows-gnu`
- Windows on `win11` after overwrite-only sync of the current task files:
  - `bash tests/run-live-cgroups-win.sh`
  - result: `32 passed, 0 failed`
  - full `4 x 4` producer/client matrix across:
    - `c-native`
    - `c-msys`
    - `go-native`
    - `rust-native`
  - each in both `refresh once` and `refresh loop` modes
- Important fact exposed by this slice:
- the fake-service runtime matrix is now strong enough to close the current Phase 5 interop goal for the fake snapshot/cache service on baseline transports
- the remaining Rust Windows-target manifest gap is still limited to the older POSIX-only `netipc_live_rs` fixture binary and does not affect the fake-service CLI used here
- Remaining work after this slice:
- Phase 6 baseline performance coverage is now the next active phase:
- snapshot refresh path
- local cache lookup hot path
- cross-language producer/client benchmark coverage on baseline transports
- SHM redesign remains deferred to Phase 7 exactly as planned
- Context:
- the approved phase plan requires baseline performance coverage for:
- snapshot refresh path
- local cache lookup hot path
- the user did not freeze exact rate-limited targets for this new fake-service benchmark family
- Implementation choice:
- the first benchmark scripts will ship with two default scenarios for both refresh and local lookup paths:
  - `max` (`target_rps = 0`)
  - `1000/s`
- the scripts will keep these scenario sets configurable via environment variables so they can be widened later without rewriting the helper binaries
- Rationale:
- this follows the existing repo pattern of pairing `max` with at least one fixed-rate scenario
- `1000/s` is conservative enough for a first snapshot/cache benchmark while still giving rate-limited latency/CPU signal
- Facts:
- baseline benchmark workflows now exist for the fake `cgroups` snapshot/cache service on both OS families:
  - Linux baseline script: `tests/run-live-cgroups-bench.sh`
  - Windows baseline script: `tests/run-live-cgroups-win-bench.sh`
- the benchmark helpers now exist in all three language fixture sets:
  - C: `server-bench`, `client-refresh-bench`, `client-lookup-bench`
  - Go: `server-bench`, `client-refresh-bench`, `client-lookup-bench`
  - Rust: `server-bench`, `client-refresh-bench`, `client-lookup-bench`
- the build-system bug that previously left the Go fixture binary stale when new Go fixture source files were added is now fixed in `CMakeLists.txt` by tracking the full Go fixture source set as explicit dependencies.
- the Windows fake-service smoke-script wait-path bug is fixed in `tests/run-live-cgroups-win.sh`.
- Windows disconnect handling is now normalized for the fake snapshot/cache runtime paths in:
  - Go named-pipe transport: `ERROR_BROKEN_PIPE`, `ERROR_NO_DATA`, `ERROR_PIPE_NOT_CONNECTED`
  - Rust named-pipe transport: `ERROR_BROKEN_PIPE`, `ERROR_NO_DATA`, `ERROR_PIPE_NOT_CONNECTED`
- Validation completed for this slice:
- Linux:
  - `bash tests/run-live-cgroups-bench.sh`
  - result:
    - full directed refresh matrix for `C/Go/Rust`
    - local lookup self/self benchmark for `C/Go/Rust`
- repo-level workflow validation:
  - `cmake --build build --target netipc-bench-live-cgroups`
  - `cmake --build build --target netipc-bench`
  - result:
    - the new cgroups benchmark is now fully integrated into the benchmark workflow umbrella, not only runnable as a standalone script
- additional local sanity:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - `go test ./...` from `src/go`
  - `GOOS=windows GOARCH=amd64 go test ./...` from `src/go`
  - `cargo check --manifest-path tests/fixtures/rust/Cargo.toml --bin netipc-codec-rs --target x86_64-pc-windows-gnu`
- Windows on `win11`:
  - `bash tests/run-live-cgroups-win-bench.sh`
  - result:
    - full directed refresh matrix across:
      - `c-native`
      - `c-msys`
      - `go-native`
      - `rust-native`
    - local lookup self/self benchmark for the same four implementations
- Key measured facts:
- Linux refresh max throughput reaches the low-hundreds-of-thousands requests/s class on the best producer/client pairs:
  - `c -> rust`: about `201k req/s`
  - `rust -> rust`: about `206k req/s`
  - `go -> go`: about `119k req/s`
- Linux local lookup max throughput is in the multi-million lookups/s class:
  - `c -> c`: about `25.3M lookups/s`
  - `rust -> rust`: about `23.0M lookups/s`
  - `go -> go`: about `13.8M lookups/s`
- Linux rate-limited `1000/s` scenarios hold the target cleanly.
- Windows refresh max throughput is much lower for this first fake snapshot/cache service:
  - roughly `37-53 req/s` across the directed matrix on `win11`
- Windows local lookup max throughput splits into two classes:
  - `go-native`: about `26.1M lookups/s`
  - `c-native`, `c-msys`, `rust-native`: about `69k-75k lookups/s`
- Windows rate-limited `1000/s` local lookup scenarios hold the target cleanly.
- Working theory (explicit speculation):
- the very low Windows refresh throughput is more likely a property of the current fake full-snapshot refresh/rebuild path on baseline transports than a proof that the underlying transport methodology is invalid.
- this needs later stress-testing during the SHM redesign phase and when a less synthetic refresh path exists.
- Important fact exposed by this slice:
- Phase 6 baseline performance coverage is now complete for the approved fake snapshot/cache service methodology on baseline transports.
- the next active phase is Phase 7:
- SHM redesign for the real protocol and snapshot publication/consumption model.
- Phase 7 SHM redesign scope:
- Status: approved on 2026-03-13 after Phase 6 baseline coverage completed.
- Clarification added on 2026-03-13:
- the goal is not to maintain two separate user-visible SHM products.
- the desired end state is one SHM subsystem that, if realistic, can satisfy both:
- generic request/response IPC
- server-owned snapshot publication with client-side refresh/cache rebuild
- the remaining question is whether a single SHM design can serve both well, or whether the first implementation phase should optimize for one flow first and extend later.
- Context:
- the current POSIX SHM path still exposes one fixed-size request slot and one fixed-size response slot in `src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
- the current Windows SHM path mirrors the same fixed-frame single-request/single-response shape in `src/crates/netipc/src/transport/windows.rs`
- both SHM paths still expose `receive_increment`/`send_increment`/`call_increment`-style APIs
- the approved fake `cgroups` snapshot/cache service is different:
  - payloads are variable-size
  - the server owns snapshot size and layout
  - the client refreshes a local cache from that server-owned payload
- Evidence:
- POSIX SHM region layout:
  - `request_frame[NETIPC_FRAME_SIZE]`
  - `response_frame[NETIPC_FRAME_SIZE]`
  - in `src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
- Windows SHM region layout:
  - request slot and response slot sized as `FRAME_SIZE`
  - in `src/crates/netipc/src/transport/windows.rs`
- baseline fake service methodology is now validated on baseline transports and is the active next transport target.
- Options:
- Option A:
- redesign SHM first as a snapshot-publication transport for cache-backed services
- server owns one published snapshot region per service
- clients refresh by consuming that published snapshot
- Pros:
- best fit for the real `cgroups -> ebpf` replacement target
- directly addresses the approved fake service methodology
- avoids forcing snapshot publication into a per-request session model first
- Cons:
- generic request/response SHM still comes later
- Implications:
- Phase 7 stays tightly aligned with the current fake `cgroups` service and the first real Netdata target
- Risks:
- later generic RPC SHM may still need a second SHM subphase
- Option B:
- redesign SHM first as a generic per-session variable-message transport
- snapshot services keep using the normal request/response/session model on top of it
- Pros:
- more generic foundation
- one SHM path for all service types
- Cons:
- weaker fit for server-owned snapshot publication
- more complex for the first real SHM target
- Implications:
- snapshot/cache service continues to behave like a full refresh RPC over SHM
- Risks:
- may fail to deliver the main snapshot/publication benefit we actually want
- Option C:
- implement both SHM modes in Phase 7:
- snapshot-publication SHM for cache-backed services
- generic per-session variable-message SHM for normal RPC
- Pros:
- covers the whole design immediately
- Cons:
- by far the biggest scope jump
- Implications:
- Phase 7 becomes the hardest implementation phase by a wide margin
- Risks:
- highest chance of delay and rework
- Option D:
- redesign SHM once around a single generic publication/buffer model that can satisfy both:
- generic request/response IPC
- server-owned snapshot publication and client-side refresh
- Pros:
- one SHM subsystem
- avoids architectural split
- directly addresses the user's challenge that both patterns should ideally share one design
- Cons:
- requires a stronger first design than the old single-slot ping-pong model
- likely needs a control-plane/data-plane split inside the same SHM subsystem
- Implications:
- Phase 7 becomes a unified SHM redesign instead of a mode-first redesign
- Risks:
- if the generic model is too abstract or too slow, it could underperform both use cases
- Recommendation:
- Option D
- Reason:
- shared memory itself does not force two products; the real requirement is to replace the current fixed-slot ping-pong layout with one generic shared-data model that can carry both RPC responses and published snapshots.
- User decision:
- Option D approved on 2026-03-13.
- Decision:
- Phase 7 will implement one unified SHM subsystem.
- This SHM subsystem must be able to satisfy both:
- generic request/response IPC
- server-owned snapshot publication with client-side refresh/cache rebuild
- The redesign must avoid introducing two separate user-visible SHM products.
- Implication:
- the next SHM work should focus on a shared control/data model, not on separate snapshot-mode and RPC-mode products.
- Derived unified SHM control/data model:
- Status: derived from approved Decision #71 on 2026-03-13.
- Context:
- the current SHM layouts are still fixed-frame single-slot ping-pong regions:
  - POSIX: `src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
  - Windows: `src/crates/netipc/src/transport/windows.rs`
- the approved real protocol is variable-length, directional, and supports:
  - normal request/response calls
  - server-owned snapshot publication with client-side refresh/cache rebuild
- the fake `cgroups` helper already models the response side as server-owned and sized from negotiated response limits:
  - `src/libnetdata/netipc/src/service/netipc_cgroups_snapshot.c`
- Evidence:
- POSIX SHM region still hardcodes:
  - `request_frame[NETIPC_FRAME_SIZE]`
  - `response_frame[NETIPC_FRAME_SIZE]`
- Windows SHM region still mirrors the same request-slot / response-slot model.
- official SHM APIs are generic shared-memory primitives with user-defined synchronization:
  - POSIX `shm_overview(7)`
  - Windows file mapping documentation
- Derived design:
- SHM will be redesigned as one unified subsystem with:
- one shared control header
- one client-owned variable request publication area
- one server-owned variable response publication area
- each direction publishes at most one variable-length payload at a time in the first implementation slice
- publication metadata per direction will include:
- generation / sequence
- published byte length
- transport status
- optional close / waiting hints
- the payload bytes will contain the already-approved real protocol envelope:
- outer header
- optional batch item refs
- self-contained method payloads
- request/response and snapshot refresh therefore use the same SHM pattern:
- client publishes request bytes in the request publication area
- server consumes them
- server publishes response or snapshot bytes in the response publication area
- client consumes them
- Implication:
- the first SHM slice does not need a separate “snapshot mode”
- snapshot/cache services will simply use the same publication model with server-owned response bytes
- Risk:
- this first slice still supports only one in-flight publication per direction
- if later low-level pipelining is required on SHM, the control plane will need extension rather than another redesign
- Phase 7 first implementation slice:
- Status: derived on 2026-03-13 before coding started.
- Scope:
- implement the unified SHM publication model first in the C POSIX SHM path
- keep the old `increment` wrappers as compatibility helpers on top of the new generic message flow during migration
- use the same negotiated directional limits already approved in Decision #64
- Reason:
- the C fake `cgroups` helper and current SHM code live here first
- this gives the fastest honest end-to-end validation path before porting the same model to Windows, Rust, and Go
- Implication:
- Phase 7 will be delivered in internal slices, but all slices must conform to the same unified SHM model
- the first slice is not a design fork; it is only the implementation order
- Execution record:
- Implemented on 2026-03-13 in the C POSIX SHM path:
  - `src/libnetdata/netipc/include/netipc/netipc_shm_hybrid.h`
  - `src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
  - `src/libnetdata/netipc/src/transport/posix/netipc_uds_seqpacket.c`
- The old fixed-frame direct SHM slots were replaced with:
- one shared control header
- one variable-length client-owned request publication area
- one variable-length server-owned response publication area
- Compatibility kept:
- old `increment` wrappers still exist on top of the new generic byte-message path
- UDS negotiation now upgrades to the new SHM message path for the C POSIX slice
- Root cause found and fixed during validation:
- the first live SHM loop runs exposed an intermittent client-side failure during SHM endpoint creation
- concrete evidence:
- the client could open the `.ipcshm` file in the small window after `open(O_CREAT|O_EXCL)` and before the server had `ftruncate()`d and populated the region header
- in that state `fstat()` succeeded with size `0`, and the client path leaked a stale socket `errno` (`EAGAIN`) instead of reporting a retryable protocol-not-ready state
- fix applied:
- treat undersized freshly created SHM files as `EPROTO` / protocol-not-ready
- this keeps `create_shm_client_with_retry()` on the intended retry path until the server finishes SHM region setup
- Validation:
- focused reproducer: `100` consecutive SHM client/server loop runs passed after the fix
- repo scripts:
  - `tests/run-live-cgroups-shm.sh` passed repeatedly (`20` consecutive runs)
  - `tests/run-live-interop.sh` passed with the full direct `C <-> Rust` SHM matrix
  - `tests/run-live-uds-interop.sh` passed with the negotiated `profile=2` `C <-> Rust` SHM matrix
- full repo validation:
  - Rust: `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - Go: `go test ./...`
  - Go cross-build: `GOOS=windows GOARCH=amd64 go test ./...`
  - CTest: `/usr/bin/ctest --test-dir build --output-on-failure` passed `7/7`
- Execution record (Rust POSIX widening):
- Implemented on 2026-03-13 in the Rust POSIX SHM path:
  - `src/crates/netipc/src/transport/posix.rs`
- The Rust POSIX SHM transport now uses the same unified control/data model as the C POSIX slice:
- shared control header
- variable-length client-owned request publication area
- variable-length server-owned response publication area
- Compatibility kept:
- old
incrementwrappers still exist on top of the new generic byte-message path - UDS negotiation now upgrades to the new SHM message path for
C <-> Rustas well
- old
- Live validation restored:
- direct SHM matrix:
tests/run-live-interop.shnow coversC->C,C->Rust,Rust->C, andRust->Rust - negotiated SHM matrix:
tests/run-live-uds-interop.shnow coversprofile=2forC->C,C->Rust,Rust->C, andRust->Rust
- direct SHM matrix:
- Implemented on 2026-03-13 in the Rust POSIX SHM path:
- Execution record (Windows widening):
- Implemented on 2026-03-13 in the Windows SHM paths:
  - `src/libnetdata/netipc/src/transport/windows/netipc_shm_hybrid_win.c`
  - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
  - `src/crates/netipc/src/transport/windows.rs`
  - `src/go/pkg/netipc/transport/windows/pipe.go`
  - The C, Rust, and Go Windows transports now use the same unified SHM header/control model:
    - dynamic request/response publication capacities
    - directional negotiated message-size ceilings
    - generic byte-message SHM calls
    - legacy fixed-frame `increment` wrappers preserved on top for compatibility
  - Root cause found and fixed during live validation:
    - the first `win11` smoke after the SHM port showed that:
      - `Rust <-> Rust`, `Go <-> Go`, and `Rust <-> Go` all passed on `profile=2`
      - every remaining mixed `C <-> Rust/Go` `profile=2` failure broke with either:
        - `message size does not match header`
        - `invalid SHM frame length`
        - server/client timeouts after a broken receive path
    - concrete cause:
      - Rust and Go had moved legacy frame compatibility through the new generic message validators, so old 64-byte `increment` frames were being parsed as variable-message envelopes
      - after that was fixed, a second mixed-language failure remained:
        - the new Windows SHM header offsets did not match across languages
        - C defined the header layout with:
          - `spin_tries` at offset 32
          - `req_len` at offset 36
          - `resp_len` at offset 40
        - Rust and Go were incorrectly using:
          - `req_len` at 32
          - `resp_len` at 36
          - `spin_tries` at 40
    - fixes applied:
      - Rust and Go now keep raw frame compatibility on SHM separate from variable-message validation
      - Rust and Go SHM constants were aligned to the C header layout
      - C now has `_Static_assert(offsetof(...))` checks for the critical Windows SHM header fields to prevent future drift
- Validation:
  - local compile sanity:
    - Rust: `cargo check --manifest-path src/crates/netipc/Cargo.toml --target x86_64-pc-windows-gnu`
    - Go: `GOOS=windows GOARCH=amd64 go test ./...`
  - real Windows runtime on `win11`:
    - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
    - `go test ./...`
    - `bash tests/smoke-win.sh` passed with `32 passed, 0 failed`
  - the fake `cgroups` methodology is now validated on Windows across both:
    - baseline Named Pipe
    - unified SHM
    - `bash tests/run-live-cgroups-win.sh` passed with `64 passed, 0 failed`
  - the fake `cgroups` benchmark coverage is now also validated on real Windows across both:
    - baseline Named Pipe
    - unified SHM
    - `bash tests/run-live-cgroups-win-bench.sh` passed on `win11` after the Go teardown fixes below
- Additional real issue found and fixed during the Windows fake `cgroups` SHM widening:
  - the first real `win11` SHM benchmark rerun exposed a lifecycle-only bug in the Go helper path:
    - `go-native` server + `c-native` client + `shm-hybrid`
    - the benchmark payloads and throughput were correct, but the Go server exited with `server-loop failed: client closed`
  - concrete causes:
    - the high-level Go `cgroupssnapshot.Client` helper did not expose `Close()`, so benchmark/smoke clients could exit without explicitly closing the underlying transport
    - the Go Windows SHM server-side receive path returned a plain `errors.New("client closed")`, which the benchmark fixture did not classify as a graceful disconnect
  - fixes applied:
    - `src/go/pkg/netipc/service/cgroupssnapshot/client.go`
      - added `Client.Close()`
      - `disconnectTransport()` now uses the same close path
    - `tests/fixtures/go/main.go`
      - all fake `cgroups` client commands now `defer client.Close()`
    - `tests/fixtures/go/cgroups_bench.go`
      - both refresh and lookup benchmark clients now `defer client.Close()`
    - `src/go/pkg/netipc/transport/windows/pipe.go`
      - the SHM server receive path now returns `io.EOF` on client-close instead of a plain string error
  - implication:
    - Windows fake `cgroups` runtime and benchmark teardown now use structured graceful-disconnect semantics across C, Rust, and Go instead of relying on ad-hoc string matching
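The `io.EOF` classification above can be sketched as follows; `serverReceive` and `serveLoop` are hypothetical stand-ins for the fixture logic, not the repository's functions:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// serverReceive sketches the fixed receive path: on client-close the
// server returns io.EOF, so callers can classify the disconnect
// structurally instead of string-matching "client closed".
func serverReceive(clientClosed bool) ([]byte, error) {
	if clientClosed {
		return nil, io.EOF // graceful disconnect, not an error string
	}
	return []byte("frame"), nil
}

// serveLoop mirrors how a benchmark fixture can now tell a clean
// teardown apart from a real failure.
func serveLoop() error {
	for i := 0; ; i++ {
		_, err := serverReceive(i == 2) // pretend the client closes after 2 frames
		if errors.Is(err, io.EOF) {
			return nil // clean shutdown
		}
		if err != nil {
			return fmt.Errorf("server-loop failed: %w", err)
		}
	}
}

func main() {
	fmt.Println(serveLoop() == nil) // true: EOF treated as graceful teardown
}
```

Returning `io.EOF` is the idiomatic Go signal for end-of-stream, which is what lets `errors.Is(err, io.EOF)` replace the earlier ad-hoc string match.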
- Current status:
  - the unified SHM model is now validated for the legacy fixed-frame compatibility path in:
    - POSIX:
      - C
      - Rust
    - Windows:
      - C native
      - C msys
      - Rust native
      - Go native
  - the approved snapshot/cache methodology is also validated on Windows baseline transports with:
    - C native producer/client
    - C msys producer/client
    - Rust native producer/client
    - Go native producer/client
  - the approved snapshot/cache methodology is now also validated on Windows over the unified SHM path with the same four producer/client implementations
  - the remaining practical gap is narrower now:
    - the fake `cgroups` helper on Windows still defaults to Named Pipe unless profiles are overridden
    - public helper/profile selection policy still needs to be finalized later; the runtime methodology itself is now validated on both Windows baseline and SHM paths
- Go POSIX SHM implementation strategy:
- Status: pending user decision discovered on 2026-03-13 while widening the Linux fake `cgroups` SHM matrix from `C -> C` to `C/Rust/Go`.
- Context:
  - the widened Linux fake `cgroups` SHM smoke immediately exposed that the remaining gap is no longer in the fake snapshot/cache helper layer.
  - the actual gap is the underlying Go POSIX transport:
    - the Linux SHM fake `cgroups` matrix currently passes for `C` and `Rust`
    - the first `Go` participant fails at connection/negotiation time with `operation not supported`
- Evidence:
  - `tests/run-live-cgroups-shm.sh` was widened from `C -> C` only to a `C/Go/Rust` producer-client matrix
  - targeted repro:
    - `C server + Go client` with `NETIPC_SUPPORTED_PROFILES=2 NETIPC_PREFERRED_PROFILES=2`
      - client error: `client-refresh-once failed: operation not supported`
    - `Go server + C client` with the same profile override:
      - server error: `server-once failed: operation not supported`
      - client side then sees `Connection reset by peer`
  - current Go POSIX transport facts:
    - `src/go/pkg/netipc/transport/posix/seqpacket.go`
      - declares `ProfileSHMHybrid`
      - but `implementedProfiles = ProfileUDSSeqpacket`
      - so profile `2` is still not implemented on the Go POSIX path
  - current C and Rust POSIX SHM facts:
    - C uses unnamed shared-memory POSIX semaphores (`sem_t`) embedded in the shared region:
      - `src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
      - `sem_t req_sem;` / `sem_t resp_sem;`
      - `sem_init`, `sem_post`, `sem_timedwait`, `sem_destroy`
    - Rust mirrors the same shared-region semaphore model through `libc::sem_t`:
      - `src/crates/netipc/src/transport/posix.rs`
- Problem:
  - to make Go interoperate with the already-implemented unified POSIX SHM model, the Go Unix transport needs access to the same unnamed process-shared semaphore primitives (`sem_init`, `sem_post`, `sem_timedwait`, `sem_destroy`) used by C and Rust.
  - pure-Go `syscall` / `x/sys/unix` do not expose those POSIX semaphore functions directly.
  - Additional researched facts:
    - Go and `x/sys/unix` do expose raw Linux futex and eventfd syscalls at the syscall-number level:
      - local evidence:
        - `/usr/lib/go/src/syscall/zsysnum_linux_amd64.go` defines `SYS_FUTEX`
        - `/home/costa/go/pkg/mod/golang.org/x/sys@v0.41.0/unix/zsysnum_linux_amd64.go` defines `SYS_FUTEX`, `SYS_FUTEX_WAIT`, `SYS_FUTEX_WAKE`, and `SYS_EVENTFD2`
      - primary references:
        - `futex(2)` / `futex(7)` on man7.org
        - `eventfd(2)` on man7.org
    - other shared-memory IPC projects do not solve this by embedding POSIX unnamed semaphores into the shared memory region:
      - `/tmp/shmipc-go-purego/README.md`
        - pure-Go SHM IPC uses Unix/TCP synchronization plus shared memory queues
      - `/tmp/shmem-ipc-rs/src/sharedring.rs`
        - Rust shared-memory ring buffer uses `eventfd` for signaling, not `sem_t`
      - `/tmp/iceoryx-shm/doc/website/advanced/iceoryx-on-32-bit.md`
        - explicitly notes that production spin primitives should become `futex` on Linux and `WaitOnAddress` on Windows
    - futex is a Linux userspace API, not a portable Unix API:
      - primary reference: `futex(2)` / `futex(7)` on man7.org
    - `eventfd` is also Linux-specific:
      - primary reference: `eventfd(2)` on man7.org
    - local syscall metadata shows System V semaphore syscalls exist across Linux, FreeBSD, and Darwin targets in `x/sys/unix`, while futex does not provide that same portable Unix story
- Decision needed:
  - we need one synchronization primitive for the Unix SHM path that:
    - is usable from C, Rust, and pure Go
    - does not require `cgo`
    - still fits the approved single unified SHM design instead of creating a Go-only fork
- Options:
  - A. Redesign Linux POSIX SHM synchronization around shared-memory futex words
    - Pros:
      - pure Go can use `SYS_FUTEX` via raw syscalls
      - C and Rust can use the same futex words directly
      - stays inside the shared memory region, so it still matches the unified SHM design
      - maps conceptually well to Windows `WaitOnAddress`
    - Cons:
      - Linux-specific, not generic POSIX
      - requires reworking already-working C and Rust Linux SHM synchronization
    - Implications:
      - the Unix SHM implementation becomes Linux-first instead of generic-POSIX-first
      - `ProfileSHMFutex` can become the real Linux SHM profile and replace the current semaphore-backed path
    - Risks:
      - medium implementation risk
      - low long-term drift risk
  - B. Replace embedded `sem_t` with external System V semaphores
    - Pros:
      - pure Go can drive `semget`/`semop`/`semtimedop` via raw syscalls
      - C and Rust can also interoperate with them
    - Cons:
      - synchronization state moves outside the shared memory region
      - lifecycle/cleanup becomes more complex
      - weaker fit for the approved unified SHM design
    - Implications:
      - SHM would depend on extra kernel semaphore objects keyed outside the mapped region
    - Risks:
      - medium to high operational complexity
      - more cleanup corner cases
  - C. Keep the current semaphore-backed Linux SHM and leave Go POSIX SHM unsupported
    - Pros:
      - no redesign now
    - Cons:
      - Linux `C/Rust/Go` SHM methodology remains incomplete
      - contradicts the goal of complete cross-language coverage
    - Implications:
      - Go would stay baseline-only on Linux while Windows supports SHM
    - Risks:
      - high product inconsistency and permanent documentation complexity
- Recommendation:
  - A. Redesign Linux POSIX SHM synchronization around futex words.
  - Reason:
    - it is the only option that stays pure-Go, preserves one shared-memory control model, and gives us a clean conceptual bridge to Windows `WaitOnAddress` instead of growing another semaphore subsystem.
- Clarification:
  - `A` is a Linux-first solution.
  - it will not give us one shared Unix SHM implementation for macOS, FreeBSD, and Linux.
  - if the requirement is one pure-Go Unix SHM methodology across Linux + FreeBSD + macOS, we need a different decision than `A`.
- Unix SHM portability target:
- Status: resolved by user decision on 2026-03-13.
- Context:
- the approved SHM architecture is one unified SHM subsystem and one unified protocol/control model.
- that does not require one identical kernel synchronization primitive on every Unix.
- the real question is whether Unix SHM itself must be first-class on:
- Linux
- FreeBSD
- macOS
- Evidence:
  - `74.A` relies on `futex`, which is Linux-specific:
    - see `TODO-plugin-ipc.md` Decision `74`
    - primary reference: `futex(2)` / `futex(7)` on man7.org
  - `eventfd` is also Linux-specific:
    - primary reference: `eventfd(2)` on man7.org
  - local syscall metadata shows System V semaphore syscalls across Linux, FreeBSD, and Darwin targets in `x/sys/unix`, while futex does not give that same portable Unix story
- Decision:
- implement Unix SHM for Linux
- FreeBSD and macOS fall back to UDS for now
- Implication:
- the unified SHM protocol/layout remains the design target
- the first Unix SHM backend is Linux-only
- FreeBSD/macOS stay baseline-only until a native pure-Go backend is justified later
- Options:
  - A. Linux-first SHM
    - Meaning:
      - Linux gets unified SHM using futex
      - macOS and FreeBSD stay baseline-only for now
    - Pros:
      - fastest path to finish
      - lowest short-term implementation risk
    - Cons:
      - Unix SHM is not portable
    - Implications:
      - docs/tests must explicitly say SHM-on-Unix means Linux only
    - Risks:
      - medium design debt if macOS/FreeBSD matter later
  - B. Force one portable Unix synchronization primitive for SHM
    - Meaning:
      - choose a single primitive that can work across Linux + FreeBSD + macOS, likely external semaphores rather than futex
    - Pros:
      - one Unix synchronization story
    - Cons:
      - likely weaker fit than futex on Linux
      - higher complexity and cleanup burden
      - risks pulling the design away from the current shared-memory control model
    - Implications:
      - Linux gives up its best native primitive for portability
    - Risks:
      - medium to high performance/complexity risk
  - C. One unified SHM API/layout, but OS-specific synchronization backends
    - Meaning:
      - same public SHM behavior, same control/data layout, same protocol
      - Linux uses futex
      - FreeBSD/macOS use their best pure-Go-usable backend
    - Pros:
      - preserves one product design
      - preserves Linux performance
      - keeps portability possible without forcing a bad common denominator
    - Cons:
      - more implementation work
      - more OS-specific testing
    - Implications:
      - the sync backend becomes a platform detail, not a protocol detail
    - Risks:
      - medium implementation risk
- User choice:
- A. Linux-first SHM.
- Derived implementation constraint:
  - keep the public Unix fast SHM profile as `profile 2` / `SHM_HYBRID`
  - replace only the Linux synchronization backend under that profile
  - use one shared control/data layout for all Unix builds:
    - replace embedded `sem_t` objects with shared `req_signal`/`resp_signal` words
    - Linux waits/wakes those words with futex
    - non-Linux Unix builds stop advertising SHM and stay on UDS
- Reason:
  - the current docs, scripts, helper layer, and validation already treat `profile 2` as the fast Unix SHM path
  - changing the public profile now would create unnecessary spec drift
- Negotiated payload and batch limits:
- Status: historical, superseded by Decision #64 on 2026-03-12.
- Source: early user decision on 2026-03-12 before the handshake ownership model was fully stress-tested against snapshot/cache services.
- Decision:
  - `max_payload_bytes` refers to one single request payload or one single response payload
  - default `max_payload_bytes` should be `1024`
  - `max_batch_items` remains a negotiated limit
  - `max_batch_bytes` should be derived from `max_payload_bytes * max_batch_items`
- Implication:
  - the `1024` default remains valid
  - the symmetric handshake reading here is no longer authoritative; Decision #64 replaced it with directional request/response limits and per-direction derived batch bytes
- Go representation for zero-allocation string views:
- Source: user decision on 2026-03-12 (`18.B`).
- Decision:
  - Go should expose a dedicated wrapper type such as `StringView`/`CStringView` for decoded ephemeral borrowed string fields
  - Go should not expose normal `string` values by default for decoded zero-allocation views
- Implication:
  - the Go API itself will signal borrowed/view semantics instead of looking like an ordinary durable string API
  - explicit copy helpers can exist for callers that need owned `string` values outside the current call/callback lifetime
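The decided wrapper shape can be sketched as follows; the field and method names are assumptions for this sketch, not the repository's frozen API:

```go
package main

import "fmt"

// StringView is an illustrative sketch of the decided wrapper type: a
// borrowed, zero-allocation view over decoded payload bytes, valid only
// during the current call/callback.
type StringView struct {
	b []byte // borrowed slice into the transport/decode buffer
}

// Bytes exposes the borrowed bytes without copying.
func (v StringView) Bytes() []byte { return v.b }

// CopyString materializes an owned string for callers that need the
// value beyond the view's lifetime (the explicit copy helper the
// decision allows).
func (v StringView) CopyString() string { return string(v.b) }

func main() {
	payload := []byte("system.slice\x00") // offset+length-style field with trailing NUL
	view := StringView{b: payload[:len(payload)-1]}
	owned := view.CopyString() // survives after payload is reused
	payload[0] = 'X'           // simulate buffer reuse after the call returns
	fmt.Println(owned)         // prints "system.slice": the owned copy is unaffected
}
```

The wrapper type makes the borrow visible in signatures, so a caller cannot accidentally store a `string` that silently aliased a reused buffer.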
- Per-method payload layout discipline.
  - Resolved by Decisions #47, #49, #50, #53, and the derived protocol sketch:
    - each method payload is self-contained
    - each method uses a small fixed method-local header plus method-local scalar fields and `offset + length` members for variable data
- String field representation inside method payloads.
  - Resolved by Decision #50 (`Option A`):
    - strings are represented by `offset + length`
    - the pointed bytes must also end with `\0`
- Decoded payload/view lifetime model.
  - Resolved by Decisions #48, #51, #52, and #56:
    - decoded request/response objects are non-owning ephemeral views
    - they are valid only during the current library call / callback
- Go representation for zero-allocation string views.
  - Resolved by Decision #55 (`Option B`).
- High-level API shape under the strict ephemeral-view lifetime rule.
  - Resolved by Decision #56 (`Option A`).
- High-level zero-copy API shape under the strict ephemeral-view lifetime rule:
- Source: user decision on 2026-03-12 (`19.A`).
- Decision:
- high-level zero-copy APIs must be callback-based
- the library invokes the caller callback with the decoded response view while that view is still valid
- high-level APIs must not return ephemeral decoded views directly to their callers
- Implication:
- callers must consume or copy response data inside the callback
- explicit copy/materialize helpers may still exist as slower convenience paths
- the strict ephemeral lifetime rule remains honest and enforceable in C, Rust, and Go
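The decided callback shape can be sketched as follows; `Response`, `call`, and the buffer-reuse simulation are illustrative assumptions, not the repository's API:

```go
package main

import "fmt"

// Response is a stand-in decoded view; like the real ephemeral views it
// must not be retained after the callback returns.
type Response struct {
	items [][]byte // borrowed slices into the receive buffer
}

// call sketches the decided callback-based zero-copy shape: the library
// decodes into a view, hands it to the caller while the backing buffer
// is still valid, and reclaims the buffer as soon as the callback returns.
func call(buf []byte, handle func(*Response) error) error {
	resp := &Response{items: [][]byte{buf}}
	err := handle(resp)  // view is valid only inside this call
	resp.items = nil     // invalidate the view on return
	for i := range buf { // simulate buffer reuse for the next message
		buf[i] = 0
	}
	return err
}

func main() {
	buf := []byte("cgroup-a")
	var copied string
	_ = call(buf, func(r *Response) error {
		copied = string(r.items[0]) // consume or copy inside the callback
		return nil
	})
	fmt.Println(copied) // prints "cgroup-a": the copy survives buffer reuse
}
```

Because the view never escapes the call, the lifetime rule is enforced by API shape rather than by documentation alone.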
- First concrete method schema for Phase 2:
- Status: pending user decision discovered during Phase 1 implementation on 2026-03-12.
- Context:
  - executable code still only implements the synthetic `increment` method
  - the `ip_to_asn` method in this TODO is still an illustrative design example, not a frozen production schema
- Options:
  - Option A: freeze the real `ip_to_asn` v1 request/response schema now and use it as the first concrete method implementation
  - Option B: introduce a temporary synthetic string-bearing method only to build the generic view/builder machinery, then replace it later with the first real production method
  - Option C: keep Phase 2 deferred for now and continue Phase 3 transport migration on top of the compatibility `increment` method until the first real production method schema is frozen
- Cache-backed high-level service helpers:
- Status: pending user decision raised on 2026-03-12 while evaluating whether `plugin-ipc` should replace the existing `cgroups.plugin -> ebpf.plugin` channel.
- the existing
ebpf <-> cgroupsintegration is a periodically refreshed shared snapshot thatebpf.pluginturns into a local cache - the cached state is then reused across many hot paths instead of making a request for every lookup
- evidence:
- producer populates the shared snapshot in
/home/costa/src/netdata/netdata/src/collectors/cgroups.plugin/cgroup-discovery.c - consumer refreshes local cached state in
/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_cgroup.c - hot paths later consume that cached state, for example in
/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_process.c
- producer populates the shared snapshot in
- the existing
- Options:
  - Option A: keep the library generic and expose only transport/service primitives; each plugin owns any local cache it needs
  - Option B: add a higher-level library layer for cache-backed services, so callers can use helpers such as `refresh_cgroups()` and `lookup_cgroup()`
  - Option C: support both, with cache-backed helpers built strictly on top of the generic client/service layer
- User decision:
- Option C approved on 2026-03-12
- Decision:
  - `plugin-ipc` keeps a generic typed client/service core
  - the library also provides optional cache-backed high-level helpers for snapshot-style services
  - cache-backed helpers must be implemented strictly on top of the generic client/service layer, not as a separate transport path
- Implication:
  - lookup-style services and snapshot/cache-style services can coexist without forcing one model on all providers
  - `cgroups.plugin -> ebpf.plugin` becomes a valid target use case for the helper layer
  - layering must stay strict so cache helpers cannot drift into a second IPC implementation
- No Netdata-repo integration before a fully tested fake producer exists:
- Source: user decision on 2026-03-12.
- Decision:
  - before touching the Netdata repository for real integration work, `plugin-ipc` must first gain a fake producer with dummy data and full tests
  - the fake producer must exercise the intended snapshot/cache-backed service flow end-to-end inside this repository
- Implication:
- the first implementation target is a self-contained fake service/provider plus client-side cache helper and tests
- Netdata-agent integration stays blocked until the fake service validates the protocol, API, cache lifecycle, and test coverage
- First fake producer/service target before Netdata integration:
- Status: approved on 2026-03-12 after Decision #59.
- Context:
  - this repository currently has only `increment`-based fixtures, helpers, and tests
  - evidence:
    - C fixture/codec tool: `/home/costa/src/plugin-ipc.git/tests/fixtures/c/netipc_live_c.c`, `/home/costa/src/plugin-ipc.git/tests/fixtures/c/netipc_codec_tool.c`
    - Go fixture: `/home/costa/src/plugin-ipc.git/tests/fixtures/go/main.go`
    - Rust fixture: `/home/costa/src/plugin-ipc.git/tests/fixtures/rust/src/bin/netipc_live_rs.rs`
    - transport APIs are still `receive_increment`/`send_increment`/`call_increment` oriented in C, Rust, and Go
  - the newly approved cache-backed helper layer should be validated with a fake producer before any Netdata-repo integration
- Options:
  - Option A: make the first fake service a `cgroups`-style snapshot/cache service with dummy cgroup/container records and a cache-backed client helper
  - Option B: make the first fake service an `ip_to_asn`-style request/response lookup service, and postpone cache-backed snapshot validation
  - Option C: make the first fake service a tiny synthetic snapshot service unrelated to real Netdata domains, just to validate the cache/helper mechanics before the real schema is chosen
- User decision:
- Option A approved on 2026-03-12
- Decision:
  - the first fake producer/service in this repository will be a `cgroups`-style snapshot/cache service with dummy cgroup/container records
  - it must include a cache-backed client helper and full end-to-end tests before any Netdata-repo integration work begins
- Implication:
  - the next implementation target is a fake snapshot producer plus cache refresh/lookup helper, not `ip_to_asn`
  - the first real method family and test fixtures should validate snapshot refresh, local cache rebuild/use, and dummy cgroup/container lookups
- Fake `cgroups` snapshot schema scope:
- Status: approved on 2026-03-12.
- Context:
  - the first fake service is now a `cgroups`-style snapshot/cache service
  - the current Netdata SHM layout contains transport/internal fields and historical details that should not be copied blindly into the new public service contract
- Options:
  - Option A: mirror the current Netdata SHM layout closely
  - Option B: define a cleaned public snapshot contract for `plugin-ipc`
  - Option C: keep a cleaned public contract plus optional compatibility/debug fields
- User decision:
- Option B approved on 2026-03-12
- Decision:
  - the fake `cgroups` snapshot service should use a cleaned public `plugin-ipc` contract, not a direct copy of the current Netdata SHM structs
- Implication:
- the fake service validates the new library design rather than preserving historical SHM baggage
- later Netdata integration may still require an adaptation layer from current producer data to the cleaned service contract
- Snapshot-style services must not derail core transport/protocol implementation:
- Source: user clarification on 2026-03-12.
- Decision:
- underlying protocol/transport implementation, correctness testing, and performance coverage remain the primary priority
- the new cache-backed snapshot helper layer is a consumer of that foundation, not a replacement for it
- Fact:
  - snapshot-style services invert data-shape ownership compared to the current request-sized flow:
    - for lookup/request-response traffic, the client shapes the request payload
    - for snapshot publication, the server determines item count, packed payload sizes, and total snapshot size
- Implication:
- the protocol and transport layers must support both client-shaped request payloads and server-shaped snapshot payloads cleanly
- the fake snapshot service should be used to stress the generic protocol implementation, not to bypass it
- Transport scope of the first fake snapshot/cache service:
- Status: approved on 2026-03-12 via the strict phased execution plan.
- Context:
  - the current reusable transport layer is still effectively `increment`-shaped in C, Rust, and Go
  - baseline transports (`UDS_SEQPACKET`/`Named Pipe`) already carry discrete messages naturally
  - current SHM implementations are still single-slot fixed-frame paths and do not yet model server-owned variable-size snapshot publication cleanly
  - evidence:
    - C API still exposes `receive_increment`/`send_increment`/`call_increment` in `/home/costa/src/plugin-ipc.git/src/libnetdata/netipc/include/netipc/netipc_uds_seqpacket.h`
    - Rust POSIX transport still exposes `receive_increment`/`send_increment`/`call_increment` in `/home/costa/src/plugin-ipc.git/src/crates/netipc/src/transport/posix.rs`
    - Go POSIX transport still exposes `ReceiveIncrement`/`SendIncrement`/`CallIncrement` in `/home/costa/src/plugin-ipc.git/src/go/pkg/netipc/transport/posix/seqpacket.go`
    - current SHM implementations remain single-slot request/response paths in:
- Options:
  - Option A: implement and fully test the fake snapshot/cache service first on baseline transports only (`UDS_SEQPACKET` on POSIX, `Named Pipe` on Windows), then redesign SHM for snapshot publication in a later phase
  - Option B: require the first fake snapshot/cache service to work on SHM too from the start
  - Option C: implement the fake snapshot service first on baseline transports, but also add SHM-specific protocol tests/artifacts immediately even if the full snapshot service does not run on SHM yet
- User decision:
  - Option A approved on 2026-03-12 as part of the approved strict phase plan (`Phase 2` baseline transport migration, `Phase 7` SHM redesign)
- Decision:
- the first fake snapshot/cache service will be implemented and fully tested first on baseline transports only
- SHM redesign remains required, but it follows only after the fake snapshot/cache service is correct, tested, and performance-covered on the baseline transports
- Implication:
- the protocol and API can stabilize before the more difficult SHM publication redesign starts
- SHM is still in scope for the full project, but it is no longer a blocker for starting the fake producer/service work
- Handshake ownership model for request and response sizing:
- Status: approved on 2026-03-12 before fake snapshot service implementation.
- Context:
  - the current protocol core and TODO still model sizing as symmetric:
    - one `max_payload_bytes`
    - one derived `max_batch_bytes`
    - no separate request-owned vs response-owned limits
  - evidence:
    - C hello payload currently carries only `max_batch_items`, `max_batch_bytes`, and `max_payload_bytes` in `/home/costa/src/plugin-ipc.git/src/libnetdata/netipc/include/netipc/netipc_schema.h`
    - C implementation encodes/decodes only that symmetric set in `/home/costa/src/plugin-ipc.git/src/libnetdata/netipc/src/protocol/netipc_schema.c`
    - Rust protocol core mirrors the same fields in `/home/costa/src/plugin-ipc.git/src/crates/netipc/src/protocol.rs`
    - current TODO decision #54 also defines `max_batch_bytes = max_payload_bytes * max_batch_items`
  - snapshot/cache services introduce asymmetric ownership:
    - client shapes request payload sizes
    - server shapes response/snapshot payload sizes
- Options:
  - Option A: keep one symmetric `max_payload_bytes` contract and force both sides to live within it
  - Option B: negotiate separate request and response payload limits, with client controlling request sizing and server controlling response sizing
  - Option C: negotiate separate request and response limits plus separate batch limits for client-batched requests and server-published snapshots
- User decision:
- Option C approved on 2026-03-12
- Decision:
- the handshake must negotiate directional limits instead of one symmetric payload contract
- ownership of payload shape stays in the compiled service/method contract
- runtime handshake negotiates ceilings for each direction separately
- batch bytes remain derived per direction, not independently negotiated
- Directional contract:
  - request direction:
    - `max_request_payload_bytes`
    - `max_request_batch_items`
  - response direction:
    - `max_response_payload_bytes`
    - `max_response_batch_items`
- Implication:
- Decision #54 is superseded in protocol shape by this directional model
- hello / hello-ack payloads in C, Rust, and Go must change before fake snapshot implementation begins
- snapshot/cache services and request/response lookup services can now share one stable handshake model without forcing another redesign later
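The directional contract and its per-direction derived batch bytes can be sketched as follows. The field names follow the directional contract above; the struct itself and the example values (other than the documented `1024` request default) are illustrative assumptions, not the frozen wire layout:

```go
package main

import "fmt"

// helloLimits sketches the decided directional handshake contract:
// separate request/response payload and batch-item ceilings, with batch
// bytes derived per direction rather than negotiated independently.
type helloLimits struct {
	maxRequestPayloadBytes  uint32
	maxRequestBatchItems    uint32
	maxResponsePayloadBytes uint32
	maxResponseBatchItems   uint32
}

// Derived batch-byte ceilings, one per direction (not carried on the wire).
func (h helloLimits) maxRequestBatchBytes() uint32 {
	return h.maxRequestPayloadBytes * h.maxRequestBatchItems
}

func (h helloLimits) maxResponseBatchBytes() uint32 {
	return h.maxResponsePayloadBytes * h.maxResponseBatchItems
}

func main() {
	// Client-shaped requests stay small; server-shaped snapshots can be
	// larger. The response-side values are hypothetical examples.
	h := helloLimits{
		maxRequestPayloadBytes:  1024,
		maxRequestBatchItems:    16,
		maxResponsePayloadBytes: 4096,
		maxResponseBatchItems:   256,
	}
	fmt.Println(h.maxRequestBatchBytes(), h.maxResponseBatchBytes()) // prints "16384 1048576"
}
```

Keeping batch bytes derived rather than negotiated means the two sides can never agree on a batch ceiling that contradicts their own payload-item limits.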
- Overall execution model for implementing the project:
- Status: approved on 2026-03-12 after agreeing that the project must be delivered in phases.
- Context:
  - Phase 1 protocol-core work already landed, but its current hello/hello-ack fields are now outdated by Decision #64
  - fake producer work is required before any Netdata-repo integration (Decision #59)
  - snapshot/cache services are now in scope (Decision #58), but SHM still uses a fixed single-slot request/response layout today
  - evidence:
    - current phase plan starts at `TODO-plugin-ipc.md`
    - current hello/hello-ack structs are still symmetric in `/home/costa/src/plugin-ipc.git/src/libnetdata/netipc/include/netipc/netipc_schema.h`
    - current SHM path is still fixed-frame/single-slot in `/home/costa/src/plugin-ipc.git/src/libnetdata/netipc/src/transport/posix/netipc_shm_hybrid.c`
- Options:
- Option A:
- do the work in strict phases, with each phase frozen, tested, and documented before the next one starts
- Option B:
- implement protocol, fake service, cache helpers, and SHM redesign in parallel to reduce total wall-clock time
- Option C:
- do baseline transports and fake service first, but allow opportunistic partial SHM work in parallel when convenient
- User decision:
- Option A approved on 2026-03-12
- Decision:
- the project will be implemented in strict phases
- each phase must be frozen, tested, and documented before the next phase begins
- Implication:
- implementation can proceed autonomously within each approved phase
- phase boundaries become explicit review/control points and reduce protocol/API drift
- Fake cleaned `cgroups` snapshot service contract:
- Status: approved on 2026-03-12 with real Netdata reuse as the target.
- Context:
- the current Netdata producer shares:
- header: `cgroup_root_count`, `cgroup_max`, `systemd_enabled`, `body_length`
- item: `name`, `hash`, `options`, `enabled`, `path`
- evidence:
- the current `ebpf` consumer actually uses:
- `hash` + `name` as identity
- `options` to derive `systemd`
- `path` to read `cgroup.procs`
- evidence:
- user requirement:
- the fake contract and fake test must be suitable to be used later with the real `cgroups` and `ebpf` plugins exactly as-is
- implication:
- this fake service is no longer just a representative rehearsal
- it must preserve the real consumer semantics, while still excluding clearly transport/internal fields like `body_length`
- Options:
- Option A:
- minimal cleaned snapshot
- snapshot metadata: `generation`, `item_count`, `systemd_enabled`
- item fields: `name`, `path`, `options`, `enabled`
- Pros:
- cleanest public contract
- no derived/internal fields
- Cons:
- less direct rehearsal for the current `ebpf` consumer, which keys on `hash` + `name`
- Implications:
- later Netdata integration computes any needed identity locally
- Risks:
- may hide one real consumer need until integration
- Option B:
- cleaned snapshot with explicit stable identity
- snapshot metadata: `generation`, `item_count`, `systemd_enabled`
- item fields: `stable_id`, `name`, `path`, `options`, `enabled`
- Pros:
- still cleaned
- closer to the current `ebpf` consumer behavior
- easier future Netdata adaptation
- Cons:
- one more public field to define and maintain
- Implications:
- we must define what `stable_id` means in the fake service
- Risks:
- if the identity story changes later, this may need adjustment
- Option C:
- infrastructure-only toy subset
- snapshot metadata: `generation`, `item_count`
- item fields: `name`, `path`, `enabled`
- Pros:
- fastest to implement
- Cons:
- weaker rehearsal for the real replacement
- Implications:
- likely follow-up schema churn later
- Risks:
- validates the mechanics, but not enough of the real service contract
- User decision:
- the fake contract and fake test must be suitable to be used later with the real `cgroups` and `ebpf` plugins exactly as-is
- Current accepted direction after user clarification:
- the fake contract must be Netdata-ready for the real `cgroups`/`ebpf` replacement path
- transport/internal fields remain excluded
- semantic fields actually used by the consumer remain in scope
- Decision:
- the fake/public `cgroups` snapshot contract must be suitable for later use by the real `cgroups` and `ebpf` plugins without redesign
- transport/internal fields such as `body_length` remain excluded
- semantic fields actually used by the real consumer remain in scope
- First fake snapshot refresh contract:
- Status: approved on 2026-03-12.
- Context:
- the current `ebpf` side refreshes local state periodically from the producer snapshot
- this fake service is meant to validate the cache-backed helper layer before Netdata integration
- Options:
- Option A:
- always return the full snapshot
- Request:
- no payload or only a trivial control field
- Response:
- full snapshot every refresh
- Pros:
- simplest
- best first implementation
- easiest to test and benchmark
- Cons:
- no `unchanged` short-circuit yet
- Implications:
- good fit for the first fake service
- Risks:
- some extra traffic during refreshes
- Option B:
- generation-aware refresh
- Request: `known_generation`
- Response:
- either `unchanged` or a full snapshot with a new generation
- Pros:
- closer to a production-style cache refresh API
- less unnecessary refresh traffic
- Cons:
- more state and protocol logic in the first service
- Implications:
- the cache helper gets more realistic behavior early
- Risks:
- more implementation complexity in the first real method family
- Option C:
- pagination / partial refresh
- Pros:
- most scalable
- Cons:
- too much for the first fake service
- Implications:
- not fit for the current phased plan
- Risks:
- slows the project for low immediate value
- User decision:
- Option A approved on 2026-03-12
- Decision:
- the first fake snapshot service will always return the full snapshot on refresh
- generation-aware or differential refresh stays out of the first implementation
- Implication:
- the first cache-backed helper validates the core protocol, snapshot decode/build path, and local cache logic first
- refresh optimization can come later without destabilizing the first real service family
- Snapshot/cache helper methodology on Windows:
- Status: approved on 2026-03-12.
- Context:
- `cgroups -> ebpf` itself is Linux-only, but the higher-level pattern under discussion is broader:
- server-owned snapshot publication
- client-side cache rebuild/lookup helpers
- refresh-oriented local cache use instead of request-per-lookup
- current evidence in this repo already points toward a cross-platform implementation model:
- the approved project plan makes cache-backed helpers part of the library above the generic typed IPC core, not a Linux-only special case
- the current C helper implementation is already written to select POSIX UDS on POSIX and named pipes on Windows/MSYS/Cygwin
- baseline transports on both POSIX and Windows now expose generic variable-message APIs
- evidence:
- cache-backed helper direction approved in Decision #58
- baseline-transport-first scope approved in Decision #63
- directional request/response handshake approved in Decision #64
- current cross-platform helper selection in:
- `src/libnetdata/netipc/src/service/netipc_cgroups_snapshot.c`
- `src/libnetdata/netipc/include/netipc/netipc_named_pipe.h`
- `src/libnetdata/netipc/include/netipc/netipc_uds_seqpacket.h`
- Question:
- should the snapshot/cache helper methodology itself be treated as cross-platform library architecture, even though the first real service example (`cgroups`) is Linux-only?
- Options:
- Option A:
- yes, make snapshot/cache helpers a cross-platform library pattern
- the first concrete schema remains Linux-specific (`cgroups`), but the architecture and test methodology must also work on Windows with future Windows-native services
- Option B:
- no, keep snapshot/cache helpers effectively Linux-specific for now and use only generic request/response services on Windows
- Option C:
- keep the core generic on Windows, but defer any explicit commitment that snapshot/cache helpers are part of the Windows architecture until a real Windows service is implemented
- User decision:
- Option A approved on 2026-03-12
- Decision:
- snapshot/cache helpers are part of the cross-platform library architecture
- the first real schema example (`cgroups`) remains Linux-specific
- Windows must still support and test the same refresh/cache methodology with its own transports and future Windows-native services
- Implication:
- Windows baseline validation is required for the helper methodology itself, even though `cgroups` is not a Windows service
- later Windows-native snapshot services should reuse the same helper architecture instead of inventing a separate model
- Exact identity field for the fake `cgroups` contract:
- Status: approved on 2026-03-12 before schema implementation.
- Context:
- the current producer exports `hash` and the current consumer uses `hash + name` as identity:
- producer writes `ptr->hash = simple_hash(ptr->name)` in `/home/costa/src/netdata/netdata/src/collectors/cgroups.plugin/cgroup-discovery.c`
- consumer keys on `ect->hash == ptr->hash && !strcmp(ect->name, ptr->name)` in `/home/costa/src/netdata/netdata/src/collectors/ebpf.plugin/ebpf_cgroup.c`
- user requirement is that the fake contract should later be usable with the real plugins exactly as-is
- Options:
- Option A:
- expose the field as `hash`
- Pros:
- closest to the current producer/consumer reality
- least adaptation later
- best fit for the “exactly as-is” goal
- Cons:
- keeps a somewhat implementation-flavored field name in the public contract
- Implications:
- the fake service mirrors the real identity semantics directly
- Risks:
- if Netdata later wants a stronger identity story than the current hash, this field may need revision
- expose the field as
- Option B:
- expose the field as `stable_id`
- Pros:
- cleaner public abstraction
- Cons:
- immediate mismatch with the current real consumer/producer naming and semantics
- Implications:
- later integration still needs adaptation or renaming logic
- Risks:
- contradicts the “exactly as-is” goal
- expose the field as
- User decision:
- Option A approved on 2026-03-12
- Decision:
- the fake/public `cgroups` contract will expose the identity field as `hash`
- the fake service preserves the current real identity semantics of `hash + name`
- Implication:
- the fake service is aligned with the current Netdata producer/consumer contract where it matters semantically
- later integration should require less adaptation logic
- Derived service-specific size limits for the fake `cgroups` snapshot service:
- Status: derived from current Netdata facts on 2026-03-12; no user decision required.
- Facts:
- the current real producer schema uses:
- `name[256]`
- `path[FILENAME_MAX + 1]`
- evidence:
- the current real producer also defaults `cgroup_root_max` to `1000`
- Derived implications:
- the generic negotiated payload default of `1024` is not enough for the real cleaned `cgroups` snapshot item payload
- the fake `cgroups` snapshot helper must therefore use service-specific defaults instead of the generic protocol default
- Decision:
- the fake `cgroups` snapshot helper layer will use service-specific response sizing defaults derived from the real Netdata contract
- request-side defaults remain small because the first refresh request payload is only the fixed 4-byte request payload
- response-side default batch item count should match the current real `cgroup_root_max` default of `1000`
- Fact: several older items below reflect the earlier benchmark/prototype phase and now contradict the current library design direction.
- Cleanup rule for this TODO:
- keep historical decisions for traceability
- but treat the following items as superseded for current architecture/planning
- Superseded by Decisions #41-#44:
- Decision #5 (`batch only if benchmarks fail the target`)
- Current direction is now batch-first v1 for the advanced-throughput path, with general pipelining deferred if ordered batch is sufficient.
- Superseded in implementation priority by Decisions #41-#44:
- Decision #11 (`strict ping-pong baseline plus secondary pipelined mode`)
- Current direction is:
- keep ping-pong benchmarks
- add ordered-batch benchmarks as the main advanced-throughput validation
- treat general pipelining as a later phase unless measurements prove it is still required
- Superseded in protocol scope by Decisions #40-#44:
- any assumption that the current fixed 64-byte `INCREMENT` frame is the target library protocol
- Current direction is a fixed header plus variable payload envelope, because real requests/responses may carry strings and different payload sizes.
- Superseded in server-dispatch direction by Decisions #39 and #42:
- any implicit session-sticky "one worker owns one client forever" managed-server assumption
- Current direction is request-level batch parallelism, where a single client batch may be split across workers and then reassembled in order.
- User clarification state (2026-03-11):
- Decision `1` selected as `A`: extend the existing Windows scripts to support both `c-native` and `c-msys`.
- Decision `2` selected as `B`: include `c-native <-> c-msys` in the transition smoke matrix in addition to `c-msys <-> rust-native/go-native`.
- Decision `3` selected as full cross-implementation coverage: benchmark all directed combinations, while keeping each individual run capped at 5 seconds.
- Decision `7` selected as `B`: use `rust-native -> rust-native` to find the Windows SHM spin candidate knee, then validate the shortlist on representative transition pairs.
- Go transport must be pure Go without cgo.
- Source: user decision "The Go implementation must not need CGO."
- Constraint: compatibility with `go.d.plugin` pure-Go requirement.
- Implementation direction for this phase: use pure-Go shared-memory sequencing and make C/Rust waits semaphore-optional for interoperability with pure-Go peers.
- Current pure-Go polling implementation performance is a blocker and unusable for target plugin IPC.
- Source: user feedback "1k/s is a blocker. This is unusable."
- Implication: v1 pure-Go transport strategy must change; current polling approach cannot be accepted.
- POSIX v1 baseline profile is `UDS_SEQPACKET` for all languages, with optional higher-speed profiles negotiated when both peers support them.
- Source: user decision "ok, 1A".
- Context: measured baseline methods already achieve ~260k-330k req/s class, while current pure-Go polling is blocked at ~1.2k req/s.
- Implication: implement capability negotiation and method/profile selection before request-response starts.
- Handshake frame format is fixed binary struct (v1).
- Source: user agreement to recommendation P2:A.
- Rationale: lower complexity/overhead and aligns with typed fixed-binary IPC model.
- Profile selection is server-selected from intersection (v1).
- Source: user agreement to recommendation P3:A.
- Rationale: deterministic one-round negotiation and simpler state machine.
- Next implementation scope is negotiation + `UDS_SEQPACKET` baseline only.
- Source: user agreement to recommendation P4:A.
- Rationale: fastest unblock path with lower integration risk; optional fast profiles follow in next phase.
- Implement now the v1 scope from Decisions #26-#29.
- Source: user agreement "I agree".
- Scope: C `UDS_SEQPACKET` transport with fixed-binary handshake and server-selected profile.
- Proceed to native Rust/Go live `UDS_SEQPACKET` implementations with the same fixed-binary negotiation and validate full C<->Rust<->Go live matrix.
- Source: user direction "proceed".
- Constraint: Go implementation remains pure Go (no cgo).
- Proceed with next hardening phase for UDS baseline:
- Add live UDS benchmark modes for Rust/Go runners with throughput/p50/CPU reporting.
- Add negative negotiation tests (profile mismatch/auth mismatch/malformed handshake expectations).
- Source: user direction "proceed".
- Proceed with optional fast-profile implementation phase:
- Add negotiated `SHM_HYBRID` profile support for native C and Rust UDS live runners.
- Keep pure-Go on the `UDS_SEQPACKET` baseline (no cgo), with negotiation fallback to profile `1` when peers differ.
- Add live interop + benchmark coverage to prove negotiated profile selection behavior and performance impact.
- Source: user direction "proceed".
- Fix rate-limited benchmark clients to avoid busy-loop pacing and switch to adaptive sleep-based pacing.
- Apply to C benchmark harness, Rust live UDS bench client, and Go live bench clients.
- Rerun comparison benchmarks after the pacing fix because previous CPU numbers are polluted by pacing overhead.
- Source: user direction "Please fix the client that busy loops ... run the comparison again."
- Remove pure-Go SHM polling path completely.
- Choice: A (full removal now).
- Scope: remove pure-go-poll command path, remove its dedicated benchmark/testing references, keep UDS baseline interop/bench and C<->Rust negotiated SHM profile coverage.
- Source: user decision "A".
- Start a dedicated new public Netdata repository for this IPC project.
- Source: user direction "create a new public repo in netdata".
- New repository must contain 3 libraries (C, Rust, Go), each supporting POSIX and Windows.
- Source: user direction.
- New repository must enforce complete automated test coverage for library code (target policy: 100%).
- Source: user direction "mandatory ... 100% tested ... 100% coverage on the libraries".
- New repository must include CI benchmark jobs across language role combinations on Linux and Windows.
- Source: user direction "CI benchmarks for all combinations ... 6 linux + 6 windows".
- Netdata Agent integration is deferred until this standalone repo is robust, fully tested, and benchmark-validated.
- Source: user direction "once this is ready ... we will work to integrate it into netdata".
- New public repository identity and ownership: Option B (`netdata/plugin-ipc`).
- Source: user decision "1b".
- Coverage policy for "100% tested": Option A (100% line + 100% branch for library source files only; examples/bench binaries excluded).
- Source: user decision "2a".
- Benchmark CI execution model: Option B (all benchmarks run on GitHub-hosted cloud VMs).
- Source: user decision "3b".
- User intent: evaluate benchmark behavior on actual cloud VMs.
- Windows baseline transport: Option A (Named Pipes) for the native Windows path.
- Source: user decision "4a".
- Historical note: this originally assumed one native Windows implementation only; Decision #62 supersedes that assumption for the current transition period.
- Create new repository at `~/src/plugin-ipc.git` and port this IPC project there.
- Source: user direction "creare the repo in ~/src/plugin-ipc.git and port everything there."
- Execution intent: move project source, tests, docs, and tooling baseline into the new repository as the working root for next phases.
- Public project repository path/name is confirmed as `netdata/plugin-ipc`.
- Source: user clarification "The project will be netdata/plugin-ipc".
- Note: this matches the earlier repository identity decision and should be treated as fixed.
- Runtime auth source for plugin IPC should use the `NETDATA_INVOCATION_ID` environment variable carrying the per-agent UUID/session identifier.
- Source: user clarification "The auth is just an env variable with a UUID, I think NETDATA_INVOCATION_ID".
- Status: needs verification against `~/src/netdata-ktsaou.git/` before implementation is frozen.
- "Good-enough" acceptance for this standalone project requires reviewing benchmark results before freezing for Netdata integration.
- Source: user clarification "We will need to see benchmark results for good-enough acceptance".
- Implication: benchmark evidence is part of the final go/no-go gate, not just a CI formality.
- Library scope must stay integration-agnostic: the IPC library API accepts the auth UUID/token value from the caller, and does not read environment variables itself.
- Source: user clarification "The API of library should just accept it. Focus on the library itself, not the integration".
- Implication:
- `NETDATA_INVOCATION_ID` lookup/wiring is deferred to Netdata-side integration code, while `plugin-ipc` defines only the parameterized auth contract.
- Repository layout needs a redesign before the project grows into 3 libraries x 2 platforms.
- Source: user feedback "given what we want (a library with 6 implementations), propose the proper file structure. I find the current one a total mess."
- Status: resolved; repository layout decisions recorded below.
- Primary repository-layout criterion is simplicity of eventual integration into the Netdata monorepo.
- Source: user clarification "my primary criteria for the organization is simplicity during integration with netdata."
- Implication: standalone repository aesthetics are secondary to minimizing future integration churn.
- Language deliverables must map directly to Netdata integration targets:
- C implementation must map cleanly into `libnetdata`
- Go implementation must remain a normal Go package
- Rust implementation must remain a normal Cargo package
- Source: user clarification "Somehow, the C library must be in libnetdata, the go version must be a go pkg and the rust version must be a cargo package."
- Build-system preference is `CMake` as the top-level orchestrator, while preserving native package manifests for Rust and Go.
- Source: user clarification "ideally all makefiles should be cmake, so that we can keep the option of moving the source into netdata monorepo too."
- Implication:
- `Cargo.toml` and `go.mod` remain canonical for their languages; `CMake` orchestrates mixed-language build, tests, fixtures, and benchmarks.
- Approved standalone repository topology should mirror Netdata's destination structure now:
- `src/libnetdata/netipc/` for the C implementation
- `src/go/pkg/netipc/` for the Go package
- `src/crates/netipc/` for the Rust crate
- shared repo-level areas for docs/spec, tests, and benchmarks
- Source: user agreement to the integration-first recommendation.
- Rationale: avoids a second structural rewrite when moving code into the Netdata monorepo.
- Reusable library code must stay separated from demos, interop fixtures, and benchmark drivers.
- Source: user agreement to the proposed organization.
- Implication:
- product code lives under `src/libnetdata/netipc/`, `src/go/pkg/netipc/`, and `src/crates/netipc/`
- helper apps and conformance tools move under repo-level `tests/fixtures/` and `bench/drivers/`
- Rationale: keeps the library surfaces clean and package-shaped for Netdata integration.
- Root `CMakeLists.txt` should be the single repo entry point for cross-language orchestration.
- Source: user agreement to the proposed build-system recommendation.
- Implication:
- C code builds directly under `CMake`
- Rust is imported as a Cargo package/crate
- Go is built as a Go module/package tree
- Rationale: matches Netdata's current mixed-language build model.
- Obsolete legacy prototype paths from the pre-refactor tree should be deleted now.
- Source: user decision "cleanup yes".
- Scope:
- remove legacy generated files under `interop/`
- remove the old unused root `include/` directory
- remove empty obsolete `interop/` directories if they become empty after cleanup
- Rationale: the approved layout is already validated, and keeping stale prototype leftovers would continue to make the tree look partially migrated.
- Status: completed; the obsolete `interop/` and root `include/` paths have been removed.
- Stale top-level binary artifacts from the prototype build flow should be removed, and cleanup commands should keep the repository root artifact-free.
- Source: user report "I see binary artifacts at the root".
- Evidence:
- top-level files currently present:
- the deleted prototype benchmark binary
- `libnetipc.a`
- `netipc-codec-c`
- the deleted SHM client demo binary
- the deleted SHM server demo binary
- the deleted UDS client demo binary
- the deleted UDS server demo binary
- current documented build outputs already point to `build/bin/` and `build/lib/`, not the repository root.
- Rationale: these are stale leftovers from the earlier prototype layout and should not remain visible after the CMake refactor.
- Status: completed.
- Follow-up fix:
- the root `Makefile` wrapper was adjusted to forward only explicit user-requested targets to CMake.
- this avoids an accidental attempt by `make` to rebuild `Makefile` itself through the old catch-all `%` rule.
- `SHM_HYBRID` reusable-library support is required for both C and Rust.
- Source: user clarification "shm-hybrid needs to be implemented for C and rust, not just rust."
- Fact:
- C already has reusable `SHM_HYBRID` support in `src/libnetdata/netipc/`.
- Rust now has reusable `SHM_HYBRID` support in `src/crates/netipc/`.
- Implication:
- the reusable-library gap for direct `SHM_HYBRID` is closed for C and Rust.
- the remaining Rust gap is negotiated UDS/bench helper deduplication, not direct SHM library support.
- Go remains `UDS_SEQPACKET`-only in this phase.
- Windows implementation should be developed against a real Windows machine, and the repo should be pushed to `netdata/plugin-ipc` before that work starts.
- Source: user proposal to push the repo and provide Windows access over SSH.
- User decision:
- approved push to `netdata/plugin-ipc`
- approved use of `ssh win11`, starting from `msys2`, for Windows development and validation
- Fact:
- this repo currently has no configured `git remote`.
- Rust Windows transport is still a placeholder in `src/crates/netipc/src/transport/windows.rs`.
- Go Windows transport is still a placeholder package in `src/go/pkg/netipc/transport/windows/`.
- the C library now has an initial Windows Named Pipe transport under `src/libnetdata/netipc/src/transport/windows/`.
- Implication:
- cross-compilation alone is not enough for confidence here; Named Pipe behavior, timeouts, permissions, and benchmark results need real runtime validation on Windows.
- pushing the repo first will make the Windows work easier to sync, review, and validate across machines.
- Windows MSYS2 environment should use `MINGW64`, with build packages installed via `pacman`, while Rust should be installed separately via `rustup`.
- Source: user asked which MSYS2 packages to install and then instructed to ssh and run the install.
- User decision:
- approved installation of the recommended MSYS2 package set on `win11`
- clarified that `ssh win11` lands in `/home/costa` under MSYS2, with repositories available under `/home/costa/src/`
- clarified that the remaining Windows prerequisites are now installed
- Package set: `mingw-w64-x86_64-toolchain`, `mingw-w64-x86_64-cmake`, `mingw-w64-x86_64-ninja`, `mingw-w64-x86_64-pkgconf`
- Implication:
- Windows builds should run under `MINGW64` paths/toolchain, not the broken plain-MSYS `/bin/cmake`.
- Windows support must cover two build/runtime paths during the Netdata transition:
- native Windows builds under `MINGW64`
- MSYS2 POSIX-emulation builds on Windows, because current Netdata still runs there
- Source: user clarification "the windows library needs to also compile under msys2 (posix emulation), because currently netdata runs with posix emulation and we port it to mingw64 (not done yet)"
- Implication:
- the repository cannot treat "Windows" as one build target only
- native Windows transport work still targets Named Pipes
- the POSIX transport/library paths also need to compile on Windows under MSYS2 while Netdata remains on the emulation runtime
- Risk:
- some transport assumptions are no longer simply "POSIX vs Windows"; build-system and source guards need to distinguish Linux/macOS/FreeBSD, MSYS2-on-Windows, and native Windows carefully
- Before continuing Windows implementation work, keep the GitHub repo in sync with the latest recorded Windows findings so Linux and Windows work starts from the same visible baseline.
- Source: user proposal "make sure the repo is synced to github and then I will start you on windows."
- User decision: approved commit/push of the updated TODO state before Windows work starts (`1a`).
- Fact:
- local `HEAD` and `origin/main` currently point to the same commit `1917f75`
- the only current local modification is `TODO-plugin-ipc.md`
- Implication:
- to make the repo fully synced, either the updated TODO must be committed and pushed, or the Windows analysis notes remain local only
- Pause local Windows transport optimization work while another assistant provides a fast-path attempt.
- Source: user direction "Wait. Don't do that. the other assistant is working to provide a fast path."
- Fact:
- current pushed native Windows C named-pipe transport was built and smoke-tested successfully on `win11`
- reproduced max-throughput benchmark on `win11` with the current fixture: `mode=c-npipe`, `duration_sec=5`, `responses=79024`, `throughput_rps=15804.65`, `p50_us=43.30`, `p95_us=105.60`, `p99_us=180.30`, `client_cpu_cores=0.428`
- Implication:
- do not continue optimizing or redesigning the Windows path in this session until the alternate fast-path proposal is available
- Benchmarking must use only the shipped library implementations; the deleted prototype benchmark binary should be replaced and then deleted.
- Source: user decision "proceed. All A" for Decision 1.
- Implication:
- private benchmark transport implementations are no longer authoritative
- benchmark orchestration code may remain, but transport behavior must live only in the library
- The authoritative benchmark transport set must include only real library transports:
- POSIX: `uds-seqpacket`, `shm-hybrid`
- Windows: `named-pipe`
- Source: user decision "proceed. All A" for Decision 2.
- Implication:
- `stream`, `dgram`, `shm-spin`, and `shm-sem` must be removed from benchmark acceptance/reporting paths
- There must be no separate spin benchmark/product variant; keep only one SHM product path, `shm-hybrid`.
- Source: user decision "proceed. All A" for Decision 3.
- Implication:
- any spin behavior that remains is only an internal implementation detail of `shm-hybrid`
- `shm-spin` must be removed as a named benchmark transport
- CPU accounting must stay out of the library itself; benchmark/helper executables may collect their own CPU usage internally and report it on exit, instead of depending on external scripts.
- Source: user clarification "The library should not measure its cpu. But the implementations using the library could by theselves open proc and do the work, on exit, without relying on external scripts."
- Implication:
- library APIs remain transport-only
- benchmark/live helper binaries may own Linux `/proc` sampling and process-lifecycle accounting
- Constraint discovered during refactor:
- script-side `/proc/<pid>` sampling is unreliable for UDS server benchmarks because the server may exit on client disconnect before the script reads `/proc`
- helper-owned final CPU reporting requires a clean benchmark shutdown/reporting path in the helper executable
- Benchmark result ownership after removing the deleted prototype benchmark binary: Option A.
- Source: user decision "i agree" on the proposed option set.
- Decision:
- move benchmark orchestration into the helper executables themselves
- each helper gets a benchmark command that starts its own server child, runs the client loop, measures child/self CPU, and prints one final result row
- Implication:
- shell scripts become thin matrix orchestrators only
- benchmark correctness no longer depends on shell timing around server exit
- Scope of the first benchmark-orchestrator conversion: Option A.
- Source: user decision "i agree" on the proposed option set.
- Decision:
- convert all current authoritative POSIX benchmark paths together:
- `uds-seqpacket` for C, Rust, Go
- `shm-hybrid` for C
- keep the same model reserved for later Windows `named-pipe` conversion
- Implication:
- Linux benchmark reporting becomes consistent across the active library-backed transports
- Linux UDS benchmark coverage must include all C/Rust/Go client-server combinations in both directions, at all three benchmark rates:
  `max`, `100k/s`, `10k/s`
  - Source: user requirement "For each of the implementations we need: max, 100k/s, 10ks. All combinations tested/benchmarked: c, rust, go interoperability in both directions".
- Implication:
- same-language rows alone are not enough
- benchmark scripts must expand into a cross-language client/server matrix
- Ping-pong benchmark correctness is mandatory: every helper benchmark must fail on any counter mismatch and verify that the increment chain remains correct through the whole run.
- Source: user requirement "the ping-pong should be testing incrementing a counter and ensuring that the counter has been incremented properly at the end, or the test should fail."
- Fact from current code:
  - helper clients already check `response == counter + 1` on every request in C, Go, and Rust
- Required follow-up:
- the benchmark scripts must treat any helper-reported mismatch as a hard failure
- the final row must remain non-authoritative if the helper reports mismatches or inconsistent request/response counts
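The increment-chain rule above can be illustrated with a small Go sketch (function names are illustrative, not the helpers' real API): the client carries each response forward as the next request, so a single corrupted reply breaks the whole chain and fails the run.

```go
package main

// Sketch of the strict ping-pong correctness rule: send counter, require
// response == counter+1, then chain the response into the next request.
import "fmt"

// runChain drives `iterations` request/response rounds against `server`
// and fails hard on the first mismatch.
func runChain(start uint64, iterations int, server func(uint64) uint64) (uint64, error) {
	counter := start
	for i := 0; i < iterations; i++ {
		resp := server(counter)
		if resp != counter+1 {
			return counter, fmt.Errorf("round %d: got %d, want %d", i, resp, counter+1)
		}
		counter = resp // chain the response into the next request
	}
	return counter, nil
}

func main() {
	increment := func(v uint64) uint64 { return v + 1 }
	final, err := runChain(41, 2, increment)
	if err != nil {
		panic(err)
	}
	// The final counter must equal start + iterations, mirroring the
	// helpers' end-of-run counter-chain verification.
	fmt.Println("final counter:", final) // 43 for start=41, 2 iterations
}
```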
- Cross-language benchmark ownership model for the full C/Rust/Go matrix: Option A.
- Source: user decision "as you recommend".
- Decision:
- add helper commands for benchmark client/server roles separately
- let a thin matrix script compose cross-language client/server pairs
- each helper remains responsible for its own CPU measurement and correctness checks
- Implication:
- benchmark scripts merge helper-produced client/server rows instead of sampling processes externally
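The "thin matrix orchestrator" idea can be sketched as follows (illustrative only; the real orchestrators are shell scripts): the script's sole job is to expand every directed client/server language pair at every rate, while the helper binaries do the measuring and correctness checking.

```go
package main

// Sketch of a thin matrix orchestrator: expand rates x clients x servers
// into one row label per benchmark invocation.
import "fmt"

// matrixRows returns one label per directed client/server pair per rate.
func matrixRows(langs, rates []string) []string {
	var rows []string
	for _, rate := range rates {
		for _, client := range langs {
			for _, server := range langs {
				rows = append(rows, fmt.Sprintf("rate=%s client=%s server=%s", rate, client, server))
			}
		}
	}
	return rows
}

func main() {
	rows := matrixRows([]string{"c", "rust", "go"}, []string{"max", "100k/s", "10k/s"})
	for _, r := range rows {
		// A real script would launch the matching helper bench commands
		// here and merge their result rows.
		fmt.Println(r)
	}
	fmt.Println("total rows:", len(rows)) // 3 rates x 9 directed pairs = 27
}
```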
- Remove the remaining legacy C UDS demo dependency from negotiated profile-2 testing by teaching `netipc-live-c` the same UDS profile override knobs already exposed by the old C demo binaries.
- Repository cleanup rule (clarified by Costa):
- This repo is library-only in purpose, but it keeps everything related to the library: source, documentation, unit/integration/stress tests, build systems, benchmarks, and scripts that exercise/validate the library.
- File retention criterion:
- Keep any file that is part of library source code.
- Keep any file that is part of library documentation.
- Keep any file that is part of unit / stress / integration tests for the library.
- Keep any file that is part of building / compiling / packaging the library.
- Keep any file that is part of benchmarking the library.
- Delete anything else.
- Mixed files should be cleaned so that only library-related content remains.
- Implication: validation and benchmark assets stay if they exercise the real library; obsolete prototype-only binaries, demo paths, and unrelated helper code should be removed.
- Cleanup execution scope for the current pass
- Cleanup results from the current pass
  - Deleted the obsolete private prototype benchmark source.
  - Deleted obsolete C demo sources:
    - the deleted SHM server demo source
    - the deleted SHM client demo source
    - the deleted UDS server demo source
    - the deleted UDS client demo source
  - Switched remaining SHM interop coverage to `netipc-live-c`, so the deleted demo files are no longer needed by tests or build targets.
  - Removed demo targets from `src/libnetdata/netipc/CMakeLists.txt`.
  - Cleaned `README.md` and `.gitignore` so they no longer advertise obsolete demo/prototype artifacts.
  - Full Linux validation after cleanup passed:
    - `./tests/run-interop.sh`
    - `./tests/run-live-interop.sh`
    - `./tests/run-live-uds-interop.sh`
    - `./tests/run-uds-seqpacket.sh`
    - `./tests/run-uds-negotiation-negative.sh`
    - `./tests/run-live-shm-bench.sh`
    - `./tests/run-live-uds-bench.sh`
    - `./tests/run-negotiated-profile-bench.sh`
- Delete the obsolete private prototype benchmark implementation and stop documenting it.
- Delete obsolete C demo binaries/sources once their remaining SHM interop use is switched to `netipc-live-c`.
- Keep fixture crates/modules and helper binaries that exercise the real library (`tests/fixtures/*`, `bench/drivers/go`, `bench/drivers/rust`).
- Keep benchmark and validation scripts in `tests/` because they benchmark/test the library.
- Keep `Makefile` because it is a build convenience wrapper around CMake and therefore library-build related.
- Remove generated artifacts and stale local binaries from the working tree (`build/`, Rust `target/` dirs, the generated Go helper binary).
- Repository cleanup request: remove everything not related to the library we are building.
- Source: user request "cleanup everything that is not related to the library we are building."
- User clarification:
- delete anything not related to the library: sources, binaries, scripts, documentation, anything
- validation harnesses, helper binaries, benchmark drivers, and unrelated docs should not stay just because they were convenient during prototyping
- Current status:
  - taking this literally means converting the repo from `library + validation` into `library-only`
  - active validation currently still depends on helper binaries, fixtures, and benchmark/test scripts
- Pending decision:
- whether any repo-local validation/documentation is still considered part of the deliverable, or whether the repo should contain only the publishable library packages and their native package/build metadata
- Source: user approval "yes, please do this right" after the remaining-cleanup note.
- Decision:
  - extend `netipc-live-c` UDS commands to accept optional `supported_profiles`, `preferred_profiles`, and `auth_token` overrides
  - preserve the old positional convention for override values, including the legacy one-shot `iterations=1` placeholder used by the old demo binaries
- Implication:
- negotiated C<->Rust SHM tests and UDS negative tests can move fully onto the live helper path
- the old C UDS demo binaries stop being required for active validation
- Completed:
  - removed the private C prototype benchmark transport source
  - removed the prototype benchmark binary build target from the CMake build graph
  - added a thin POSIX C live fixture at `tests/fixtures/c/netipc_live_posix_c.c` with commands: `uds-server-once`, `uds-client-once`, `uds-server-loop`, `uds-client-bench`, `shm-server-once`, `shm-client-once`, `shm-server-loop`, `shm-client-bench`
  - switched `tests/run-live-uds-bench.sh` to benchmark C UDS through `build/bin/netipc-live-c` instead of the deleted prototype benchmark binary
  - updated `README.md` so the repository no longer advertises the deleted prototype benchmark binary as a primary artifact or benchmark path
  - moved authoritative benchmark ownership into the helper executables:
    - C helper now exposes `uds-bench` and `shm-bench`
    - Go helper now exposes `uds-bench`
    - Rust helper now exposes `uds-bench`
  - benchmark helper executables now own server lifecycle and CPU reporting
  - benchmark shell scripts no longer sample `/proc` or manage benchmark server PIDs directly
- Linux validation run after the refactor:
  - `cmake --build build`
  - `./tests/run-live-uds-bench.sh`
  - manual smoke: `build/bin/netipc-live-c shm-server-once` + `build/bin/netipc-live-c shm-client-once`
  - `./tests/run-live-interop.sh`
  - `./tests/run-live-shm-bench.sh`
- Current outcome:
- authoritative C Linux UDS benchmark data now comes from the public library API path
- the fake standalone transport benchmark path is gone
  - `shm-spin` no longer exists as a benchmark transport path in the repository
  - authoritative Linux benchmark CPU reporting now comes from the helper executables, not the library and not the shell harness
  - `tests/run-live-uds-bench.sh` now runs the full Linux UDS directed C/Rust/Go matrix (9 client/server pairs) at `max`, `100k/s`, and `10k/s`
  - `tests/run-live-uds-interop.sh` now runs the full directed baseline UDS profile-1 matrix and keeps the negotiated C<->Rust profile-2 SHM cases
  - helper benchmarks now fail hard on:
    - any non-OK response status
    - any `response != request + 1` mismatch
    - any `requests != responses` mismatch
    - any final counter-chain mismatch
    - any server handled-count mismatch versus client responses
  - `netipc-live-c` now exposes optional UDS `supported_profiles`, `preferred_profiles`, and `auth_token` overrides for:
    - one-shot commands
    - loop commands
    - bench commands
  - `netipc-live-c` now exposes a repeated-call UDS client mode so the basic seqpacket smoke test can also stay on the live helper path
  - no UDS test script under `tests/*.sh` depends on the old C UDS demo binaries anymore
- Historical prototype benchmark binary notes elsewhere in this TODO are prototype history only.
  - They are no longer authoritative for acceptance after Decisions 65 through 73.
  - Current authoritative Linux acceptance data must come from:
    - `./tests/run-live-uds-bench.sh`
    - `./tests/run-live-shm-bench.sh`
    - helper binaries that call only the shipped library code paths
- Passed:
  - `./tests/run-uds-seqpacket.sh`
  - `./tests/run-live-uds-bench.sh`
  - `./tests/run-live-uds-interop.sh`
  - `./tests/run-live-shm-bench.sh`
  - `./tests/run-live-interop.sh`
  - `./tests/run-uds-negotiation-negative.sh`
- Linux UDS benchmark matrix (27 rows, all helper/library-backed, all strict-correctness checked):
  - `max`:
    - c -> c: ~206.8k req/s, p50 ~4.34us, total CPU ~1.016
    - c -> rust: ~220.7k req/s, p50 ~4.02us, total CPU ~0.986
    - c -> go: ~198.6k req/s, p50 ~4.54us, total CPU ~1.410
    - rust -> c: ~217.1k req/s, p50 ~4.14us, total CPU ~0.996
    - rust -> rust: ~236.6k req/s, p50 ~3.81us, total CPU ~0.975
    - rust -> go: ~232.0k req/s, p50 ~3.63us, total CPU ~1.450
    - go -> c: ~199.9k req/s, p50 ~4.46us, total CPU ~1.392
    - go -> rust: ~222.0k req/s, p50 ~4.18us, total CPU ~1.433
    - go -> go: ~200.4k req/s, p50 ~4.58us, total CPU ~1.907
  - `100k/s`:
    - all 9 directed pairs held ~100k req/s with strict increment validation
    - lowest total CPU in this run: rust -> rust at ~0.448 cores
    - highest total CPU in this run: c -> go at ~1.271 cores
  - `10k/s`:
    - all 9 directed pairs held ~10k req/s with strict increment validation
    - lowest total CPU in this run: rust -> rust at ~0.066 cores
    - highest total CPU in this run: c -> go at ~1.042 cores
- Linux SHM benchmark (C `shm-hybrid`, helper/library-backed):
  - `max`: ~2.95M req/s, p50 ~0.31us, total CPU ~1.982
  - `100k/s`: ~100.0k req/s, p50 ~3.33us, total CPU ~1.145
  - `10k/s`: ~10.0k req/s, p50 ~3.98us, total CPU ~0.996
- Completed:
  - C library sources moved under `src/libnetdata/netipc/`.
  - Go helper and benchmark code moved under `src/go/`, `tests/fixtures/go/`, and `bench/drivers/go/`.
  - Rust helper and benchmark code moved under `src/crates/`, `tests/fixtures/rust/`, and `bench/drivers/rust/`.
  - Root build entry switched to CMake, with helper targets for C, Rust, and Go validation binaries.
  - Test scripts updated to consume artifacts from `build/bin/`.
  - README updated to describe the approved layout and build flow.
  - Obsolete prototype leftovers removed from the repository root (`interop/`, `include/`).
  - Rust crate now contains:
    - reusable frame/protocol encode/decode API
    - first reusable POSIX `UDS_SEQPACKET` client/server API
    - reusable POSIX `SHM_HYBRID` client/server API aligned with the C library state machine
    - negotiated UDS profile-2 switch that reuses the crate `SHM_HYBRID` API instead of helper-only code
    - crate-level unit tests for protocol and auth/roundtrip transport behavior
  - Go package now contains:
    - reusable frame/protocol encode/decode API
    - first reusable POSIX `UDS_SEQPACKET` client/server API
    - package-level unit tests for protocol and auth/roundtrip transport behavior
  - Schema codec fixtures now consume the reusable Rust crate and Go package instead of embedding private protocol copies.
  - Rust `netipc_live_rs` fixture now consumes the reusable crate `SHM_HYBRID` API instead of embedding private mmap/semaphore code.
  - Rust `netipc_live_uds_rs` helper now consumes the reusable crate UDS transport, including negotiated `SHM_HYBRID`, instead of embedding a separate transport implementation.
  - Go `netipc-live-go` helper now consumes the reusable Go package for normal UDS server/client/bench flows instead of embedding a separate transport implementation.
  - Root `CMakeLists.txt` helper targets now depend on library source trees so fixture/helper rebuilds track library changes.
  - C library now contains an initial Windows Named Pipe transport and a Windows live fixture build path for MSYS2 `mingw64`/`ucrt64`.
  - C library now also contains a Windows negotiated `SHM_HYBRID` fast profile backed by shared memory plus named events with bounded spin.
  - C library now also contains a Windows negotiated `SHM_BUSYWAIT` fast profile backed by shared memory plus pure busy-spin for the single-client low-latency case.
  - Windows fast-path design is intentionally limited to Win32 primitives that can be ported to Rust and pure Go without cgo.
- Validated:
  - schema interop: C <-> Rust <-> Go
  - live SHM interop: C <-> Rust
  - live UDS interop: C <-> Rust <-> Go
  - Windows C Named Pipe smoke under MSYS2 `mingw64`: `./tests/run-live-npipe-smoke.sh`
  - Windows C profile comparison under MSYS2 `mingw64`: `./tests/run-live-win-profile-bench.sh`
    - latest local result on `win11`, 5s, 1 client:
      - `c-npipe`: ~15.2k req/s, p50 ~49.0us
      - `c-shm-hybrid` (default spin 1024): ~84.8k req/s, p50 ~3.7us
      - `c-shm-busywait`: ~84.5k req/s, p50 ~3.7us, lower p99 tail than `SHM_HYBRID`
  - UDS negative negotiation coverage
  - UDS and negotiated-profile benchmark scripts
  - `cargo test -p netipc`
  - `go test ./...` under `src/go`
- Still incomplete:
  - Go package currently implements the reusable POSIX `UDS_SEQPACKET` path only.
  - Go negative-test helper logic for malformed/raw negotiation frames remains local to `bench/drivers/go`; this is fixture-specific coverage, not reusable API.
  - Rust and Go Windows transports remain placeholders.
  - Windows validation is still limited to the C Named Pipe/`SHM_HYBRID`/`SHM_BUSYWAIT` path; cross-language Windows interop and benchmark coverage are still pending.
  - TODO/history text still contains historical references to the old prototype paths and should be cleaned once the structure is frozen.
- Fact: `NETDATA_INVOCATION_ID` is already read by the Rust plugin runtime in Netdata: `~/src/netdata-ktsaou.git/src/crates/netdata-plugin/rt/src/netdata_env.rs`
- Fact: Netdata logging initialization checks `NETDATA_INVOCATION_ID`, falls back to systemd `INVOCATION_ID`, and then exports `NETDATA_INVOCATION_ID` back into the environment: `~/src/netdata-ktsaou.git/src/libnetdata/log/nd_log-init.c`
- Fact: Netdata plugin documentation already describes `NETDATA_INVOCATION_ID` as the unique invocation UUID for the current Netdata session: `~/src/netdata-ktsaou.git/src/plugins.d/README.md`
- Fact: Several existing shell/plugin entrypoints pass through `NETDATA_INVOCATION_ID`, so using it for IPC auth aligns with existing Agent behavior.
- Fact: User decision supersedes direct environment lookup inside the library itself; the library should remain integration-agnostic and accept auth as an API input.
- Implication: the IPC auth contract should be modeled as caller-provided UUID/token input, with the Netdata integration later responsible for supplying `NETDATA_INVOCATION_ID`.
- Risk: test/demo code may still use environment variables for convenience, but the reusable library surface must not depend on that mechanism.
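A minimal Go sketch of this auth contract (function names are hypothetical): the library only compares caller-provided tokens, and only the integration layer ever touches the environment.

```go
package main

// Sketch of the caller-provided auth contract: the library takes the token
// as an explicit argument; the Netdata integration layer is the one that
// decides to source it from NETDATA_INVOCATION_ID.
import (
	"errors"
	"fmt"
	"os"
)

// serverAuth compares caller-provided tokens; the library itself never
// reads the environment.
func serverAuth(expected, presented string) error {
	if expected == "" || expected != presented {
		return errors.New("auth mismatch") // maps to EACCES in the C transport
	}
	return nil
}

func main() {
	// The integration layer (not the library) supplies the invocation UUID.
	token := os.Getenv("NETDATA_INVOCATION_ID")
	if token == "" {
		token = "test-token" // fixtures may fall back to a fixed token
	}
	fmt.Println("handshake ok:", serverAuth(token, token) == nil)
}
```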
- Create benchmark harness structure for C/Rust/Go with shared test scenarios.
- Implement primary benchmark scenario first: client-server counter loop.
- Client sends number N; server increments and returns N+1.
- Client verifies increment by exactly +1 and sends returned value back.
- Run duration: 5 seconds per test.
- Capture: requests sent, responses received, latest counter value, latency per request-response pair, CPU utilization (client and server).
- Benchmark all transport candidates selected in Decisions #1 and #7 using the same scenario and persistent-session model (Decision #8).
- For shared-memory transport, benchmark both synchronization approaches from Decision #10.
- Run the strict ping-pong baseline and the current advanced-throughput benchmark mode.
- Historical note: earlier plan text assumed general pipelined mode as the secondary path.
- Current direction after Decisions #41-#44 is ordered-batch benchmarking first, with general pipelining deferred unless measurements prove it is still required.
- Collect CPU metrics using both measurement approaches from Decision #12.
- Produce decision report with a primary score for single-thread ping-pong and a secondary score for multi-client scaling.
- Freeze transport + framing + custom binary schema for v1.
- Implement minimal typed RPC surface in all 3 language libraries.
- Implement auth handshake using shared SALT/session UUID.
- Add integration test: C client -> Rust server, Rust client -> Go server, Go client -> C server.
- Add stress tests to validate stability and tail latency.
- Define and freeze a minimal v1 typed schema for one RPC method (`increment`) with a fixed-size binary frame and explicit little-endian encode/decode.
- Add a C API module for POSIX shared-memory hybrid transport with default spin window 20, targeting single-thread ping-pong first.
- Add simple C server/client examples using the new C API and schema.
- Add Rust and Go schema codecs and interoperability fixtures.
- Add an automated interop test runner to validate:
- C client framing <-> Rust server framing.
- Rust client framing <-> Go server framing.
- Go client framing <-> C server framing.
- Document baseline and current limitations explicitly.
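The fixed-size little-endian framing in the plan above can be sketched as follows (the field offsets here are illustrative; the real layout lives in the C schema headers):

```go
package main

// Sketch of a fixed 64-byte little-endian frame in the spirit of the v1
// `increment` schema. Field layout is illustrative, not the frozen one.
import (
	"encoding/binary"
	"fmt"
)

const frameSize = 64

// encodeIncrement packs a method id and counter into a zero-padded
// 64-byte frame with explicit little-endian fields.
func encodeIncrement(method uint32, counter uint64) [frameSize]byte {
	var f [frameSize]byte
	binary.LittleEndian.PutUint32(f[0:4], method)
	binary.LittleEndian.PutUint64(f[8:16], counter)
	return f
}

// decodeIncrement reverses encodeIncrement.
func decodeIncrement(f [frameSize]byte) (method uint32, counter uint64) {
	return binary.LittleEndian.Uint32(f[0:4]), binary.LittleEndian.Uint64(f[8:16])
}

func main() {
	f := encodeIncrement(1, 41)
	m, c := decodeIncrement(f)
	fmt.Printf("len=%d method=%d counter=%d\n", len(f), m, c)
}
```

Explicit `binary.LittleEndian` calls (rather than struct casts) are what makes the frame portable across the C, Rust, and Go codecs regardless of host endianness.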
- Add a POSIX C `UDS_SEQPACKET` transport module as the v1 baseline profile.
- Add fixed-binary handshake frames for capability negotiation and server-selected profile selection.
- Keep handshake bitmask-compatible with optional future profiles.
- Add C server/client demos for persistent session request-response.
- Add an automated test for handshake correctness plus increment loop behavior.
- Keep SHM transport and live interop tests intact to avoid regressions in parallel work.
- Extend the C UDS transport profile mask to implement the `NETIPC_PROFILE_SHM_HYBRID` negotiation candidate.
- Keep the handshake over `UDS_SEQPACKET`, then switch the request/response data-plane to shared-memory hybrid when profile 2 is selected.
- Extend the Rust native UDS live runner to support the same negotiated profile switch.
- Keep the Go native UDS live runner baseline-only (profile 1) so C/Rust <-> Go automatically fall back to UDS.
- Add/extend live interoperability tests to validate:
  - C <-> Rust prefers `SHM_HYBRID` when both advertise it.
  - Any path involving Go negotiates `UDS_SEQPACKET`.
- Add/extend benchmark scripts to compare negotiated profile 1 vs profile 2 in C/Rust paths.
- Replace busy/yield rate pacing loops with adaptive sleep pacing in:
  - the deleted prototype benchmark source
  - `interop/rust/src/bin/netipc_live_uds_rs.rs`
  - `interop/go-live/main.go`
- Keep full-speed (`target_rps=0`) behavior unchanged.
- Rebuild and rerun benchmark comparisons:
  - `tests/run-live-uds-bench.sh`
  - `tests/run-negotiated-profile-bench.sh`
- Update TODO and results tables with post-fix metrics.
- Remove pure-Go SHM polling commands from the `interop/go-live/main.go` command surface.
- Remove dedicated pure-go-poll benchmark script usage (`tests/run-poll-vs-sem-bench.sh`).
- Remove the SHM live interop script's dependency on Go polling commands (`tests/run-live-interop.sh`) by narrowing its scope to C <-> Rust only.
- Update README and TODO to reflect that the Go interop path is UDS-only baseline in this phase.
- Rebuild and rerun the remaining active test matrix to ensure no regressions.
- Status (2026-03-08): completed. Validation reruns passed:
  - `make`
  - `./tests/run-live-interop.sh`
  - `./tests/run-live-uds-interop.sh`
  - `./tests/run-live-uds-bench.sh`
  - `./tests/run-negotiated-profile-bench.sh`
  - `./tests/run-interop.sh`
  - `./tests/run-uds-negotiation-negative.sh`
- Initialize new public repository skeleton (C/Rust/Go library directories, shared protocol spec, CI layout).
- Port the current proven protocol/interop baseline (`UDS_SEQPACKET` + negotiation) into reusable library APIs.
- Implement a Windows transport baseline and align handshake/auth semantics with POSIX.
- Build exhaustive test matrix per language (unit + integration + cross-language interop) with coverage gates.
- Build benchmark matrix automation for Linux and Windows role combinations on GitHub-hosted cloud VMs and publish machine-readable artifacts.
- Add benchmark noise controls for cloud VMs:
- run each benchmark scenario in multiple repetitions (for example 5)
- report median plus spread (p95 of run-to-run throughput/latency)
- capture runner metadata (OS image, vCPU model, memory, kernel/build info)
- compare combinations primarily by relative ranking, not single-run absolute numbers
- Freeze acceptance criteria (coverage gate + benchmark sanity gate) before any Netdata integration work.
- Verify the actual Netdata runtime auth environment contract (`NETDATA_INVOCATION_ID` or a replacement) in the Agent codebase and align the IPC handshake implementation with it.
- Produce benchmark evidence for acceptance review and use that review to decide whether the standalone repo is ready to integrate into Netdata.
- Fact: Current benchmark shared-memory transport is embedded in a benchmark executable and not yet a standalone reusable library.
- Fact: A full production-ready cross-language shared-memory transport API (with lifecycle, reconnect, multi-client fairness, and auth handshake) is larger than this phase.
- Working constraint for this phase:
- Prioritize protocol/schema interoperability and a minimal C transport API suitable for iterative hardening.
- Fact: Top-level currently mixes source artifacts, generated binaries, library code, demos, benchmark executable, and language experiments:
  - top-level binaries: `netipc-codec-c`, demo binaries, `libnetipc.a`, and a prototype benchmark executable
  - source directories: `src/`, `include/`, `interop/`, `tests/`
- Fact: `interop/` is overloaded:
  - `interop/go` is a schema codec tool
  - `interop/go-live` is a transport/live benchmark runner
  - `interop/rust` contains both the codec tool and live runners
- Fact: Current layout does not map cleanly to the intended product boundary of:
- one C library
- one Rust library
- one Go library
- each with POSIX and Windows implementations
- Risk: if we keep the current layout, adding Windows, CI, coverage, public API docs, examples, and protocol evolution will create repeated ad-hoc structure and unclear ownership of code paths.
- Requirement for the redesign:
- separate reusable library code from experiments, examples, benchmarks, interop fixtures, and generated build output
- make transport implementations discoverable by language and platform
- keep one shared protocol/spec area so wire semantics are not duplicated informally
- Fact: Netdata destination layout already has stable language-specific homes:
  - C core/common code under `src/libnetdata/...`
  - Go packages under `src/go/pkg/...`
  - Rust workspace/crates under `src/crates/...`
- Fact: Netdata's top-level `CMakeLists.txt` already:
  - builds `libnetdata`
  - imports Rust crates from `src/crates/...`
  - drives Go build targets rooted at `src/go/...`
- Implication: the standalone `plugin-ipc` repository should mirror these boundaries as closely as possible, otherwise future integration will require a second structural rewrite.
- Done: Phase 1 POSIX C benchmark harness implemented (the deleted prototype benchmark binary).
- Done: Transport candidates implemented for benchmarking:
  - `AF_UNIX`/`SOCK_STREAM`
  - `AF_UNIX`/`SOCK_SEQPACKET`
  - `AF_UNIX`/`SOCK_DGRAM`
  - Shared memory ring with spin synchronization (`shm-spin`)
  - Shared memory ring with blocking semaphore synchronization (`shm-sem`)
  - Shared memory ring with hybrid synchronization (`shm-hybrid`)
- Done: fixed-rate client pacing support (`--target-rps`, per-client) for hypothesis testing.
- Done: Both benchmark modes implemented:
- strict ping-pong mode
- pipelined mode with configurable depth
- Done: Metrics implemented:
- request/response counters
- mismatch detection
- last counter tracking
- latency histogram with p50/p95/p99
- in-process CPU accounting
  - external CPU sampling from `/proc`
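The p50/p95/p99 extraction behind the latency metrics can be sketched as a cumulative walk over histogram buckets (the bucket layout here is illustrative, not the benchmark's real one):

```go
package main

// Sketch of percentile extraction from a latency histogram: walk the
// cumulative counts until the requested quantile of samples is covered.
import "fmt"

// percentile returns the upper bound (in us) of the first bucket whose
// cumulative count reaches fraction q of the total samples.
// boundsUs[i] is the upper bound of counts[i].
func percentile(boundsUs []float64, counts []uint64, q float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	target := uint64(q * float64(total))
	var seen uint64
	for i, c := range counts {
		seen += c
		if seen >= target {
			return boundsUs[i]
		}
	}
	return boundsUs[len(boundsUs)-1]
}

func main() {
	bounds := []float64{1, 2, 4, 8, 16}     // bucket upper bounds in us
	counts := []uint64{10, 60, 25, 4, 1}    // 100 samples total
	fmt.Println("p50:", percentile(bounds, counts, 0.50)) // 2
	fmt.Println("p95:", percentile(bounds, counts, 0.95)) // 4
	fmt.Println("p99:", percentile(bounds, counts, 0.99)) // 8
}
```

Histogram buckets trade a small quantization error for O(1) recording cost per request, which matters at the multi-hundred-k req/s rates reported below.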
README.md. - Done (historical): POSIX baseline was initially frozen to
shm-hybridwith default spin window20(NETIPC_SHM_DEFAULT_SPIN_TRIES), later superseded by Decision #26 (UDS_SEQPACKETbaseline profile). - Done (historical benchmark scaffold): first typed v1 schema implemented (
increment) with fixed 64-byte frame and explicit little-endian encode/decode.- Current architecture direction after Decisions #40-#44 is to replace this with a fixed-header plus variable-payload protocol for the real library API.
include/netipc_schema.hsrc/netipc_schema.c
- Done: First C transport API implemented for shared-memory hybrid path:
  - `include/netipc_shm_hybrid.h`
  - `src/netipc_shm_hybrid.c`
- Done: C tools/examples added:
  - `netipc-codec-c` (`src/netipc_codec_tool.c`)
  - the deleted SHM server demo binary (the deleted SHM server demo source)
  - the deleted SHM client demo binary (the deleted SHM client demo source)
- Done: Cross-language schema interop tools added:
  - Rust codec tool (`interop/rust`)
  - Go codec tool (`interop/go`)
- Done: Automated interop validation script added:
  - `tests/run-interop.sh`
  - Validates C->Rust->C, Rust->Go->Rust, Go->C->Go for typed `increment` schema frames.
- Done: Build system updated (`Makefile`) to produce:
  - `libnetipc.a`
  - `netipc-codec-c`
  - the deleted SHM server demo binary
  - the deleted SHM client demo binary
- Done: Stale endpoint recovery added to C shared-memory transport:
- Region ownership PID tracking.
- Safe takeover rules:
    - owner alive -> `EADDRINUSE`
    - owner dead/invalid region -> unlink the stale endpoint and recreate.
- Done: Live transport interop runners added for this phase:
  - Rust live runner: `interop/rust/src/bin/netipc_live_rs.rs`
  - Go live runner: `interop/go-live/main.go`
  - Live interop script: `tests/run-live-interop.sh`
- Done: Rust/Go live runners moved to independent transport implementations (no link dependency on `libnetipc.a`):
  - Rust: native mapping + sequencing + schema encode/decode + semaphore waits/posts (`libc` bindings).
  - Go: native pure-Go UDS seqpacket + schema encode/decode (no cgo).
  - C/Rust waits are semaphore-optional for profile-2 interop when needed.
  - Safety hardening:
    - Rust stale-takeover mapping uses duplicated FDs (no double-close risk).
    - Rust only destroys semaphores after successful semaphore initialization.
- Done: Removed pure-Go SHM polling path after profiling and decision #35:
  - Removed Go SHM polling command handlers from `interop/go-live/main.go`.
  - Removed `tests/run-poll-vs-sem-bench.sh`.
  - Narrowed `tests/run-live-interop.sh` to C <-> Rust SHM coverage only.
- Done: POSIX C `UDS_SEQPACKET` transport module added:
  - `include/netipc_uds_seqpacket.h`
  - `src/netipc_uds_seqpacket.c`
- Done: Fixed-binary profile negotiation added to UDS transport:
- client hello with supported/preferred bitmasks
- server ack with intersection + selected profile + status
- deterministic server-selected profile flow
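The intersection-plus-preference selection can be sketched in Go (profile bit values follow the text, with 1 = `UDS_SEQPACKET` and 2 = `SHM_HYBRID`; the real hello/ack frame layout is not reproduced here):

```go
package main

// Sketch of the deterministic server-selected profile flow: intersect the
// client-supported and server-supported bitmasks, then honor the client's
// preference within the intersection.
import "fmt"

const (
	profileUDS uint32 = 1 << 0 // profile 1: UDS_SEQPACKET baseline
	profileSHM uint32 = 1 << 1 // profile 2: SHM_HYBRID fast path
)

// selectProfile returns the chosen profile bit, or 0 when nothing overlaps
// (the server would then reject the hello with ENOTSUP).
func selectProfile(clientSupported, clientPreferred, serverSupported uint32) uint32 {
	common := clientSupported & serverSupported
	if common == 0 {
		return 0
	}
	pick := func(mask uint32) uint32 {
		// Highest set bit wins, so the fastest common profile is chosen.
		for bit := 31; bit >= 0; bit-- {
			if m := uint32(1) << uint(bit); mask&m != 0 {
				return m
			}
		}
		return 0
	}
	if preferred := common & clientPreferred; preferred != 0 {
		return pick(preferred)
	}
	return pick(common)
}

func main() {
	// C/Rust both advertise SHM_HYBRID -> profile 2 wins.
	fmt.Println(selectProfile(profileUDS|profileSHM, profileSHM, profileUDS|profileSHM)) // 2
	// A Go peer supports UDS only -> automatic fallback to profile 1.
	fmt.Println(selectProfile(profileUDS|profileSHM, profileSHM, profileUDS)) // 1
}
```

This is what makes the Go fallback behavior described later automatic: Go simply does not set the `SHM_HYBRID` bit, so the intersection never contains it.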
- Done: UDS stale endpoint recovery logic added:
  - active listener -> fail with `EADDRINUSE`
  - stale socket path -> unlink and recreate
- Done: C UDS demos added:
- the deleted UDS server demo binary (the deleted UDS server demo source)
- the deleted UDS client demo binary (the deleted UDS client demo source)
- Done: Build updated for UDS artifacts (`Makefile`).
- Done: Automated UDS negotiation test added: `tests/run-uds-seqpacket.sh`
- Done: Native Rust live UDS runner added (no `libnetipc.a` link dependency): `interop/rust/src/bin/netipc_live_uds_rs.rs`
  - Supports:
    - `server-once <run_dir> <service>`
    - `client-once <run_dir> <service> <value>`
    - `server-loop <run_dir> <service> <max_requests|0>`
    - `client-bench <run_dir> <service> <duration_sec> <target_rps>`
- Implements seqpacket + fixed-binary negotiation profile.
- Done: Native pure-Go live UDS commands added to the Go runner (`CGO_ENABLED=0`): `interop/go-live/main.go`
  - Supports:
    - `uds-server-once <run_dir> <service>`
    - `uds-client-once <run_dir> <service> <value>`
    - `uds-server-loop <run_dir> <service> <max_requests|0>`
    - `uds-client-bench <run_dir> <service> <duration_sec> <target_rps>`
    - `uds-client-badhello <run_dir> <service>`
    - `uds-client-rawhello <run_dir> <service> <supported_mask> <preferred_mask> <auth_token>`
- Implements seqpacket + fixed-binary negotiation profile.
- Done: Live cross-language UDS interop test added:
  - `tests/run-live-uds-interop.sh`
  - Validates:
- C client -> Rust server
- Rust client -> Go server
- Go client -> C server
- Done: Live UDS benchmark matrix script added:
  - `tests/run-live-uds-bench.sh`
  - Measures C/Rust/Go UDS paths at max, 100k/s, 10k/s.
- Reports throughput, p50 latency, client CPU, server CPU, total CPU.
- Done: UDS negotiation negative test script added:
  - `tests/run-uds-negotiation-negative.sh`
  - Validates expected failures for:
    - profile mismatch (`ENOTSUP`)
    - auth mismatch (`EACCES`)
    - malformed hello (`EPROTO`)
- Done: C UDS demos extended with optional negotiation inputs:
  - the deleted UDS server demo source: optional `supported_profiles`, `preferred_profiles`, `auth_token`.
  - the deleted UDS client demo source: optional `supported_profiles`, `preferred_profiles`, `auth_token`.
- Done: C UDS transport now implements the negotiated `SHM_HYBRID` data-plane switch (profile 2) in addition to baseline `UDS_SEQPACKET`: `src/netipc_uds_seqpacket.c`
  - Behavior:
    - handshake remains over UDS seqpacket
    - if the negotiated profile is 2, the request/response path switches to the shared-memory hybrid API
    - client-side attach includes a short retry loop to avoid a race with server shared-memory region creation
- Done: Rust native UDS live runner now supports the negotiated `SHM_HYBRID` data-plane switch (profile 2) with env-driven handshake profile masks: `interop/rust/src/bin/netipc_live_uds_rs.rs`
  - Added env controls:
    - `NETIPC_SUPPORTED_PROFILES`
    - `NETIPC_PREFERRED_PROFILES`
    - `NETIPC_AUTH_TOKEN`
  - C/Rust `server-once`, `client-once`, `server-loop`, `client-bench` switch the data-plane based on the negotiated profile.
- Done: Live UDS interop script extended with negotiated profile assertions:
  - `tests/run-live-uds-interop.sh`
  - Added coverage for:
    - C client -> Rust server with profile preference 2 (expects profile 2)
    - Rust client -> C server with profile preference 2 (expects profile 2)
    - baseline/fallback profile checks stay enforced for paths involving Go.
- Done: Added negotiated profile benchmark script:
  - `tests/run-negotiated-profile-bench.sh`
  - Compares profile 1 (`UDS_SEQPACKET`) vs profile 2 (`SHM_HYBRID`) under identical 5s client-bench scenarios.
- Done: Rate-limited benchmark client pacing switched from busy/yield loops to adaptive sleep pacing:
  - the deleted prototype benchmark source (`sleep_until_ns`, fixed-rate schedule progression)
  - `interop/rust/src/bin/netipc_live_uds_rs.rs` (`sleep_until`)
  - `interop/go-live/main.go` (`sleepUntil`)
  - Intent: remove pacing busy-loop CPU pollution from fixed-rate benchmark metrics.
- `make clean && make`: pass.
- `./tests/run-interop.sh`: pass.
  - Schema interop validated for:
    - C -> Rust -> C
    - Rust -> Go -> Rust
    - Go -> C -> Go
- C shared-memory API smoke test:
  - Server: the deleted SHM server demo binary, run with `/tmp netipc-demo 2`
  - Client: the deleted SHM client demo binary, run with `/tmp netipc-demo 50 2`
  - Result: pass (`50->51`, `51->52`).
- Live transport interop test:
  - `./tests/run-live-interop.sh`: pass.
  - Validated real `shm-hybrid` sessions for:
shm-hybridsessions for:- C client -> Rust server
- Rust client -> C server
- Native runner validation:
  - Rust live binary (`netipc_live_rs`) builds without linking to `libnetipc.a`.
  - Go live binary (`netipc-live-go`) builds without linking to `libnetipc.a`.
  - `ldd` check confirms no `libnetipc` dependency in either live binary.
  - Go live binary builds and runs with `CGO_ENABLED=0`.
- Regression validation after safety fixes (`cargo fmt`, `gofmt`, full test rerun): pass.
- Stale endpoint recovery test:
  - Forced a stale endpoint by terminating a server process before cleanup.
  - New server startup recovered the stale endpoint and served the request successfully.
- Historical (pre-removal) focused polling-vs-semaphore benchmark (`./tests/run-poll-vs-sem-bench.sh`): pass.
  - latest rerun after pacing fix:
    - max:
      - semaphore: throughput ~1,753,540 req/s, p50 ~0.54us, total CPU ~1.966
      - pure-go-poll: throughput ~1,535 req/s, p50 ~1054.35us, total CPU ~0.024
    - 100k/s target:
      - semaphore: throughput ~99,999 req/s, p50 ~3.20us, total CPU ~0.418
      - pure-go-poll: throughput ~1,049 req/s, p50 ~1054.89us, total CPU ~0.015
    - 10k/s target:
      - semaphore: throughput ~9,999 req/s, p50 ~3.20us, total CPU ~0.054
      - pure-go-poll: throughput ~1,076 req/s, p50 ~1055.22us, total CPU ~0.018
  - Fact: the current pure-Go polling configuration does not reach 10k/s.
- UDS negotiation + increment test (`./tests/run-uds-seqpacket.sh`): pass.
  - Negotiated profile is `1` (NETIPC_PROFILE_UDS_SEQPACKET) on both client and server.
  - Verified increment loop for two iterations (`41->42`, `42->43`).
- Live UDS cross-language interop test (`./tests/run-live-uds-interop.sh`): pass.
  - C client -> Rust server: pass, profile `1`.
  - Rust client -> Go server: pass, profile `1`.
  - Go client -> C server: pass, profile `1`.
- UDS negotiation negative tests (`./tests/run-uds-negotiation-negative.sh`): pass.
  - Profile mismatch with raw hello mask `2` -> status `95` (ENOTSUP).
  - Auth mismatch (`111` vs `222`) -> status `13` (EACCES).
  - Malformed hello frame -> server rejects with `EPROTO`.
- Live UDS benchmark matrix (`./tests/run-live-uds-bench.sh`): pass.
  - latest rerun:
    - max:
      - c-uds: throughput ~254,999 req/s, p50 ~3.71us, total CPU ~1.010
      - rust-uds: throughput ~268,966 req/s, p50 ~3.59us, total CPU ~1.001
      - go-uds: throughput ~243,775 req/s, p50 ~3.50us, total CPU ~1.950
    - 100k/s target:
      - c-uds: throughput ~99,999 req/s, p50 ~4.35us, total CPU ~0.424
      - rust-uds: throughput ~100,000 req/s, p50 ~3.99us, total CPU ~0.397
      - go-uds: throughput ~99,985 req/s, p50 ~3.51us, total CPU ~0.841
    - 10k/s target:
      - c-uds: throughput ~10,000 req/s, p50 ~4.35us, total CPU ~0.048
      - rust-uds: throughput ~10,000 req/s, p50 ~4.11us, total CPU ~0.048
      - go-uds: throughput ~9,998 req/s, p50 ~4.30us, total CPU ~0.124
- Negotiated profile interop expansion (`./tests/run-live-uds-interop.sh`): pass.
  - Added assertions verify:
    - C<->Rust can negotiate `profile 2` (SHM_HYBRID) when both peers advertise/prefer it.
    - Paths involving Go continue to negotiate the `profile 1` (UDS_SEQPACKET) fallback.
- Negotiated profile benchmark (`./tests/run-negotiated-profile-bench.sh`): pass.
  - Rust server <-> Rust client:
    - max:
      - profile1-uds: throughput ~267,448 req/s, p50 ~3.56us, total CPU ~1.000
      - profile2-shm: throughput ~4,353,061 req/s, p50 ~0.20us, total CPU ~2.323
    - 100k/s target:
      - profile1-uds: throughput ~99,999 req/s, p50 ~4.00us, total CPU ~0.403
      - profile2-shm: throughput ~99,999 req/s, p50 ~0.21us, total CPU ~0.180
    - 10k/s target:
      - profile1-uds: throughput ~10,000 req/s, p50 ~4.20us, total CPU ~0.048
      - profile2-shm: throughput ~10,000 req/s, p50 ~3.40us, total CPU ~0.056
- Regression checks after UDS implementation:
  - `./tests/run-interop.sh`: pass. `./tests/run-live-interop.sh`: pass.
- Regression checks after extracting the reusable Rust `SHM_HYBRID` crate API:
  - `cargo test -p netipc`: pass. `./tests/run-live-interop.sh`: pass, with `netipc_live_rs` now backed by the crate API instead of private SHM code.
- Regression checks after extracting negotiated Rust UDS profile switching into the crate:
  - `cargo test -p netipc`: pass, including negotiated `SHM_HYBRID` roundtrip coverage.
  - `./tests/run-live-uds-interop.sh`: pass, including `profile 2` (SHM_HYBRID) negotiation for `C <-> Rust`.
  - `./tests/run-uds-negotiation-negative.sh`: pass. `./tests/run-negotiated-profile-bench.sh`: pass. `./tests/run-live-uds-bench.sh`: pass.
- Regression checks after refactoring the Go UDS helper onto the reusable package:
  - `go test ./...` under `src/go`: pass. `./tests/run-live-uds-interop.sh`: pass. `./tests/run-uds-negotiation-negative.sh`: pass. `./tests/run-live-uds-bench.sh`: pass.
- Seqpacket baseline benchmark spot check:
  - the deleted prototype benchmark binary, invoked with `--transport seqpacket --mode pingpong --clients 1 --payloads 32 --duration 5`
  - Result: throughput ~265,550 req/s, p50 ~3.46us.
- Prototype transport comparison:
  - `uds-stream` pingpong: ~297k req/s, p50 ~2.94us
  - `uds-seqpacket` pingpong: ~260k req/s, p50 ~3.46us
  - `uds-dgram` pingpong: ~242k req/s, p50 ~3.97us
  - `shm-spin` pingpong: ~2.61M req/s, p50 ~0.34us
  - `shm-sem` pingpong: ~330k req/s, p50 ~2.94us
  - `uds-stream` pipeline depth=16: ~1.01M req/s
  - `uds-seqpacket` pipeline depth=16: ~1.05M req/s
  - `uds-dgram` pipeline depth=16: ~857k req/s
  - `shm-spin` pipeline depth=16: ~6.64M req/s
  - `shm-sem` pipeline depth=16: ~2.08M req/s
- SHM client-scaling comparison:
  - `shm-spin`, 1 client: ~2.77M req/s, p50 ~0.30us, total CPU cores ~1.99
  - `shm-hybrid`, 1 client: ~1.96M req/s, p50 ~0.46us, total CPU cores ~1.99
  - `shm-sem`, 1 client: ~312k req/s, p50 ~2.94us, total CPU cores ~1.02
  - `shm-spin`, 8 clients: ~15.78M req/s, p50 ~0.43us
  - `shm-hybrid`, 8 clients: ~4.84M req/s, p50 ~1.60us
  - `shm-sem`, 8 clients: ~1.78M req/s, p50 ~4.35us
  - Single-thread pingpong winner: `shm-spin`.
- Fixed-rate 100k/s SHM comparison:
  - `shm-spin`: achieved ~99.0k req/s, p50 ~0.37us, client CPU ~0.995 cores, server CPU ~0.998 cores.
  - `shm-hybrid`: achieved ~98.9k req/s, p50 ~0.54us, client CPU ~0.994 cores, server CPU ~0.992 cores.
  - `shm-sem`: achieved ~98.3k req/s, p50 ~2.94us, client CPU ~0.833 cores, server CPU ~0.151 cores.
  - At 100k/s, `shm-hybrid` CPU is not significantly lower than `shm-spin` in this implementation.
- Fixed-rate 10k/s SHM comparison:
  - `shm-spin`: achieved ~9.99k req/s, p50 ~0.34us, client CPU ~0.995 cores, server CPU ~0.998 cores.
  - `shm-hybrid`: achieved ~9.99k req/s, p50 ~1.86us, client CPU ~0.994 cores, server CPU ~0.111 cores.
  - `shm-sem`: achieved ~9.98k req/s, p50 ~3.20us, client CPU ~0.975 cores, server CPU ~0.017 cores.
  - At 10k/s, `shm-hybrid` does block/wait more on the server side (large server CPU drop vs `shm-spin`), while client CPU remains high due to the rate-pacing strategy.
- `SHM_HYBRID_SPIN_TRIES` tuning at fixed 100k/s:
  - Before (`SHM_HYBRID_SPIN_TRIES=256`): ~98.9k req/s, p50 ~0.54us, client CPU ~0.994, server CPU ~0.992 (total ~1.986).
  - After (`SHM_HYBRID_SPIN_TRIES=64`): ~98.5k req/s, p50 ~1.86us, client CPU ~0.987, server CPU ~0.364 (total ~1.351).
  - Effect: significant CPU drop (especially on the server) with a moderate latency increase.
- Sweep values: 0, 1, 4, 16, 64, 256, 1024.
- Metrics tracked:
- Max throughput (unlimited mode): req/s, p50, total CPU cores.
- Fixed-rate 100k target: achieved req/s, p50, total CPU cores.
- Averaged findings:
  - spin=0: max ~336k req/s (p50 2.69us, cpu 1.023), fixed-100k ~98.15k req/s (p50 2.94us, cpu 0.982)
  - spin=1: max ~338k req/s (p50 2.69us, cpu 1.048), fixed-100k ~98.27k req/s (p50 2.94us, cpu 0.989)
  - spin=4: max ~372k req/s (p50 2.81us, cpu 1.142), fixed-100k ~98.09k req/s (p50 2.94us, cpu 1.011)
  - spin=16: max ~1.17M req/s (p50 0.50us, cpu 1.676), fixed-100k ~98.16k req/s (p50 2.94us, cpu 1.101)
  - spin=64: max ~1.88M req/s (p50 0.48us, cpu 1.988), fixed-100k ~98.35k req/s (p50 1.86us, cpu 1.349)
  - spin=256: max ~1.95M req/s (p50 0.46us, cpu 1.989), fixed-100k ~99.00k req/s (p50 0.54us, cpu 1.989)
  - spin=1024: max ~1.82M req/s (p50 0.52us, cpu 1.989), fixed-100k ~99.06k req/s (p50 0.52us, cpu 1.990)
- Key pattern:
- More spins increase max throughput (up to ~256), but also increase CPU at fixed 100k/s.
- Low spins reduce fixed-rate CPU but hurt latency and cap max throughput.
- Max-throughput means (unlimited mode) and marginal gain per spin:
- spin 8: ~526.7k req/s
- spin 12: ~768.8k req/s, +60.5k req/s per added spin
- spin 16: ~1.123M req/s, +88.6k req/s per added spin
- spin 20: ~1.840M req/s, +179.3k req/s per added spin
- spin 24: ~1.815M req/s, -6.4k req/s per added spin
- spin 28: ~1.829M req/s, +3.6k req/s per added spin
- spin 32: ~1.891M req/s, +15.4k req/s per added spin
- Fixed-100k CPU for same spins:
- spin 8: total CPU ~1.048 cores
- spin 12: total CPU ~1.075 cores
- spin 16: total CPU ~1.107 cores
- spin 20: total CPU ~1.126 cores
- spin 24: total CPU ~1.146 cores
- spin 28: total CPU ~1.230 cores
- spin 32: total CPU ~1.230 cores
- Interpretation:
- Strong gains continue through spin 20.
- After ~20, throughput gains become small/noisy while fixed-rate CPU keeps climbing.
- Validation:
- No failed clients.
- No mismatches/collisions.
- Max-throughput mode (`target_rps=0`):
  - `shm-spin`: ~2.61M req/s, p50 ~0.34us, total CPU ~1.985 cores.
  - `shm-hybrid`: ~1.79M req/s, p50 ~0.53us, total CPU ~1.962 cores.
  - `shm-sem`: ~335k req/s, p50 ~2.77us, total CPU ~1.021 cores.
  - `uds-stream`: ~305k req/s, p50 ~2.94us, total CPU ~1.239 cores.
  - `uds-seqpacket`: ~265k req/s, p50 ~3.46us, total CPU ~1.000 cores.
  - `uds-dgram`: ~244k req/s, p50 ~3.88us, total CPU ~0.996 cores.
- Fixed-rate mode (`target_rps=100k`):
  - `shm-spin`: ~99.0k req/s, p50 ~0.34us, total CPU ~1.985 cores.
  - `shm-hybrid`: ~98.2k req/s, p50 ~3.20us, total CPU ~1.116 cores.
  - `shm-sem`: ~98.2k req/s, p50 ~2.94us, total CPU ~0.976 cores.
  - `uds-stream`: ~98.1k req/s, p50 ~3.71us, total CPU ~1.086 cores.
  - `uds-seqpacket`: ~98.1k req/s, p50 ~3.97us, total CPU ~0.974 cores.
  - `uds-dgram`: ~98.4k req/s, p50 ~4.22us, total CPU ~0.974 cores.
`RUN_DIR` lifecycle and cleanup policy for stale socket files after crashes/restarts.
- Current prototype behavior for stale endpoint files (`*.ipcshm`):
  - owner PID alive -> fail with `EADDRINUSE`.
  - owner PID dead/invalid region -> stale endpoint is unlinked and recreated automatically.
- Pure-Go transport assumptions:
- Go implementation is UDS-only in this phase (no cgo, no SHM polling path).
- Version mismatch hard-failure behavior (no external registry in v1).
- Timeout/retry/backoff semantics for plugin-to-plugin calls.
- Request correlation and cancellation support.
- Backpressure limits (max in-flight requests per connection/client).
- Security boundary assumptions (local machine, same netdata spawning context).
- Primary benchmark (user-defined):
- Counter ping-pong loop: request N, response N+1, strict client-side verification.
- Duration: 5 seconds per run.
- Required counters: requests sent, responses received, latest counter value, mismatch/collision count.
- Required metrics: throughput, latency per request/response, CPU utilization on both client and server.
- Primary ranking axis: single-client single-thread ping-pong quality (latency and CPU/request first, throughput second).
- Additional hypothesis check: fixed-rate single-client ping-pong at 100k req/s to compare CPU utilization across transports.
- Microbenchmarks:
- Use persistent connection/channel semantics for request-response loops (no per-request reconnect).
- Benchmark modes:
- strict single in-flight ping-pong baseline
- ordered-batch advanced-throughput mode for v1
- general pipelining remains a later phase unless batch measurements prove it is still required
- Candidate POSIX methodologies should include at least:
  - `AF_UNIX`/`SOCK_STREAM` (persistent connection).
  - `AF_UNIX`/`SOCK_SEQPACKET` (if available on the platform).
  - `AF_UNIX`/`SOCK_DGRAM` (message-oriented baseline).
  - Shared memory request/response channel with spinlock and blocking/hybrid synchronization variants.
- 1, 8, 64, 256 concurrent clients.
- Secondary ranking axis: scaling behavior from 1 -> 8 -> 64 -> 256 clients.
- Payload sizes: tiny (32B), small (256B), medium (4KB), large (64KB).
- Metrics: req/s, p50/p95/p99 latency, CPU per process, memory footprint.
- CPU collection: both external process sampling and in-process accounting.
- Compatibility tests:
- Cross-language round-trips for all pairs C/Rust/Go.
- POSIX matrix: Linux/macOS/FreeBSD.
- Windows matrix: Named Pipes implementation.
- Reliability tests:
- Server restarts, stale endpoints, client reconnect behavior.
- Invalid auth attempts and replay attempts.
- API stability tests:
- Schema evolution (backward/forward compatibility checks).
- Architecture doc: plugin IPC model, service discovery, auth model.
- Developer guide: how to expose/consume a service from plugins.
- Protocol spec: framing, message schema, error codes, versioning.
- Ops notes: RUN_DIR socket/pipe lifecycle, troubleshooting, perf tuning.
- User request: run the Linux benchmark for single-client ping-pong performance with no pipelining.
- Scope: execute the existing Linux benchmark path and report the measured results.
- User concern: only one implementation should exist: the library.
- User decision direction: no spin variant.
- User concern: benchmark results are not trustworthy when they measure imaginary/private benchmark transports instead of the reusable library.
- Required outcome: propose how to restructure benchmarking so only library code is measured.
-
Remove all in-tree traces of deleted prototype artifacts
- Remove remaining references in documentation, TODO/history notes, comments, scripts, and build files to deleted prototype-only artifacts.
- Scope is the current working tree only; git history is not being rewritten in this pass.
- Deleted artifact names should disappear from the tree entirely and be replaced with generic wording where historical context must remain.
-
Do not start Netdata integration until the robustness findings from Decision #90 are addressed.
- Source: user decision after robustness review.
- Intent: treat the current state as pre-integration hardening, not integration-ready.
- Clean up
TODO-plugin-ipc.mdso it becomes an active specification/plan instead of a long historical transcript.
- Source: user decision after robustness review.
- Intent: keep the final design, made decisions, hardening gate, plan, testing requirements, and documentation requirements; remove historical journey noise that no longer helps execution.
-
Trace-removal result
- Removed the deleted artifact names and deleted source paths from active build files, documentation, and TODO/history notes.
- Updated
Makefilecleanup to target current generated outputs instead of deleted prototype artifact names. - Verified with repo-wide search that the deleted artifact names no longer appear in the working tree.
- Renamed the Rust benchmark-driver package to remove the residual deleted benchmark substring from package metadata.
- Verified
make cleanstill passes after the cleanup.
-
Build system requirement: the repository must be CMake-based to match Netdata integration.
- Source: user decision "The build system must be based on cmake, because Netdata uses cmake and it will do the integration easier."
- Requirement:
- CMake must remain the top-level and authoritative repository build/test orchestration entrypoint.
- The repository layout and targets should stay easy to embed into Netdata's CMake build.
- Fact:
  - Rust crates still require `Cargo.toml`.
  - Go packages/modules still require `go.mod`.
- Implication:
  - `Cargo.toml` and `go.mod` remain native package metadata, but CMake should drive the repo-level build, validation, and benchmark targets.
  - Avoid introducing parallel non-CMake top-level build systems for repository orchestration.
-
CMake-first execution scope for the current pass
- Gap found:
- top-level CMake builds library/helper binaries, but it does not own validation or benchmark workflows yet.
- shell scripts currently self-configure/self-build, so they are runnable manually but not registered as first-class CMake workflows.
    - generated Rust/Go outputs outside the build directory are currently cleaned only by the wrapper `Makefile`, not by a CMake target.
- Plan for this pass:
- register library validation scripts as CMake/CTest workflows.
- register benchmark scripts as explicit CMake targets.
- let scripts respect a CMake-driven mode so they can run without recursive configure/build when launched from CMake.
- add a CMake cleanup target for generated non-build-dir artifacts.
- update documentation so CMake is the authoritative repo-level interface.
- Gap found:
-
CMake-first result
  - Added CMake workflow targets for validation and benchmarks: `netipc-check`, `netipc-bench`, `netipc-validate-all`, `netipc-clean-generated`.
  - Registered validation scripts with CTest via CMake build targets.
  - Updated validation/benchmark scripts to support a CMake-driven mode using: `NETIPC_CMAKE_BUILD_DIR`, `NETIPC_SKIP_CONFIGURE`, `NETIPC_SKIP_BUILD`.
  - Validation through CMake passed:
    - `cmake -S . -B build`
    - `cmake --build build --target test`
    - `cmake --build build --target netipc-bench`
    - `cmake --build build --target netipc-clean-generated`
  - Environment note:
    - the plain `ctest` command on this workstation resolves to a broken Python shim in `~/.local/bin/ctest`.
    - `cmake --build build --target test` is therefore the reliable documented path here.
-
Push decision for the cleanup + CMake-first pass
- Source: user approval "yes push"
- Decision:
  - push commit `e57e74d` on `main` to `origin`
-
Push blocker after approval
- Fact:
  - `git push origin main` was rejected with `fetch first` because `origin/main` contains commits not present locally.
- Implication:
  - local `main` cannot be pushed safely without first integrating the remote changes.
-
Inspect the current `origin/main` in a separate clone under `/tmp`
  - Source: user direction "clone the main to /tmp/ and check it."
  - Goal:
    - inspect remote commit `3d56710` (Add Windows SHM_HYBRID fast profile) without modifying the current working tree.
-
Integrate local cleanup/CMake pass on top of the current `origin/main`
  - Source: user direction "do it" after inspecting the remote `main` clone.
  - Decision:
    - rebase the local cleanup/CMake commit on top of `origin/main`
    - keep the remote Windows `SHM_HYBRID` fast-profile work
    - resolve overlap in `README.md`, `TODO-plugin-ipc.md`, `src/libnetdata/netipc/CMakeLists.txt`, and `tests/run-live-npipe-smoke.sh`
-
Rebase result
  - Rebase of the local cleanup/CMake commit onto `origin/main` completed cleanly.
  - New local commit id after rebase: `810d8eb`.
  - Validation on the rebased tree passed through CMake:
    - `cmake -S . -B build`
    - `cmake --build build --target test`
    - `cmake --build build --target netipc-bench`
    - `cmake --build build --target netipc-clean-generated`
-
Automated benchmark document generation requirement
  - Source: user requirement to automatically generate `benchmarks-posix.md` and `benchmarks-windows.md` from benchmark runs.
  - Requirements:
    - benchmark documents must be regenerated automatically from benchmark execution results.
    - the committed benchmark documents must never contain partial results.
    - generation must be buffered/staged so replacement happens only after a complete successful run.
  - Design constraints:
    - POSIX and Windows benchmark matrices are different, so completeness must be validated separately for each document.
    - cross-platform generation cannot rely on a single machine unless that machine can run both matrices.
-
Automated benchmark document generation decisions
- Source: user approval "I agree. Do it"
- Decisions:
- Source of truth: staged machine-readable benchmark result files, then generate Markdown from them.
- Trigger model: two explicit scripts, one POSIX and one Windows, independent of CMake orchestration.
- Implementation plan:
- benchmark scripts emit machine-readable results when requested.
- a generator validates completeness for POSIX and Windows independently.
- Markdown files are rendered to temporary paths and atomically renamed only on full success.
- generation is owned by two explicit scripts, not by CMake targets.
-
Benchmark-doc trigger clarification
- Source: user clarification "this does not need to be cmake driven" and "2 scripts: windows, linux"
- Decision:
    - keep the build CMake-based
- implement benchmark document regeneration as two explicit scripts:
- one script for POSIX
- one script for Windows
- those scripts call the existing benchmark entrypoints and regenerate the markdown docs atomically
- Implementation status:
    - `tests/generate-benchmarks-posix.sh` implemented
    - `tests/generate-benchmarks-windows.sh` implemented
    - benchmark runner scripts now optionally emit staged machine-readable CSV output via `NETIPC_RESULTS_FILE`
    - `benchmarks-posix.md` is generated from complete staged benchmark runs only
- Validation:
    - `bash -n tests/generate-benchmarks-posix.sh tests/generate-benchmarks-windows.sh tests/run-live-uds-bench.sh tests/run-live-shm-bench.sh tests/run-negotiated-profile-bench.sh tests/run-live-win-profile-bench.sh`
    - `env NETIPC_SKIP_CONFIGURE=1 ./tests/generate-benchmarks-posix.sh`
- Notes:
- the first POSIX generator run failed before publication because of two implementation bugs; no markdown file was created, confirming the buffered replacement behavior
    - `benchmarks-windows.md` is not generated yet in this Linux session; the Windows generator script is implemented and syntax-checked, but runtime validation still needs a Windows run
-
Direct Rust SHM benchmark coverage gap
  - Source: user request "Fix it." after identifying that `benchmarks-posix.md` lacks direct Rust `shm-hybrid` rows.
  - Facts:
    - the Rust library already implements the direct POSIX `SHM_HYBRID` transport in `src/crates/netipc/src/transport/posix.rs`.
    - the current Rust benchmark helper only exposes UDS commands, so `tests/run-live-shm-bench.sh` can benchmark only C direct SHM today.
    - `tests/run-negotiated-profile-bench.sh` measures Rust SHM only through negotiated UDS profile switching, not direct SHM helper commands.
  - Requirement:
    - add direct Rust `shm-hybrid` helper commands and include them in the authoritative POSIX SHM benchmark matrix and generated markdown.
- Plan:
    - inspect the existing Rust `ShmServer`/`ShmClient` library API and the current C helper behavior.
    - implement direct Rust SHM helper commands as thin wrappers over the library.
    - extend the POSIX SHM benchmark script and markdown generator to include Rust rows and validate completeness.
    - rerun the POSIX generator and update docs.
-
Direct Rust SHM interop requirement
- Source: user clarification "It should also have c->rust/rust->c tests".
- Requirement:
    - direct POSIX `shm-hybrid` coverage must include `c->rust` and `rust->c` validation, not only same-language Rust helper coverage.
- Plan extension:
- inspect the current SHM interop script and replace or extend it to use library-backed C and Rust SHM helpers in both directions.
- keep the authoritative POSIX benchmark markdown focused on benchmark rows, but make sure the direct SHM validation matrix includes cross-language C/Rust directions.
-
Investigate high SHM client CPU at 10k/s
  - Source: user request to explain why `shm-hybrid` shows high client CPU usage at `10k/s` and whether `spin_tries` is set to `20`.
  - Requirement:
    - after fixing direct Rust SHM benchmark coverage, inspect the actual `spin_tries` defaults and the benchmark helper pacing/client loop behavior for `shm-hybrid`.
    - provide an evidence-based explanation, separating facts from any speculation.
- Implementation status:
    - `tests/fixtures/rust/src/bin/netipc_live_rs.rs` now exposes direct SHM helper roles: `server-once`, `client-once`, `server-loop`, `server-bench`, `client-bench`.
    - `tests/run-live-shm-bench.sh` now runs the authoritative direct POSIX `SHM_HYBRID` matrix for `C/Rust` in all directed pairs.
    - `tests/generate-benchmarks-posix.sh` now validates and renders the SHM section as a `C/Rust` directed matrix instead of a single C-only row.
    - `CMakeLists.txt` now declares `netipc_live_rs` as a dependency of the SHM benchmark workflow.
- Validation:
    - `bash -n tests/run-live-shm-bench.sh tests/generate-benchmarks-posix.sh tests/run-live-uds-bench.sh`
    - `cargo build --release --manifest-path tests/fixtures/rust/Cargo.toml`
    - `env NETIPC_SKIP_CONFIGURE=1 ./tests/run-live-interop.sh`
    - `env NETIPC_SKIP_CONFIGURE=1 NETIPC_SKIP_BUILD=1 ./tests/run-live-shm-bench.sh`
    - `env NETIPC_SKIP_CONFIGURE=1 ./tests/generate-benchmarks-posix.sh`
- Important fact:
    - direct SHM `c->rust` and `rust->c` validation already existed in `tests/run-live-interop.sh`; this task fixed the benchmark/helper coverage gap, not a missing interop test path.
-
SHM client CPU investigation findings
- Facts:
  - the library default spin window is now `128` in both C and Rust:
    - `src/libnetdata/netipc/include/netipc/netipc_shm_hybrid.h`: `NETIPC_SHM_DEFAULT_SPIN_TRIES 128u`
    - `src/crates/netipc/src/transport/posix.rs`: `SHM_DEFAULT_SPIN_TRIES: u32 = 128`
  - the C SHM helper uses that default in `tests/fixtures/c/netipc_live_posix_c.c` via `shm_config(...).spin_tries = NETIPC_SHM_DEFAULT_SPIN_TRIES`.
  - the C benchmark pacing loop uses `sleep_until_ns()`, which intentionally busy-waits / yields close to the send deadline.
  - new direct SHM benchmark evidence shows a strong asymmetry at `10k/s`:
    - `c -> c`: client CPU ~0.977, server CPU ~0.031
    - `c -> rust`: client CPU ~0.978, server CPU ~0.029
    - `rust -> c`: client CPU ~0.039, server CPU ~0.034
    - `rust -> rust`: client CPU ~0.037, server CPU ~0.028
- Conclusion:
  - the high `10k/s` client CPU is not explained by `spin_tries = 20` alone.
  - the dominant cause is the C helper's rate-pacing strategy in the benchmark client, which stays active around each scheduled send time.
  - evidence: when the Rust helper drives the same direct SHM transport at `10k/s`, client CPU drops by about `25x` while using the same library transport semantics.
-
Adaptive client pacing for target-rate benchmarks
- Source: user request "fix the client to use adaptive sleep based on the remaining work. So, make it measure its speed and adapt the sleep interval to achieve the goal rate."
- Requirement:
- benchmark clients should not burn CPU by busy-waiting near every send deadline.
- pacing should adapt based on observed progress versus target throughput.
- this is benchmark-helper behavior only; it must not change library transport semantics.
- Working assumption:
  - apply the pacing fix to all benchmark helper clients that implement `target_rps`, so comparisons stay fair across languages.
  - if evidence later shows only the C helper should change, narrow the scope explicitly.
- Plan:
- inspect the current pacing loops in C, Rust, and Go benchmark helpers.
- replace deadline-chasing loops with adaptive sleep based on target progress and backlog.
- rerun benchmark validation and compare the low-rate CPU numbers.
- then commit and push the full SHM benchmark fix plus pacing change.
-
Adaptive pacing validation uncovered staged negotiated-profile output bug
- Facts:
  - `tests/generate-benchmarks-posix.sh` completed the staged UDS and SHM matrix runs successfully, then failed during the negotiated-profile stage.
  - failure mode: `tests/run-negotiated-profile-bench.sh` exited without producing `negotiated.csv`, so the generator aborted without updating `benchmarks-posix.md`.
  - this confirms the buffered publication rule is working as intended: partial benchmark documents are not written.
- Requirement extension:
  - fix the negotiated-profile benchmark script so it emits staged results consistently when `NETIPC_RESULTS_FILE` is set, then rerun the generator before commit/push.
-
Adaptive pacing validation results
- Validation:
  - `cmake -S . -B build`
  - `cmake --build build --target netipc-live-c netipc_live_rs netipc_live_uds_rs netipc-live-go`
  - `bash -n tests/generate-benchmarks-posix.sh tests/generate-benchmarks-windows.sh tests/run-live-uds-bench.sh tests/run-live-shm-bench.sh tests/run-negotiated-profile-bench.sh tests/run-live-win-profile-bench.sh`
  - `env NETIPC_SKIP_CONFIGURE=1 NETIPC_SKIP_BUILD=1 ./tests/run-live-shm-bench.sh`
  - `env NETIPC_SKIP_CONFIGURE=1 NETIPC_SKIP_BUILD=1 ./tests/run-live-uds-bench.sh`
  - `env NETIPC_SKIP_CONFIGURE=1 ./tests/generate-benchmarks-posix.sh`
- Result:
  - the adaptive pacing change removed the low-rate client CPU artifact across the POSIX benchmark helpers.
  - SHM direct `10k/s` client CPU changed from about `0.977-0.978` in the C-driven rows to about `0.034-0.037` after the pacing fix.
  - UDS direct `10k/s` C-client rows are now about `0.038-0.044`, instead of the previous busy-wait behavior around one full core.
- Additional fixes completed as part of validation:
  - `tests/generate-benchmarks-posix.sh` and `tests/generate-benchmarks-windows.sh` now preserve the real child exit status in their `run()` wrappers.
  - `tests/run-negotiated-profile-bench.sh` now uses a `12s` server timeout instead of `7s`, to avoid staged benchmark flakiness near the `5s` benchmark window.
  - `benchmarks-posix.md` was regenerated successfully from a complete staged run after these fixes.
-
Validate benchmark trustworthiness and counter correctness
  - Source: user concern that the benchmark results may not be trustworthy, specifically whether ping-pong correctness is verified by the final counting, and why direct `rust -> rust` SHM is about `5M req/s` while `c -> c` is about `3.2M req/s`.
  - Requirement:
    - inspect the active benchmark helpers and scripts to verify exactly how request/response correctness is enforced during ping-pong benchmarks.
    - determine whether the current benchmark validation guarantees strict counter-chain correctness or only aggregate counts.
    - inspect the direct SHM C and Rust helper/library paths to explain the large throughput gap, separating facts from working theory.
  - Plan:
    - inspect the C and Rust direct SHM benchmark client/server loops and the benchmark scripts.
    - confirm whether each round-trip validates `response == request + 1`, whether mismatches hard-fail, and whether final handled/request counts are cross-checked.
    - inspect the direct SHM transport/library code and helper hot paths for asymmetries that can explain the result gap.
    - respond with an evidence-based trust assessment and, if needed, propose concrete fixes.
-
Default CMake build type
- Source: user question "can we make Release the default?"
- Facts to verify:
  - the repo currently configures CMake with no default `CMAKE_BUILD_TYPE`, so single-config generators build C targets without optimization unless the user sets a build type explicitly.
  - Rust helper builds already use `cargo build --release`, so the current benchmark/build defaults are inconsistent across languages.
- Requirement:
  - inspect the current root CMake behavior and determine the correct way to default to `Release` without overriding explicit user choices or breaking multi-config generators.
- Plan:
  - inspect the root `CMakeLists.txt` and current build cache evidence.
  - present the safe options, implications, and recommendation before changing code.
-
Default CMake build type decision
- User decision: use `RelWithDebInfo` as the default build type for single-config CMake generators.
- Reasoning:
  - this keeps the repo optimized by default, avoiding the current unfair unoptimized C benchmark baseline.
  - it also keeps debug symbols, which is more practical for crash analysis and transport debugging than pure `Release`.
- Implementation scope:
  - set the default only when `CMAKE_CONFIGURATION_TYPES` is not used and `CMAKE_BUILD_TYPE` was not provided explicitly.
  - reconfigure the local build, verify the active C flags, rerun validation, and regenerate `benchmarks-posix.md`.
- Default CMake build type implementation
  - Implementation:
    - root `CMakeLists.txt` now defaults single-config generators to `RelWithDebInfo` when the user did not set `CMAKE_BUILD_TYPE` explicitly.
    - `README.md` now documents the default and how to override it with `-DCMAKE_BUILD_TYPE=Release`.
  - Next validation:
    - reconfigure the existing build directory to update the cached build type.
    - verify that active C flags now include the `RelWithDebInfo` optimization/debug flags.
    - rerun tests and regenerate `benchmarks-posix.md` under the new default.
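The single-config default pattern above can be sketched as a standard CMake guard. This is a minimal illustration, not the repo's actual `CMakeLists.txt` text; the comment strings and the `STRINGS` hint list are assumptions.

```cmake
# Sketch: default to RelWithDebInfo only when the generator is single-config
# (CMAKE_CONFIGURATION_TYPES unset) and the user gave no explicit build type.
if(NOT CMAKE_CONFIGURATION_TYPES AND NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE "RelWithDebInfo" CACHE STRING
      "Build type (default: RelWithDebInfo)" FORCE)
  # Offer the usual choices in cmake-gui/ccmake.
  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY
      STRINGS "Debug" "Release" "RelWithDebInfo" "MinSizeRel")
endif()
```

Because the guard tests both variables, multi-config generators (Visual Studio, Xcode) and explicit `-DCMAKE_BUILD_TYPE=...` invocations are left untouched.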
- Linux pure-Go SHM completion and negotiated SHM matrix widening
  - Facts:
    - Linux pure-Go POSIX SHM is now implemented in `src/go/pkg/netipc/transport/posix/shm_linux.go` on top of the unified SHM control/data layout.
    - non-Linux Unix now keeps `ProfileUDSSeqpacket` only via `src/go/pkg/netipc/transport/posix/shm_other.go`, so FreeBSD/macOS fall back to UDS as approved.
    - the widened fake `cgroups` SHM matrix passed for `C/Rust/Go` producers and clients via `tests/run-live-cgroups-shm.sh`.
    - the widened negotiated UDS SHM matrix initially failed only because `tests/run-live-uds-interop.sh` still launched the Go bench driver with the old C-only negotiation CLI.
  - Fix:
    - `bench/drivers/go/main.go` now honors `NETIPC_SUPPORTED_PROFILES`, `NETIPC_PREFERRED_PROFILES`, and `NETIPC_AUTH_TOKEN` through the same env-driven path already used by the Rust helper and the Go fake-service fixture.
    - `tests/run-live-uds-interop.sh` now drives Go negotiated SHM cases through that env-based override path instead of passing the C-only positional negotiation arguments.
  - Validation:
    - `bash tests/run-live-uds-interop.sh`
    - `bash tests/run-live-cgroups-shm.sh`
    - `/usr/bin/ctest --test-dir build --output-on-failure`
    - `cd src/go && go test ./...`
    - `cargo test --manifest-path src/crates/netipc/Cargo.toml`
  - Result:
    - Linux baseline UDS interop is green for the full `C/Rust/Go` matrix.
    - Linux negotiated SHM `profile=2` interop is green for the full `C/Rust/Go` matrix.
    - Linux fake `cgroups` SHM methodology is green for the full `C/Rust/Go` producer/client matrix.
- Complete POSIX benchmark picture before continuing snapshot/publish work
  - Source: user requirement on 2026-03-13 to see the complete POSIX benchmark picture before proceeding further.
  - Facts:
    - `benchmarks-posix.md` currently shows:
      - UDS matrix
      - legacy direct SHM matrix (`C/Rust` only)
      - negotiated profile matrix
    - it does not show the newer fake `cgroups` snapshot/cache benchmark, even though:
      - baseline `C/Go/Rust` snapshot/cache benchmark exists in `tests/run-live-cgroups-bench.sh`
      - SHM `C/Go/Rust` snapshot/cache benchmark exists in `tests/run-live-cgroups-shm-bench.sh`
    - therefore the current document cannot answer the user's Go-over-SHM question for the approved snapshot/cache methodology.
  - Decision:
    - widen `tests/generate-benchmarks-posix.sh` and `benchmarks-posix.md` so the generated document includes the fake `cgroups` snapshot/cache benchmark on:
      - baseline transports
      - SHM transports
    - keep the legacy direct `SHM_HYBRID` matrix too, but treat it as the old low-level ping-pong benchmark scope.
  - Goal:
    - make `benchmarks-posix.md` the complete Linux benchmark picture for both:
      - low-level transport throughput
      - approved snapshot/cache methodology (`C/Go/Rust`, baseline and SHM)
  - Progress slice (2026-03-13):
    - `tests/run-live-cgroups-bench.sh` was benchmark-hardened so server idle-exit is controlled by `NETIPC_CGROUPS_SERVER_IDLE_TIMEOUT_MS` instead of racing the harness watchdog at the old 10s timeout.
    - `tests/fixtures/c/netipc_cgroups_live.c`, `tests/fixtures/go/cgroups_server_unix.go`, `tests/fixtures/go/cgroups_server_windows.go`, and `tests/fixtures/rust/src/bin/netipc_codec_rs.rs` now honor that benchmark-only idle timeout for `server-loop`, while `server-once` keeps the longer functional timeout.
    - `tests/generate-benchmarks-posix.sh` was fixed to:
      - consume the real CSV exports produced by `tests/run-live-cgroups-bench.sh` and `tests/run-live-cgroups-shm-bench.sh` (`*.csv` sidecar files),
      - validate four-column cgroups rows correctly (`bench_type`, `scenario`, `client`, `server`),
      - regenerate the document from a complete staged run.
  - Validation:
    - `bash tests/run-live-cgroups-shm-bench.sh`
    - `env NETIPC_SKIP_CONFIGURE=1 NETIPC_SKIP_BUILD=1 bash tests/generate-benchmarks-posix.sh`
  - Result:
    - `benchmarks-posix.md` is now the complete Linux benchmark picture for:
      - UDS baseline throughput,
      - legacy direct low-level SHM throughput (`C/Rust` only),
      - negotiated profile throughput,
      - snapshot/cache baseline refresh and local lookup (`C/Go/Rust`),
      - snapshot/cache SHM refresh and local lookup (`C/Go/Rust`).
    - The high-speed POSIX path is still clearly high-speed:
      - negotiated `profile2-shm` max is about `2.96M req/s`,
      - legacy direct SHM max remains about `2.93M` to `3.10M req/s`,
      - UDS max remains about `155k` to `230k req/s`.
    - The approved snapshot/cache methodology also benefits strongly from SHM for C and Rust:
      - baseline refresh is about `198k` to `211k req/s`,
      - SHM refresh rises to about `2.06M` to `2.44M req/s`.
    - Linux Go SHM is now benchmarked and correctness-proven, but it is the current performance weak spot:
      - mixed Go SHM refresh is about `1.39M` to `1.48M req/s`,
      - `go->go` SHM refresh is only about `235k req/s`,
      - this is much slower than C/Rust SHM refresh and only modestly above the UDS/baseline class.
    - Local cache lookup is excellent for all languages and validates the cache-backed architecture:
      - baseline and SHM local lookup stay in the tens of millions of lookups per second,
      - the refresh path is the actual performance-critical IPC path.
- Linux Go SHM refresh performance investigation before snapshot/publish continuation
  - Source: user decision on 2026-03-13 after reviewing the completed `benchmarks-posix.md`.
  - Goal:
    - explain with evidence why Linux Go SHM refresh throughput is materially lower than C/Rust SHM refresh, especially for `go->go`.
    - fix the performance issue if the cause is in the current Go SHM implementation.
    - only continue deeper into snapshot/publish implementation after this gap is understood.
  - Initial facts:
    - the completed POSIX benchmark picture now shows:
      - `c->c` fake snapshot SHM refresh max about `2.44M req/s`
      - `rust->rust` fake snapshot SHM refresh max about `2.06M req/s`
      - mixed Go SHM refresh about `1.39M` to `1.48M req/s`
      - `go->go` fake snapshot SHM refresh max only about `235k req/s`
    - local lookup after refresh remains fast for Go too, so the hot issue is refresh IPC, not cache lookup.
  - Investigation findings:
    - The Go snapshot client and server paths do allocate per refresh, but that alone does not explain the collapse:
      - mixed Go SHM rows were already far above the broken `go->go` row.
    - The transport-level asymmetry was stronger:
      - C POSIX SHM uses `pause` in its spin loop.
      - Rust POSIX SHM uses `std::hint::spin_loop()` in its spin loop.
      - Go POSIX SHM was spinning without any CPU pause hint in `shmWaitForSequence(...)`.
    - Focused reproducer before the fix:
      - `go->go` fake snapshot SHM refresh max about `236k req/s`.
      - disabling Go GC did not solve it and actually made it worse, which ruled out "GC alone" as the primary cause.
    - Working theory confirmed by the fix:
      - two Go peers were busy-spinning against each other much more aggressively than C/Rust peers because the Linux Go SHM spin loop lacked a pause instruction.
  - Implemented fix:
    - Added a pure-Go-compatible `spinPause()` helper on Linux:
      - amd64 uses a tiny `PAUSE` assembly stub.
      - other Linux architectures use a no-op fallback.
    - Wired `spinPause()` into the Go POSIX SHM spin loop so it now matches the intent of the C and Rust implementations.
  - Verified result:
    - focused `go->go` fake snapshot SHM refresh improved from about `236k req/s` to about `1.16M req/s`.
    - regenerated official `benchmarks-posix.md` now shows:
      - `c->c` fake snapshot SHM refresh max `2,452,843.760 req/s`
      - `rust->rust` fake snapshot SHM refresh max `2,011,559.962 req/s`
      - `go->c` fake snapshot SHM refresh max `1,482,550.491 req/s`
      - `go->rust` fake snapshot SHM refresh max `1,475,460.735 req/s`
      - `go->go` fake snapshot SHM refresh max `1,165,422.668 req/s`
    - local lookup remains fast for all languages, including Go:
      - SHM `go->go` local cache lookup max now documented around `13.42M lookups/s`
  - Conclusion:
    - the catastrophic Linux Go SHM collapse is fixed.
    - Go SHM refresh is still slower than C/Rust SHM refresh, but it is now firmly in the high-speed class and no longer near the UDS/baseline class.
    - the remaining gap is likely dominated by Go-side allocation/materialization overhead in the current fake snapshot refresh path, not by a broken SHM synchronization loop.
- Pre-Netdata robustness gate
  - Source: user requirement on 2026-03-13 before starting real Netdata integration.
  - Goal:
    - determine whether the library is robust enough for Netdata integration.
    - identify concrete remaining risks, missing coverage, weak error-handling paths, abnormal situations, and unproven assumptions.
    - avoid integrating into Netdata until these are understood and either fixed or explicitly accepted.
  - Required work:
    - review the current implementation with a code-review mindset, prioritizing:
      - error handling
      - abnormal disconnect/restart behavior
      - resource cleanup and stale-endpoint recovery
      - cross-language protocol mismatch protection
      - cache refresh failure semantics
      - benchmark/validation blind spots
    - validate findings against existing tests and live scripts.
    - produce a brutally honest go/no-go assessment for Netdata integration.
  - Initial findings (2026-03-13):
    - The transport/server layer is still single-client, not the previously discussed multi-client/managed-server design:
      - POSIX C server stores one `conn_fd`
      - POSIX Rust server stores one `conn_fd: Option<RawFd>`
      - POSIX Go server stores one accepted connection
      - Windows C/Rust/Go servers likewise track a single connected pipe/session
    - This is acceptable for the current fake `cgroups -> ebpf` rehearsal, but it is not the general server model we previously designed for broader plugin-to-plugin use.
    - The cache-backed `cgroups` helper defaults still bias to baseline transports:
      - Unix defaults to `UDS_SEQPACKET`
      - Windows defaults to `Named Pipe`
      - SHM requires explicit profile override today
    - The cache-backed helper defaults also leave `auth_token = 0` unless integration explicitly sets it.
    - Snapshot/cache validation data is still highly synthetic:
      - most helper/library tests and live fixtures use a two-item hard-coded dataset (`system.slice-nginx`, `docker-1234`)
      - this is enough to validate mechanics, but not enough to prove robustness on large or unusually shaped real cgroup sets
    - Library-level negative coverage for the cache helper is still uneven:
      - Rust and Go have Unix UDS helper tests for "refresh failure keeps previous cache"
      - the same semantics are not covered by equivalent library-level tests on Windows SHM/baseline paths
      - the C helper is exercised through live fixtures/scripts, but does not have matching narrow library-level negative tests
    - The cache helper default response batch limit is `1000`, which matches the current Netdata `cgroup_root_max` default, but the real Netdata setting is configurable:
      - any production integration that increases the cgroup cap must also override the helper/transport response batch limit accordingly