NUMA: Cross-Socket Performance Validation

Description

Validate and optimize nimsync's performance on NUMA (Non-Uniform Memory Access) architectures with multiple CPU sockets.

Current Status

  • Testing: ❌ Not validated on multi-socket systems
  • Optimization: ⚠️ Unknown whether cache-line alignment helps or hurts across sockets
  • Blocking: Large server deployments (2+ socket systems)

Why NUMA Matters

Modern servers often have multiple CPU sockets:

  • 2-socket systems: AMD EPYC, Intel Xeon (common in cloud)
  • 4-socket systems: High-end servers
  • 8+ socket systems: Specialized HPC

NUMA introduces memory access latency differences:

  • Local memory: ~70ns access time
  • Remote socket: ~140ns access time (2x slower!)
  • Cache effects: Cross-socket cache coherency traffic
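
The local/remote gap on a given machine can be read from the ACPI SLIT distance table. A minimal libnuma sketch (assumes libnuma is installed; compile with -lnuma):

```c
// Print NUMA node distances from the SLIT table: 10 means local;
// remote sockets report larger values (commonly 20-32, CPU-dependent).
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("node 0 -> node 0: %d\n", numa_distance(0, 0));  // local
    printf("node 0 -> node 1: %d\n", numa_distance(0, 1));  // remote socket
    return 0;
}
```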

Current Unknowns

  1. Does SPSC work well across sockets?

    • If the producer runs on socket 0 and the consumer on socket 1, does the 558M ops/sec micro-benchmark result hold?
    • Or does it degrade significantly due to remote memory access?
  2. Is cache-line alignment (64 bytes) optimal?

    • The current padding prevents false sharing on a single socket
    • But does it cause excessive cache coherency traffic on NUMA?
  3. Should we pin threads to cores? (see the sketch after this list)

    • Pinning prevents migration across sockets
    • But it reduces the OS scheduler's flexibility
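
As a concrete reference for items 2 and 3, here is a minimal C sketch of both mechanisms; the struct and function names are illustrative, not nimsync's actual layout (Linux-specific, C11):

```c
// Illustrative only -- not nimsync's actual layout. Shows (1) 64-byte
// alignment so the producer and consumer indices live on separate cache
// lines, and (2) pinning a thread to a core so the scheduler cannot
// migrate it to another socket.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdalign.h>
#include <stdatomic.h>

typedef struct {
    alignas(64) atomic_size_t head;  // written by the consumer
    alignas(64) atomic_size_t tail;  // written by the producer
} spsc_indices;

// Pin the calling thread to a single core; pass a core on the desired
// socket (core-to-socket mapping is hardware-specific -- check
// `numactl --hardware`).
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```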

Testing Needed

Hardware

  • Access to a 2+ socket AMD EPYC or Intel Xeon system
  • numactl for thread/memory pinning
  • Hardware performance counters (perf)

Benchmarks

```bash
# Same socket (baseline)
numactl --cpunodebind=0 --membind=0 ./benchmark_spsc_simple

# Cross socket (worst case): producer on socket 0, consumer on socket 1.
# Core numbering is hardware-specific; check `numactl --hardware` first
# (on a 2x64-core EPYC, cores 0-63 are typically node 0, 64-127 node 1).
taskset -c 0 ./producer & taskset -c 64 ./consumer

# Measure:
# - Throughput degradation
# - Latency increase
# - Cache miss rates, e.g. (event names vary by CPU; see `perf list`):
perf stat -e LLC-load-misses,LLC-loads ./benchmark_spsc_simple
```

Expected Outcomes

  1. Quantify NUMA penalty: "Cross-socket reduces throughput by X%"
  2. Optimization guide: "For best performance on NUMA, do Y"
  3. Code changes if needed (see the allocation sketch below):
    • NUMA-aware allocation (numa_alloc_onnode)
    • Socket-specific optimizations
    • Documentation on thread pinning
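
The first bullet builds on libnuma's placement primitive. A hypothetical helper (the name and shape are assumptions, not an existing nimsync API):

```c
// Hypothetical helper: place the ring buffer's backing pages on a
// specific NUMA node so threads pinned there get local-memory latency.
// Link with -lnuma; release later with numa_free(ptr, bytes).
#include <numa.h>
#include <stddef.h>

void *alloc_ring_on_node(size_t bytes, int node) {
    if (numa_available() < 0)
        return NULL;                         // no NUMA support here
    return numa_alloc_onnode(bytes, node);   // pages placed on `node`
}
```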

Acceptance Criteria

  • Benchmarks run on 2-socket system
  • Document same-socket vs cross-socket performance
  • Recommendations for NUMA deployments
  • (Optional) NUMA-aware channel allocation API
  • CI tests on NUMA hardware (if available)

Reference Implementations

  • DPDK: Heavily NUMA-optimized, good patterns to study
  • ScyllaDB: Sharded architecture for NUMA
  • LMAX Disruptor: NUMA considerations in ring buffer

Help Wanted

Skills needed: NUMA architecture understanding, systems programming, performance analysis

Resources:

  • man numa and man numactl
  • Intel's NUMA optimization guide
  • AMD EPYC tuning guide

Hardware access: This is the main blocker; testing requires access to a multi-socket system.


Priority: Medium 🟡 (not blocking single-socket deployments)
Difficulty: Medium 🟡 (testing complexity, not implementation)
Impact: Medium 🟡 (only affects large server deployments)

Current Workaround

For now, users on NUMA systems should:

  • Pin producer and consumer to the same socket (e.g., run the process under numactl --cpunodebind=0 --membind=0)
  • Use one channel per socket
  • Benchmark their specific workload

But proper validation and documentation would be better!