Validate and optimize nimsync's performance on NUMA (Non-Uniform Memory Access) architectures with multiple CPU sockets.
- Testing: ❌ Not validated on multi-socket systems
- Optimization: ⚠️ Unknown whether cache-line alignment helps or hurts across sockets
- Blocking: Large server deployments (2+ socket systems)
Modern servers often have multiple CPU sockets:
- 2-socket systems: AMD EPYC, Intel Xeon (common in cloud)
- 4-socket systems: High-end servers
- 8+ socket systems: Specialized HPC
NUMA introduces memory access latency differences:
- Local memory: ~70ns access time
- Remote socket: ~140ns access time (2x slower!)
- Cache effects: Cross-socket cache coherency traffic
**Does SPSC work well across sockets?**
- If the producer runs on socket 0 and the consumer on socket 1, does the 558M ops/sec micro-benchmark result hold?
- Or does throughput degrade significantly due to remote memory access?
**Is cache-line alignment (64 bytes) optimal?**
- Current padding prevents false sharing on single socket
- But does it cause excessive cache coherency traffic on NUMA?
**Should we pin threads to cores?**
- Prevents migration across sockets
- But reduces OS flexibility
- Access to a 2+ socket AMD EPYC or Intel Xeon system
- `numactl` for thread/memory pinning
- Hardware performance counters (perf)
```sh
# Same socket (baseline)
numactl --cpunodebind=0 --membind=0 ./benchmark_spsc_simple

# Cross socket (worst case)
# Producer on socket 0, consumer on socket 1
taskset -c 0 ./producer & taskset -c 64 ./consumer

# Measure:
# - Throughput degradation
# - Latency increase
# - Cache miss rates (perf stat -e LLC-load-misses)
```

- Quantify the NUMA penalty: "Cross-socket reduces throughput by X%"
- Optimization guide: "For best performance on NUMA, do Y"
- Code changes if needed:
  - NUMA-aware allocation (`numa_alloc_onnode`)
  - Socket-specific optimizations
- Documentation on thread pinning
- Benchmarks run on 2-socket system
- Document same-socket vs cross-socket performance
- Recommendations for NUMA deployments
- (Optional) NUMA-aware channel allocation API
- CI tests on NUMA hardware (if available)
- DPDK: Heavily NUMA-optimized, good patterns to study
- ScyllaDB: Sharded architecture for NUMA
- LMAX Disruptor: NUMA considerations in ring buffer
Skills needed: NUMA architecture understanding, systems programming, performance analysis
Resources:
- `man numa` and `man numactl`
- Intel's NUMA optimization guide
- AMD EPYC tuning guide
Hardware access: This is the blocker - need access to multi-socket system for testing
Priority: Medium 🟡 (not blocking single-socket deployments)
Difficulty: Medium 🟡 (testing complexity, not implementation)
Impact: Medium 🟡 (only affects large server deployments)
For now, users on NUMA systems should:
- Pin producer/consumer to same socket
- Use one channel per socket
- Benchmark their specific workload
But proper validation and docs would be better!