
Commit f31bb28

feat: hash-ring routing, sharded locks, background eviction & AOF
Architecture:
- Hash-ring partitioned routing (SET/GET route to owner, transparent proxy)
- Direct HTTP replication to N hash-ring replicas (not gossip broadcast)
- ShardedMap with 32 independent lock shards (eliminates global mutex)
- Background eviction goroutine (never blocks write path)
- Background AOF via 10K buffered channel (off critical path)
- PerKeyOverhead (500B) for accurate memory tracking
- GetRawBytes() zero-copy GET for RESP (no deserialization for strings)
- DNS-based seed discovery (seed_dns config for K8s/Docker)
- syncExistingMembers() fixes join-before-subscribe race

Performance (redis-benchmark, M3 Pro):
- SET 50c: 75K -> 172K/s (+128%)
- SET pipeline: 96K -> 242K/s (+151%, ceiling eliminated)
- GET pipeline: 413K -> 459K/s (+11%)

Cleanup:
- Deleted 7 unused/broken scripts
- Updated README, WHITEPAPER, CLASS_DOCUMENTATION, USER_FACING_BOTTLENECKS
- All unit tests pass with -race, 171 Postman integration tests pass
1 parent bd7a490 commit f31bb28

31 files changed

Lines changed: 4726 additions & 2407 deletions

Makefile

Lines changed: 18 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -37,9 +37,25 @@ test-coverage-html: test-unit
3737
$(GO) tool cover -html=test-results/coverage.out -o test-results/coverage.html
3838
@echo "Coverage report: test-results/coverage.html"
3939

40-
## bench: Run benchmarks
40+
## bench: Run micro-benchmarks (caps each at 3s to avoid runaway iterations)
4141
bench:
42-
$(GO) test -bench=. -benchmem ./internal/...
42+
$(GO) test -bench=. -benchmem -benchtime=3s -timeout=10m ./internal/...
43+
44+
## bench-production: Run production benchmark suite (persistence, workloads, payload sizes, GC pressure)
45+
bench-production:
46+
$(GO) test -bench=. -benchmem -benchtime=1s -timeout=30m -run=^$$ ./tests/benchmarks/...
47+
48+
## bench-server: Run redis-benchmark against a live HyperCache server (no Docker needed)
49+
bench-server:
50+
bash scripts/run-server-benchmarks.sh
51+
52+
## stress: Run stress tests (memory exhaustion, thundering herd, persistence recovery, sustained load)
53+
stress:
54+
$(GO) test -v -timeout=10m ./tests/stress/...
55+
56+
## stress-long: Run extended stress tests (set STRESS_DURATION for sustained load, default 5min)
57+
stress-long:
58+
STRESS_DURATION=$${STRESS_DURATION:-5m} $(GO) test -v -timeout=30m -run=TestStress ./tests/stress/...
4359

4460
## lint: Run golangci-lint
4561
lint:

README.md

Lines changed: 70 additions & 38 deletions
@@ -21,12 +21,12 @@
2121
## 🎯 **Latest Features**
2222

2323
**Production-ready distributed cache with full observability stack:**
24-
- ✅ Multi-node cluster deployment with full replication
24+
- ✅ Multi-node cluster deployment with hash-ring partitioned routing
2525
- ✅ Full Redis client compatibility (RESP protocol)
2626
- ✅ Lamport timestamps for causal ordering of distributed writes
27-
- ✅ Read-repair for gossip propagation window
28-
- ✅ Early Cuckoo filter sync across nodes
29-
- ✅ Enterprise persistence (AOF + Snapshots)
27+
- ✅ Read-repair for replication propagation window
28+
- ✅ Sharded locks (32 independent shards) for high-concurrency writes
29+
- ✅ Enterprise persistence (AOF + Snapshots) with background writes
3030
- ✅ Structured JSON logging with correlation ID tracing
3131
- ✅ Real-time monitoring with Grafana + Elasticsearch
3232
- ✅ HTTP API + RESP protocol support
@@ -309,17 +309,18 @@ make deps Download and tidy dependencies
309309
- Multi-store commands: SELECT, STORES
310310

311311
### **Distributed Resilience**
312-
- **Full Replication**: Every node stores every key — maximum availability, any node serves any request
313-
- **Lamport Timestamps**: Logical clocks for causal ordering of distributed operations. Stale writes from out-of-order gossip are automatically rejected
314-
- **Read-Repair**: On local cache miss, peer nodes are queried before returning 404. Bridges the gossip propagation window (~50-500ms) so clients never see stale misses
315-
- **Early Cuckoo Filter Sync**: Filter is updated immediately on gossip receive, before data is written. Eliminates false "definitely not here" rejections during replication lag
316-
- **Idempotent Replication**: DELETE on a missing key is a no-op, not an error. Designed for eventual consistency
312+
- **Hash-Ring Routing**: Consistent hashing with 256 virtual nodes routes each key to its primary owner. Non-owner nodes transparently proxy requests to the correct node
313+
- **Targeted Replication**: Writes replicate to N hash-ring replicas (default 3) via direct HTTP — not gossip broadcast to all nodes
314+
- **Lamport Timestamps**: Logical clocks for causal ordering of distributed operations. Stale writes from out-of-order replication are automatically rejected
315+
- **Read-Repair**: On local cache miss, hash-ring replicas are queried before returning 404. Bridges the replication propagation window
316+
- **Sharded Concurrency**: 32 independent lock shards eliminate the global mutex bottleneck. Each key locks only its shard
317+
- **Background Eviction**: Memory pressure triggers a background evictor goroutine — eviction never blocks the write path
317318
- **Correlation ID Tracing**: Every request gets a unique ID that flows across all nodes for end-to-end debugging
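The hash-ring routing and targeted replication described in this list can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Ring` type, the FNV-1a hash, and the `node#i` virtual-node naming are assumptions, and the sketch assumes the 256 virtual nodes are per physical node.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a hypothetical consistent-hash ring; HyperCache's real type may differ.
type Ring struct {
	points []uint32          // sorted virtual-node positions
	owner  map[uint32]string // position -> physical node
}

// hashKey uses FNV-1a as an assumed hash function.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes virtual nodes on the ring for each physical node.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			if _, taken := r.owner[p]; taken {
				continue // rare position collision: keep the first claimant
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Replicas walks clockwise from the key's position and returns up to n
// distinct physical nodes. The first entry is the primary owner.
func (r *Ring) Replicas(key string, n int) []string {
	if len(r.points) == 0 {
		return nil
	}
	h := hashKey(key)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(r.points) && len(out) < n; i++ {
		node := r.owner[r.points[(start+i)%len(r.points)]]
		if !seen[node] {
			seen[node] = true
			out = append(out, node)
		}
	}
	return out
}

func main() {
	ring := NewRing([]string{"node-1", "node-2", "node-3", "node-4"}, 256)
	reps := ring.Replicas("user:42", 3) // replication factor 3, as in the README default
	fmt.Println("primary:", reps[0], "other replicas:", reps[1:])
}
```

In this model a node that receives a request for a key it does not own would forward it to `reps[0]`, and a write would be sent to every node in `reps` rather than broadcast to the whole cluster.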
318319

319320
### **Enterprise Persistence & Recovery**
320321
- **Hybrid Persistence**: AOF (Append-Only File) + Snapshot dual strategy
322+
- **Background AOF**: 10,000-entry buffered channel — writes never block the SET critical path
321323
- **Configurable per Store**: Each data store can have independent persistence policies
322-
- **Sub-microsecond Writes**: AOF logging with low-latency write path
323324
- **Fast Recovery**: Complete data restoration from AOF replay + snapshot loading
324325
- **Snapshot Support**: Point-in-time recovery with configurable intervals
325326
- **Durability Guarantees**: Configurable sync policies (always, everysec, no)
@@ -342,7 +343,8 @@ make deps Download and tidy dependencies
342343

343344
### **Advanced Memory Management**
344345
- **Per-Store Eviction Policies**: Independent LRU, LFU, or session-based eviction per store
345-
- **Smart Memory Pool**: Pressure monitoring (warning/critical/panic) with automatic cleanup
346+
- **Smart Memory Pool**: Pressure monitoring (warning/critical/panic) with background eviction
347+
- **Accurate Tracking**: 500-byte per-key overhead included in memory accounting (map bucket + struct + pointers)
346348
- **Real-time Usage Tracking**: Memory statistics and structured alerts
347349
- **Configurable Limits**: Store-specific memory boundaries
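To make the 500-byte per-key overhead concrete, here is a small worked sketch. The constant value comes from the commit; the `EntrySize` function and the choice to count key and value lengths alongside the overhead are assumptions about how the accounting might be done.

```go
package main

import "fmt"

// PerKeyOverhead is the 500-byte figure from the commit
// (map bucket + struct + pointers).
const PerKeyOverhead = 500

// EntrySize is a hypothetical helper: payload bytes plus the fixed overhead.
func EntrySize(key string, value []byte) int64 {
	return int64(len(key)) + int64(len(value)) + PerKeyOverhead
}

func main() {
	// A 7-byte key with a 10-byte value is charged 517 bytes, not 17.
	per := EntrySize("user:42", make([]byte, 10))
	fmt.Println("bytes per entry:", per)
	// One million such entries are accounted as 517 MB rather than 17 MB.
	fmt.Println("bytes for 1M entries:", per*1000000)
}
```

For small values the fixed overhead dominates, so ignoring it would let a store exceed its memory limit long before the tracked usage reached it.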
348350

@@ -371,8 +373,10 @@ make deps Download and tidy dependencies
371373
HyperCache/
372374
├── cmd/hypercache/ # Server entry point
373375
├── scripts/ # Deployment and management scripts
374-
│ ├── start-system.sh # Complete system launcher
375-
│ ├── build-and-run.sh # Build and cluster management
376+
│ ├── start-3node-local.sh # Local 3-node integration testing
377+
│ ├── start-cluster.sh # Production cluster launcher
378+
│ ├── add-node.sh # Add node to running cluster
379+
│ ├── run-server-benchmarks.sh # redis-benchmark test suite
376380
│ └── clean-*.sh # Cleanup utilities
377381
├── configs/ # Node configuration files
378382
│ ├── node1-config.yaml # Node 1 configuration
@@ -535,9 +539,11 @@ See [docs/README.md](docs/README.md) for the full documentation index:
535539
536540
### Clean Up
537541
```bash
538-
# Stop all services
539-
./scripts/build-and-run.sh stop
540-
docker-compose -f docker-compose.logging.yml down
542+
# Stop all local nodes
543+
pkill -f bin/hypercache
544+
545+
# Stop Docker cluster
546+
docker compose -f docker-compose.cluster.yml down
541547
542548
# Clean persistence data
543549
./scripts/clean-persistence.sh --all
@@ -550,17 +556,19 @@ docker-compose -f docker-compose.logging.yml down
550556

551557
### System Configuration
552558
```bash
553-
# Start complete system with monitoring
554-
./scripts/start-system.sh --all
559+
# Start 3-node local cluster for development/testing
560+
./scripts/start-3node-local.sh
555561

556-
# Start only cluster
557-
./scripts/start-system.sh --cluster
562+
# Start custom N-node cluster
563+
make cluster NODES=5
558564

559-
# Start only monitoring
560-
./scripts/start-system.sh --monitor
565+
# Stop cluster
566+
make cluster-stop
567+
pkill -f bin/hypercache
561568

562-
# Clean data and restart
563-
./scripts/start-system.sh --clean --all
569+
# Docker cluster with full monitoring stack (first command starts it, second stops it)
570+
docker compose -f docker-compose.cluster.yml up -d
571+
docker compose -f docker-compose.cluster.yml down
564572
```
565573

566574
### Node Configuration
@@ -575,6 +583,12 @@ network:
575583
http_port: 9080
576584
gossip_port: 7946
577585

586+
cluster:
587+
seeds: ["10.0.1.10:7946"] # Static seeds (IP:port or hostname)
588+
seed_dns: "" # DNS-based discovery (K8s headless Service)
589+
seed_dns_port: 7946 # Port for DNS-discovered seeds
590+
replication_factor: 3
591+
578592
cache:
579593
max_memory: "8GB"
580594
default_ttl: "0" # 0 = infinite (no expiry); set per-store or per-key
@@ -587,6 +601,13 @@ persistence:
587601
sync_policy: "everysec" # "always", "everysec", "no"
588602
```
589603
604+
**Seed Discovery Modes:**
605+
| Mode | Config | Use Case |
606+
|------|--------|----------|
607+
| Static | `seeds: ["ip:port"]` | Manual deployments |
608+
| DNS | `seed_dns: "headless-svc.ns.svc.cluster.local"` | Kubernetes StatefulSet |
609+
| Hostname | `seeds: ["node-1"]` | Docker Compose (auto-resolves via Docker DNS) |
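The discovery modes in the table can be combined into one seed list roughly as sketched below. `ResolveSeeds` and `lookupFunc` are hypothetical names, and merging static seeds with DNS results is an assumption; HyperCache may treat the two modes as mutually exclusive. The standard library's `net.LookupHost` has the same signature as `lookupFunc`, so it can be passed in directly outside of tests.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// lookupFunc matches the signature of net.LookupHost so a fake can be injected.
type lookupFunc func(host string) ([]string, error)

// ResolveSeeds returns the static seeds followed by one "ip:port" entry per
// address that seedDNS resolves to. An empty seedDNS disables DNS discovery.
func ResolveSeeds(static []string, seedDNS string, port int, lookup lookupFunc) ([]string, error) {
	seeds := append([]string(nil), static...) // copy so the caller's slice is untouched
	if seedDNS == "" {
		return seeds, nil
	}
	addrs, err := lookup(seedDNS)
	if err != nil {
		return seeds, fmt.Errorf("resolve %q: %w", seedDNS, err)
	}
	for _, ip := range addrs {
		seeds = append(seeds, net.JoinHostPort(ip, strconv.Itoa(port)))
	}
	return seeds, nil
}

func main() {
	// A fake resolver stands in for net.LookupHost so the example runs offline.
	fake := func(string) ([]string, error) {
		return []string{"10.0.1.11", "10.0.1.12"}, nil
	}
	seeds, err := ResolveSeeds(
		[]string{"10.0.1.10:7946"},
		"headless-svc.ns.svc.cluster.local",
		7946, // seed_dns_port
		fake,
	)
	fmt.Println(seeds, err)
}
```

With a Kubernetes headless Service, the lookup returns one address per ready pod, so new nodes can find the cluster without a hard-coded IP list.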
610+
590611
### Environment Variable Overrides (Docker / K8s)
591612
Environment variables have highest priority and override both defaults and YAML config:
592613

@@ -868,13 +889,19 @@ curl -X PUT http://localhost:9080/api/cache/feature:new_ui \
868889
git clone <your-repository-url>
869890
cd Cache
870891
871-
# Quick start - everything in one command
872-
./scripts/start-system.sh
892+
# Build
893+
make build
894+
895+
# Quick start — local 3-node cluster
896+
./scripts/start-3node-local.sh
897+
898+
# Or Docker with full monitoring stack
899+
docker compose -f docker-compose.cluster.yml up -d
873900
874901
# Access your system:
875-
# - Grafana: http://localhost:3000 (admin/admin123)
876-
# - API: http://localhost:9080/api/cache/
877-
# - Redis: localhost:8080 (redis-cli -p 8080)
902+
# - HTTP API: http://localhost:9080/api/cache/
903+
# - RESP: redis-cli -p 8080
904+
# - Grafana: http://localhost:3000 (admin/admin123, Docker only)
878905
```
879906

880907
### First Steps
@@ -886,18 +913,23 @@ cd Cache
886913
### Development Workflow
887914
```bash
888915
# Build and test
889-
go build -o bin/hypercache cmd/hypercache/main.go
890-
go test ./internal/... -v
916+
make build
917+
make test-unit
891918
892-
# Start development cluster
893-
./scripts/build-and-run.sh cluster
919+
# Start 3-node development cluster
920+
./scripts/start-3node-local.sh
894921
895-
# View logs
896-
tail -f logs/*.log
922+
# Run Postman integration tests
923+
# Import HyperCache.postman_collection.json → Run Collection
924+
925+
# Run server benchmarks (requires running server)
926+
make bench-server
927+
928+
# Stop cluster
929+
pkill -f bin/hypercache
897930
898-
# Stop everything
899-
./scripts/build-and-run.sh stop
900-
docker-compose -f docker-compose.logging.yml down
931+
# View logs (Docker)
932+
docker compose -f docker-compose.logging.yml up -d
901933
```
902934

903935
## 📚 **Documentation**
