feat: add heartbeat watchdog, device revocation pub/sub, rate limitin…#268

Merged: codebestia merged 1 commit into codebestia:main from dsdhananjay22:feat/gateway-hardening on Jun 28, 2026

@dsdhananjay22

closes #199
closes #196
closes #197
closes #214

Summary

Hardens the WebSocket gateway with four production-grade features: heartbeat watchdog, cross-instance device revocation, per-socket rate/payload limits, and backpressure shedding.


Changes

1. Heartbeat Watchdog

Problem: Abrupt disconnects (network drop, app kill) leave ghost-online devices. Presence was keyed on a 60s Redis TTL with no server-side enforcement.

Solution: services/heartbeat.ts — Each socket gets a 90s timer. On each heartbeat event from the client the timer resets, Redis presence TTL is refreshed, and devices.updatedAt is bumped (throttled to every 30s). If the timer fires, the device is marked offline in Redis and disconnected.

| Before | After |
| --- | --- |
| Client heartbeat → refreshPresence (60s TTL) only | Server-side 90s watchdog + Redis TTL refresh + throttled lastSeenAt |
| Ghost-online possible after abrupt disconnect | Device offline within ~90s, no ghost entries |
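
The watchdog logic described above can be sketched roughly as follows. This is not the code in services/heartbeat.ts: the class, hook names, and the explicit `now` parameter are all hypothetical, and the sketch tracks a per-device deadline that a periodic sweep checks instead of arming one timer per socket, which keeps it deterministic and easy to test.

```typescript
// Hypothetical sketch; names are illustrative, not taken from services/heartbeat.ts.
const HEARTBEAT_TIMEOUT_MS = 90_000; // watchdog window from the PR description
const LAST_SEEN_THROTTLE_MS = 30_000; // minimum gap between devices.updatedAt bumps

interface HeartbeatHooks {
  refreshPresence(deviceId: string): void; // e.g. reset the Redis presence TTL
  bumpLastSeen(deviceId: string): void; // e.g. update devices.updatedAt
  markOffline(deviceId: string): void; // e.g. clear presence and disconnect
}

class HeartbeatWatchdog {
  private hooks: HeartbeatHooks;
  private deadlines = new Map<string, number>();
  private lastBump = new Map<string, number>();

  constructor(hooks: HeartbeatHooks) {
    this.hooks = hooks;
  }

  // Call on connect and on every client heartbeat event.
  beat(deviceId: string, now: number): void {
    this.deadlines.set(deviceId, now + HEARTBEAT_TIMEOUT_MS);
    this.hooks.refreshPresence(deviceId);
    const last = this.lastBump.get(deviceId);
    if (last === undefined || now - last >= LAST_SEEN_THROTTLE_MS) {
      this.lastBump.set(deviceId, now);
      this.hooks.bumpLastSeen(deviceId);
    }
  }

  // Call periodically; expires every device whose deadline has passed
  // and returns the expired device IDs.
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.deadlines.forEach((deadline, deviceId) => {
      if (now >= deadline) expired.push(deviceId);
    });
    for (const deviceId of expired) {
      this.deadlines.delete(deviceId);
      this.lastBump.delete(deviceId);
      this.hooks.markOffline(deviceId);
    }
    return expired;
  }
}
```

In a real gateway, `beat` would be called from the heartbeat event handler with `Date.now()`, and `sweep` from an interval; a per-socket timer, as the description states, achieves the same effect without the sweep.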

2. Device Revocation via Redis Pub/Sub

Problem: Revoking a device only took effect on the next auth attempt. Live sockets on other gateway instances were never disconnected.

Solution: services/deviceRevocation.ts — On boot, the gateway subscribes to device_revoked:* on the app-level Redis. When a revocation message arrives, the in-memory revoked set is updated, all sockets for that device are disconnected, and Redis presence mappings are cleaned. A socket.use() middleware rejects any further events from revoked devices.

| Before | After |
| --- | --- |
| Revocation checked only at socket-auth time | Mid-session revocation disconnects live socket within seconds, across all nodes |
| Post-revocation events silently processed | Post-revocation events rejected with device_revoked error |
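
A minimal sketch of the in-memory side of this flow, under stated assumptions: the class and method names are hypothetical, and the channel suffix after `device_revoked:` is assumed to be the device ID (the description only gives the `device_revoked:*` pattern). Redis subscription wiring and presence cleanup are left out.

```typescript
// Hypothetical sketch; names are illustrative, not taken from services/deviceRevocation.ts.
const CHANNEL_PREFIX = "device_revoked:"; // assumption: suffix is the device ID

interface LiveSocket {
  id: string;
  deviceId: string;
  disconnect(): void;
}

class RevocationRegistry {
  private revoked = new Set<string>();
  private socketsByDevice = new Map<string, LiveSocket[]>();

  // Track a socket so a later revocation can find it.
  register(socket: LiveSocket): void {
    const list = this.socketsByDevice.get(socket.deviceId) ?? [];
    list.push(socket);
    this.socketsByDevice.set(socket.deviceId, list);
  }

  // Call from the pub/sub handler with the channel the message arrived on.
  // Returns how many local sockets were disconnected.
  handleRevocation(channel: string): number {
    if (!channel.startsWith(CHANNEL_PREFIX)) return 0;
    const deviceId = channel.slice(CHANNEL_PREFIX.length);
    this.revoked.add(deviceId);
    const sockets = this.socketsByDevice.get(deviceId) ?? [];
    this.socketsByDevice.delete(deviceId);
    sockets.forEach((s) => s.disconnect());
    return sockets.length;
  }

  // Returns an error for revoked devices so a per-event middleware can reject the event.
  guard(deviceId: string): Error | undefined {
    return this.revoked.has(deviceId) ? new Error("device_revoked") : undefined;
  }
}
```

Inside a Socket.IO `socket.use((packet, next) => ...)` middleware, the result of `guard` could be passed straight to `next`, so events from a revoked device never reach a handler. Because every gateway instance runs the same subscription, each one disconnects only the sockets it holds locally.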

3. Rate Limiting & Payload Size Enforcement

Problem: No per-socket event rate limits or payload size caps existed, leaving the gateway vulnerable to flooding and oversized messages.

Solution: services/rateLimit.ts — Redis counter per socket (rl:socket:{id}) with a 1-second sliding window. Exceeding the limit emits a warning; 3 consecutive violations trigger disconnect. Payload size is checked via serialized JSON length before any handler runs. Both limits are configurable via env.

| Env Variable | Default | Purpose |
| --- | --- | --- |
| SOCKET_RATE_LIMIT_PER_SEC | 10 | Max events per socket per second |
| MAX_PAYLOAD_SIZE | 16384 | Max event payload bytes |
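
The limit checks can be illustrated with a small in-memory stand-in for the Redis counter. Everything here is a hypothetical sketch: the class, option names, and verdict strings are invented for illustration, the window is a simple fixed 1-second window, and "consecutive violations" is assumed to mean a streak of over-limit events that resets whenever an event is accepted.

```typescript
// Hypothetical sketch; an in-memory stand-in for the Redis counter in services/rateLimit.ts.
type Verdict = "ok" | "warn" | "disconnect" | "payload_too_large";

interface LimiterOptions {
  ratePerSec: number; // cf. SOCKET_RATE_LIMIT_PER_SEC
  maxPayloadSize: number; // cf. MAX_PAYLOAD_SIZE
  maxStrikes: number; // violations before disconnect
}

const DEFAULT_LIMITS: LimiterOptions = { ratePerSec: 10, maxPayloadSize: 16384, maxStrikes: 3 };

class SocketLimiter {
  private opts: LimiterOptions;
  private windowStart = Number.NEGATIVE_INFINITY;
  private count = 0;
  private strikes = 0;

  constructor(opts: LimiterOptions = DEFAULT_LIMITS) {
    this.opts = opts;
  }

  // Evaluate one incoming event before any handler runs.
  check(payload: unknown, now: number): Verdict {
    // Serialized JSON length; note this counts UTF-16 code units, not bytes.
    const serialized: string | undefined = JSON.stringify(payload);
    const size = serialized === undefined ? 0 : serialized.length;
    if (size > this.opts.maxPayloadSize) return "payload_too_large";

    // Start a fresh 1-second window when the current one has elapsed.
    if (now - this.windowStart >= 1000) {
      this.windowStart = now;
      this.count = 0;
    }
    this.count += 1;
    if (this.count <= this.opts.ratePerSec) {
      this.strikes = 0; // assumption: an accepted event resets the streak
      return "ok";
    }
    this.strikes += 1;
    return this.strikes >= this.opts.maxStrikes ? "disconnect" : "warn";
  }
}
```

With the defaults, the 11th and 12th events inside one second yield `"warn"` and the 13th yields `"disconnect"`. A Redis-backed version would keep the counter under a key such as rl:socket:{id} so the count survives outside process memory.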

4. Backpressure / Slow Consumer Shedding

Problem: Slow/stalled clients caused unbounded buffering of live events in the server, consuming memory with no recovery path.

Solution: services/backpressure.ts — Every 5 seconds, each socket's WebSocket.bufferedAmount is checked. Above the shed threshold (32KB) the socket is marked as shed; above the disconnect threshold (64KB) the socket is disconnected. Additionally, all non-critical broadcasts (new_message, typing_*, read_receipt, user_online/offline, presence_update) now use volatile.emit() so they are dropped by Engine.IO when the transport buffer is full instead of being queued indefinitely.

| Before | After |
| --- | --- |
| Unbounded event buffering for slow clients | Shed slow consumers, let them sync on reconnect |
| Regular emits queue in memory | Volatile emits drop at transport level, buffer monitored |

| Env Variable | Default | Purpose |
| --- | --- | --- |
| SOCKET_SHED_THRESHOLD | 32768 | Bytes before shedding begins |
| SOCKET_BUFFER_THRESHOLD | 65536 | Bytes before forced disconnect |
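
The two-threshold check can be sketched as a pure classification step plus a sweep over the tracked sockets. The function and interface names are hypothetical, and one behavior is assumed rather than taken from the description: a socket whose buffer drains back under the shed threshold is un-marked.

```typescript
// Hypothetical sketch; names are illustrative, not taken from services/backpressure.ts.
type BufferAction = "ok" | "shed" | "disconnect";

interface BufferThresholds {
  shed: number; // cf. SOCKET_SHED_THRESHOLD
  disconnect: number; // cf. SOCKET_BUFFER_THRESHOLD
}

const DEFAULT_THRESHOLDS: BufferThresholds = { shed: 32768, disconnect: 65536 };

// Decide what to do with a socket given its currently buffered bytes.
function classifyBuffer(
  bufferedAmount: number,
  t: BufferThresholds = DEFAULT_THRESHOLDS,
): BufferAction {
  if (bufferedAmount > t.disconnect) return "disconnect";
  if (bufferedAmount > t.shed) return "shed";
  return "ok";
}

interface MonitoredSocket {
  id: string;
  bufferedAmount: number; // e.g. read from the underlying WebSocket
  disconnect(): void;
}

// One monitoring pass: update the shed set and disconnect stalled sockets.
function sweepBuffers(
  sockets: MonitoredSocket[],
  shedIds: Set<string>,
  t: BufferThresholds = DEFAULT_THRESHOLDS,
): void {
  for (const socket of sockets) {
    const action = classifyBuffer(socket.bufferedAmount, t);
    if (action === "disconnect") {
      shedIds.delete(socket.id);
      socket.disconnect();
    } else if (action === "shed") {
      shedIds.add(socket.id);
    } else {
      shedIds.delete(socket.id); // assumption: a drained socket is no longer shed
    }
  }
}
```

Running `sweepBuffers` from a 5-second interval matches the cadence in the description. Emit paths could then consult the shed set before sending non-critical events, alongside the volatile emits that already drop at the transport level.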

Files Changed

| File | Type | Lines |
| --- | --- | --- |
| src/services/heartbeat.ts | New | +73 |
| src/services/deviceRevocation.ts | New | +76 |
| src/services/rateLimit.ts | New | +53 |
| src/services/backpressure.ts | New | +86 |
| src/services/presence.ts | Modified | +9/-1 |
| src/index.ts | Modified | +85/-9 |
| src/socket/messaging.ts | Modified | +5/-5 |
| src/__tests__/readReceipts.test.ts | Modified | +4/-4 |

Testing

  • All 125 existing tests pass (15 test files)
  • ESLint: zero warnings
  • Prettier: clean
  • TypeScript: no new errors introduced

CI

The .github/workflows/backend-ci.yml pipeline runs Format → Lint → Tests. All three stages pass.

…g, and backpressure

- Heartbeat: server-side 90s timeout marks device offline and expires Redis
  TTLs when heartbeats stop. Throttled lastSeenAt bump via devices.updatedAt.
- Device revocation: subscribe gateways to device_revoked:* Redis channel.
  On receipt, disconnect the device socket immediately, clear Redis mappings,
  and reject post-revocation events via socket middleware.
- Rate limiting: per-socket event/sec counters via Redis with configurable
  SOCKET_RATE_LIMIT_PER_SEC env. Max payload enforcement via MAX_PAYLOAD_SIZE.
  Violations warn, 3rd violation disconnects.
- Backpressure: monitor WebSocket bufferedAmount every 5s. Shed slow consumers
  above SOCKET_SHED_THRESHOLD, disconnect above SOCKET_BUFFER_THRESHOLD.
  Non-critical broadcasts use volatile emit for graceful degradation.

drips-wave Bot commented Jun 28, 2026

@dsdhananjay22 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@codebestia codebestia left a comment

Owner

LGTM!
Thank you for your contribution.

@codebestia codebestia merged commit 6076e6f into codebestia:main Jun 28, 2026
2 checks passed

2 participants