feat: add heartbeat watchdog, device revocation pub/sub, rate limitin…#268
Merged
…g, and backpressure
- Heartbeat: server-side 90s timeout marks device offline and expires Redis TTLs when heartbeats stop. Throttled lastSeenAt bump via devices.updatedAt.
- Device revocation: subscribe gateways to device_revoked:* Redis channel. On receipt, disconnect the device socket immediately, clear Redis mappings, and reject post-revocation events via socket middleware.
- Rate limiting: per-socket event/sec counters via Redis with configurable SOCKET_RATE_LIMIT_PER_SEC env. Max payload enforcement via MAX_PAYLOAD_SIZE. Violations warn, 3rd violation disconnects.
- Backpressure: monitor WebSocket bufferedAmount every 5s. Shed slow consumers above SOCKET_SHED_THRESHOLD, disconnect above SOCKET_BUFFER_THRESHOLD. Non-critical broadcasts use volatile emit for graceful degradation.
@dsdhananjay22 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
codebestia
reviewed
Jun 28, 2026
codebestia
left a comment
Owner
LGTM!
Thank you for your contribution.
codebestia
approved these changes
Jun 28, 2026
closes #199
closes #196
closes #197
closes #214
Summary
Hardens the WebSocket gateway with four production-grade features: heartbeat watchdog, cross-instance device revocation, per-socket rate/payload limits, and backpressure shedding.
Changes
1. Heartbeat Watchdog
Problem: Abrupt disconnects (network drop, app kill) leave ghost-online devices. Presence was keyed on a 60s Redis TTL with no server-side enforcement.
Solution:
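As a rough illustration of the watchdog pattern this section introduces (not the code in services/heartbeat.ts), the sketch below shows a resettable per-socket timer with a throttled side effect. The class name, callbacks, and injectable clock are hypothetical; only the 90s timeout and 30s throttle values come from this PR.

```typescript
// Illustrative sketch only. All names here are hypothetical.
type WatchdogOptions = {
  timeoutMs: number;      // 90_000 in this PR
  bumpThrottleMs: number; // 30_000 in this PR
  now?: () => number;     // injectable clock, useful in tests
};

class HeartbeatWatchdog {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private lastBumpAt = -Infinity;
  private readonly opts: WatchdogOptions;
  private readonly onTimeout: () => void;
  private readonly onBump: () => void;
  private readonly now: () => number;

  constructor(opts: WatchdogOptions, onTimeout: () => void, onBump: () => void) {
    this.opts = opts;
    this.onTimeout = onTimeout; // e.g. mark offline in Redis, then disconnect
    this.onBump = onBump;       // e.g. bump devices.updatedAt
    this.now = opts.now ?? Date.now;
  }

  // Call once on connect and again on every heartbeat event.
  beat(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(this.onTimeout, this.opts.timeoutMs);

    // Throttle the expensive write so it runs at most once per bumpThrottleMs.
    const t = this.now();
    if (t - this.lastBumpAt >= this.opts.bumpThrottleMs) {
      this.lastBumpAt = t;
      this.onBump();
    }
  }

  // Call on a clean disconnect so the timer does not fire afterwards.
  stop(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }
}
```

In a gateway, beat() would presumably be wired to the socket's heartbeat handler and stop() to its disconnect handler; the presence TTL refresh would sit next to beat() and is left out here to keep the sketch free of Redis dependencies.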
services/heartbeat.ts: Each socket gets a 90s timer. On each heartbeat event from the client the timer resets, Redis presence TTL is refreshed, and devices.updatedAt is bumped (throttled to every 30s). If the timer fires, the device is marked offline in Redis and disconnected.
refreshPresence (60s TTL) only lastSeenAt
2. Device Revocation via Redis Pub/Sub
Problem: Revoking a device only took effect on the next auth attempt. Live sockets on other gateway instances were never disconnected.
Solution:
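The revocation flow in this section can be pictured with the dependency-free sketch below. It is an illustration, not the code in services/deviceRevocation.ts: the registry class and the SocketLike interface are hypothetical, and it assumes the device ID is the suffix of the device_revoked:* channel name, which this PR does not state explicitly.

```typescript
// Illustrative sketch only. Assumes channel names look like "device_revoked:<deviceId>".
const REVOKED_CHANNEL_PREFIX = "device_revoked:";

interface SocketLike {
  disconnect(): void;
}

class RevocationRegistry {
  private readonly revoked = new Set<string>();
  private readonly socketsByDevice = new Map<string, Set<SocketLike>>();

  // Track a live socket for a device on this gateway instance.
  register(deviceId: string, socket: SocketLike): void {
    let sockets = this.socketsByDevice.get(deviceId);
    if (sockets === undefined) {
      sockets = new Set<SocketLike>();
      this.socketsByDevice.set(deviceId, sockets);
    }
    sockets.add(socket);
  }

  // Handle one pub/sub message. Returns the revoked device ID,
  // or null if the channel does not match the expected prefix.
  handleMessage(channel: string): string | null {
    if (!channel.startsWith(REVOKED_CHANNEL_PREFIX)) return null;
    const deviceId = channel.slice(REVOKED_CHANNEL_PREFIX.length);
    if (deviceId === "") return null;

    this.revoked.add(deviceId);
    const sockets = this.socketsByDevice.get(deviceId);
    if (sockets !== undefined) {
      for (const s of sockets) s.disconnect();
    }
    this.socketsByDevice.delete(deviceId);
    return deviceId;
  }

  // Middleware-style guard: returns an Error for revoked devices, undefined otherwise.
  guard(deviceId: string): Error | undefined {
    return this.revoked.has(deviceId) ? new Error("device_revoked") : undefined;
  }
}
```

With Socket.IO, guard() would typically be called from a socket.use() middleware, passing its result to next(); the Redis pattern subscription and the presence cleanup are omitted here.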
services/deviceRevocation.ts: On boot, the gateway subscribes to device_revoked:* on the app-level Redis. When a revocation message arrives, the in-memory revoked set is updated, all sockets for that device are disconnected, and Redis presence mappings are cleaned. A socket.use() middleware rejects any further events from revoked devices.
device_revoked error
3. Rate Limiting & Payload Size Enforcement
Problem: No per-socket event rate limits or payload size caps existed, leaving the gateway vulnerable to flooding and oversized messages.
Solution:
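To make the warn-then-disconnect policy in this section concrete, here is a simplified in-memory sketch. It is not the code in services/rateLimit.ts: it swaps the Redis counter for a local fixed one-second bucket (a simpler stand-in for the window the PR describes), and it assumes "consecutive violations" means consecutive violating events, with any allowed event resetting the strike count. All names are hypothetical.

```typescript
// Illustrative sketch only. In-memory, fixed 1-second buckets instead of Redis.
type Verdict = "ok" | "warn" | "disconnect";

class SocketLimiter {
  private windowStart = 0;
  private count = 0;
  private strikes = 0;
  private readonly limitPerSec: number;
  private readonly maxPayloadSize: number;
  private readonly maxStrikes: number;

  constructor(limitPerSec: number, maxPayloadSize: number, maxStrikes = 3) {
    this.limitPerSec = limitPerSec;       // cf. SOCKET_RATE_LIMIT_PER_SEC
    this.maxPayloadSize = maxPayloadSize; // cf. MAX_PAYLOAD_SIZE
    this.maxStrikes = maxStrikes;         // 3 in this PR
  }

  // Evaluate one incoming event before any handler runs.
  check(payload: unknown, nowMs: number): Verdict {
    // Payload size is measured as the length of the serialized JSON string.
    const serialized: string | undefined = JSON.stringify(payload);
    const size = serialized === undefined ? 0 : serialized.length;

    // Start a fresh counter whenever we enter a new 1-second bucket.
    const bucket = Math.floor(nowMs / 1000) * 1000;
    if (bucket !== this.windowStart) {
      this.windowStart = bucket;
      this.count = 0;
    }
    this.count += 1;

    const violated = size > this.maxPayloadSize || this.count > this.limitPerSec;
    if (!violated) {
      this.strikes = 0;
      return "ok";
    }
    this.strikes += 1;
    return this.strikes >= this.maxStrikes ? "disconnect" : "warn";
  }
}
```

A caller would emit a warning to the client on "warn" and close the socket on "disconnect". A shared Redis counter would be needed if the limit had to hold across processes; this sketch only covers the decision logic.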
services/rateLimit.ts: Redis counter per socket (rl:socket:{id}) with a 1-second sliding window. Exceeding the limit emits a warning; 3 consecutive violations trigger disconnect. Payload size is checked via serialized JSON length before any handler runs. Both limits are configurable via env: SOCKET_RATE_LIMIT_PER_SEC, MAX_PAYLOAD_SIZE.
4. Backpressure / Slow Consumer Shedding
Problem: Slow/stalled clients caused unbounded buffering of live events in the server, consuming memory with no recovery path.
Solution:
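The two-threshold decision in this section reduces to a small pure function, sketched below. This is an illustration rather than the code in services/backpressure.ts: the function names are hypothetical, and it assumes 32KB and 64KB mean 32 * 1024 and 64 * 1024 bytes and that "above" is a strict comparison.

```typescript
// Illustrative sketch only. Threshold defaults assume KB = 1024 bytes.
type BufferAction = "ok" | "shed" | "disconnect";

const DEFAULT_SHED_BYTES = 32 * 1024;       // cf. SOCKET_SHED_THRESHOLD
const DEFAULT_DISCONNECT_BYTES = 64 * 1024; // cf. SOCKET_BUFFER_THRESHOLD

// Read a positive numeric threshold from an env-like map, falling back on bad input.
function readThreshold(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const parsed = Number(env[name]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Decide what to do with one socket given its current bufferedAmount in bytes.
function classifyBuffer(
  bufferedAmount: number,
  shedBytes: number = DEFAULT_SHED_BYTES,
  disconnectBytes: number = DEFAULT_DISCONNECT_BYTES,
): BufferAction {
  if (bufferedAmount > disconnectBytes) return "disconnect";
  if (bufferedAmount > shedBytes) return "shed";
  return "ok";
}

// One sweep over all sockets, as a periodic (e.g. 5s) timer might run it.
function sweep(
  sockets: Iterable<{ id: string; bufferedAmount: number }>,
  shedBytes: number = DEFAULT_SHED_BYTES,
  disconnectBytes: number = DEFAULT_DISCONNECT_BYTES,
): Map<string, BufferAction> {
  const result = new Map<string, BufferAction>();
  for (const s of sockets) {
    result.set(s.id, classifyBuffer(s.bufferedAmount, shedBytes, disconnectBytes));
  }
  return result;
}
```

Keeping the classification separate from the timer and the socket objects makes the thresholds easy to unit-test; the actual shedding and disconnect calls would be applied to the sweep result by the caller.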
services/backpressure.ts: Every 5 seconds, each socket's WebSocket.bufferedAmount is checked. Above the shed threshold (32KB) the socket is marked as shed; above the disconnect threshold (64KB) the socket is disconnected. Additionally, all non-critical broadcasts (new_message, typing_*, read_receipt, user_online/offline, presence_update) now use volatile.emit() so they are dropped by Engine.IO when the transport buffer is full instead of being queued indefinitely.
SOCKET_SHED_THRESHOLD, SOCKET_BUFFER_THRESHOLD
Files Changed
src/services/heartbeat.ts
src/services/deviceRevocation.ts
src/services/rateLimit.ts
src/services/backpressure.ts
src/services/presence.ts
src/index.ts
src/socket/messaging.ts
src/__tests__/readReceipts.test.ts
Testing
CI
The .github/workflows/backend-ci.yml pipeline runs Format → Lint → Tests. All three stages pass.