Add ALB idle timeout configuration guidance for WebSocket workloads #8930

@sadohert

Summary

The AWS ALB default idle timeout is 60 seconds, and the Mattermost server sends WebSocket ping frames every 60 seconds. Because the two intervals are equal, there is a race: the ALB may drop an otherwise idle WebSocket connection just as a ping is due, and the client then sees a ~10s TCP timeout on its next write.
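The timing argument above can be sketched as a small check. This is only an illustration: `idle_timeout_verdict` is a hypothetical helper, not part of Mattermost or the AWS CLI, and it simply compares a ping interval against a proxy idle timeout (both in whole seconds).

```shell
# Hypothetical helper (not part of Mattermost or the AWS CLI).
# Args: ping interval and proxy idle timeout, both in whole seconds.
idle_timeout_verdict() {
  ping_s=$1
  idle_s=$2
  if [ "$idle_s" -gt "$ping_s" ]; then
    echo "ok: ${idle_s}s idle timeout leaves $((idle_s - ping_s))s of headroom over the ${ping_s}s ping"
  else
    echo "at risk: ${idle_s}s idle timeout is not above the ${ping_s}s ping interval"
  fi
}

idle_timeout_verdict 60 60    # ALB default
idle_timeout_verdict 60 90    # nginx proxy_read_timeout from the lab test
idle_timeout_verdict 60 300   # recommended ALB value
```

Only the first call reports "at risk"; the nginx and 300s cases both leave the ping enough room to reset the idle timer.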

Evidence

A lab load test with the default 60s ALB idle timeout showed a P99 latency of 8.9s and a 61% timeout error rate, which prevented the coordinator from scaling past ~600 users. The same workload through nginx (proxy_read_timeout 90s) ran to 1,500 users without issue. The Mattermost WebSocket client source confirms that the server-side ping interval is exactly 60s.

Recommendation

Add a configuration note to the AWS deployment / load balancer documentation recommending ALB idle timeout be set to 300 seconds (5 minutes):

```shell
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```

Or via AWS Console: EC2 → Load Balancers → select ALB → Attributes → Edit → Idle timeout → 300.
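After changing the setting, it may be worth confirming that it took effect. A minimal sketch, assuming the AWS CLI is installed and has credentials for the account: `alb_idle_timeout` and `alb_idle_timeout_at_least` are hypothetical wrapper names, while the underlying `describe-load-balancer-attributes` call and the `idle_timeout.timeout_seconds` key are the standard ones.

```shell
# Print the idle timeout (in seconds) currently set on the load balancer.
# Arg: the load balancer ARN.
alb_idle_timeout() {
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
    --output text
}

# Hypothetical wrapper: succeeds only if the current timeout is at least
# the given minimum. Args: load balancer ARN, minimum seconds.
alb_idle_timeout_at_least() {
  current=$(alb_idle_timeout "$1") || return 2
  [ "$current" -ge "$2" ]
}
```

For example, `alb_idle_timeout_at_least <alb-arn> 300 && echo "idle timeout OK"` prints the message only when the ALB is at or above the recommended value.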

Additional Investigation Note

Pull CloudWatch metrics during the next high-load event (e.g. the 5am EST spike):

  • ActiveConnectionCount
  • ConsumedLCUs
  • RejectedConnectionCount
  • TargetResponseTime

If RejectedConnectionCount > 0 or ConsumedLCUs is near limits, an NLB may be a better long-term fit for this WebSocket-heavy workload.
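The RejectedConnectionCount check could be scripted roughly as follows. This is a sketch under assumptions: `alb_rejected_connections` is a hypothetical helper name, the 60-second period is an arbitrary choice, and the first argument is the CloudWatch `LoadBalancer` dimension value (the `app/<name>/<id>` suffix of the ALB ARN), not the full ARN.

```shell
# Sum RejectedConnectionCount for one ALB over a time window.
# Args: LoadBalancer dimension value (app/<name>/<id>), start time, end time
#       (ISO 8601, e.g. 2024-01-01T09:00:00Z).
alb_rejected_connections() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RejectedConnectionCount \
    --dimensions Name=LoadBalancer,Value="$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 60 \
    --statistics Sum \
    --query 'Datapoints[].Sum' \
    --output text |
    tr '\t' '\n' |
    awk '{ total += $1 } END { print total + 0 }'   # prints 0 when there are no datapoints
}
```

A non-zero result over the spike window is the signal described above. The other listed metrics can be pulled with the same command by changing `--metric-name` and choosing a statistic that suits the metric.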

Workstream

Implementation & Onboarding — validated during Midmarket HA Postgres load testing.
