Worker logs first control-plane reconnect at WARNING ("retrying in 0s") for expected, self-healing churn #6108

@marctorsoc

Summary

Worker._run logs the first control-plane websocket reconnect at WARNING
with "retrying in 0s", even though that first immediate retry is expected churn
that almost always succeeds. In a long-running deployment this produces a steady
trickle of self-healing WARNINGs that trip warning-level alerting, with nothing
actionable behind them. It would be better to log the immediate first reconnect
below WARNING and only escalate to WARNING once a retry actually has to back
off (i.e. from the second attempt / non-zero delay onward).

Where

livekit/agents/worker.py, Worker._run (livekit-agents 1.5.17):

retry_delay = min(retry_count * 2, 10)
retry_count += 1

logger.warning(
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)
await asyncio.sleep(retry_delay)

and just after a successful connect:

ws = await self._http_session.ws_connect(...)
retry_count = 0

Why it's always 0s

Because retry_count is reset to 0 the moment a connection succeeds, the first
reconnect after any websocket drop computes retry_delay = min(0 * 2, 10) = 0s.
The long-lived control-plane websocket drops fairly often in cloud environments
(LB/proxy idle timeouts, edge nodes recycling, transient network blips). Each drop
emits:

failed to connect to livekit, retrying in 0s

at WARNING, then reconnects immediately and succeeds — so retry_count resets
to 0 and it never escalates. The WARNING is, in practice, "a websocket
blipped and we already recovered."

We see these a few times a day across worker replicas, every one recovering, no
real outage behind any of them.
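
To make the arithmetic concrete, here is a minimal standalone sketch that replays the delay formula quoted above. The function name reconnect_delays and its parameters are made up for illustration and are not part of livekit-agents; only the two lines computing retry_delay and incrementing retry_count come from the snippet above.

```python
def reconnect_delays(drops: int, failures_per_drop: int = 0) -> list[int]:
    """Replay the retry_delay that would be logged on each failed attempt.

    `drops` is the number of separate websocket drops; `failures_per_drop`
    is how many additional attempts fail before the reconnect succeeds.
    Both names are hypothetical and exist only for this illustration.
    """
    delays = []
    for _ in range(drops):
        retry_count = 0  # reset by the previous successful connect
        # The drop itself is the first failure, then any further failures.
        for _ in range(1 + failures_per_drop):
            retry_delay = min(retry_count * 2, 10)  # same formula as Worker._run
            retry_count += 1
            delays.append(retry_delay)
    return delays


# Three isolated blips, each recovering on the immediate retry.
print(reconnect_delays(drops=3))                       # [0, 0, 0]
# One sustained outage where six further attempts also fail.
print(reconnect_delays(drops=1, failures_per_drop=6))  # [0, 2, 4, 6, 8, 10, 10]
```

Isolated blips only ever produce the 0s entry, which is why the WARNING never escalates in the pattern described here.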

Suggested change

Treat the immediate first reconnect as expected and reserve WARNING for retries
that actually have to wait. The simplest fix needs no new config: log below
WARNING when retry_delay == 0 (the first reconnect after retry_count resets)
and at WARNING from the first backed-off attempt onward:

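# retry_delay == 0 only on the first reconnect after retry_count was reset,
# so log that one quietly and keep every backed-off retry at WARNING.
# (Needs `import logging` in worker.py if it is not already imported.)
level = logging.WARNING if retry_delay > 0 else logging.DEBUG
logger.log(
    level,
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)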

This keeps real, sustained connection problems loud (escalating 2s/4s/.../10s
retries and the terminal "failed to connect to livekit after N attempts") while
removing the self-healing first-hop noise.
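
As a rough check of what the proposed rule would emit, the sketch below applies the same level selection to a simulated run of failed attempts. It is a standalone illustration, not the worker's code: retry_log_level and levels_for_outage are hypothetical helper names.

```python
import logging


def retry_log_level(retry_delay: int) -> int:
    """Proposed rule: DEBUG for the immediate retry, WARNING once backing off."""
    return logging.WARNING if retry_delay > 0 else logging.DEBUG


def levels_for_outage(failed_attempts: int) -> list[str]:
    """Level names that would be logged for consecutive failed attempts."""
    names = []
    retry_count = 0  # state right after a successful connect
    for _ in range(failed_attempts):
        retry_delay = min(retry_count * 2, 10)  # formula from Worker._run
        retry_count += 1
        names.append(logging.getLevelName(retry_log_level(retry_delay)))
    return names


print(levels_for_outage(1))  # ['DEBUG']
print(levels_for_outage(3))  # ['DEBUG', 'WARNING', 'WARNING']
```

A single self-healing blip stays at DEBUG, while any outage that survives the immediate retry reaches WARNING on its second attempt.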

If you'd rather make it configurable, the natural home is WorkerOptions /
AgentServer — the worker owns this control-plane connection — e.g. a
log_connection_retries toggle, or a minimum-attempt threshold below which
reconnects log at DEBUG. It would not belong at the session level: this
websocket is sessionless and lives in the worker's main process, so a per-session
option could never reach it. Either way the level fix above already covers the
common case without new API surface, so a flag is optional.
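
If the configurable route were preferred, the minimum-attempt threshold could look roughly like the sketch below. Everything here is hypothetical: RetryLogPolicy and warn_from_attempt do not exist in livekit-agents, and how such a setting would be wired into WorkerOptions / AgentServer is left open. The retry_count passed in is assumed to be the post-increment value already included in the log record's extra (1 for the first failed attempt).

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryLogPolicy:
    """Hypothetical policy object; not an existing livekit-agents API."""

    # First attempt number (1-based) that should be logged at WARNING.
    warn_from_attempt: int = 2

    def level_for(self, retry_count: int) -> int:
        """Return the log level for the given 1-based failed-attempt number."""
        if retry_count >= self.warn_from_attempt:
            return logging.WARNING
        return logging.DEBUG


policy = RetryLogPolicy()
print(logging.getLevelName(policy.level_for(1)))  # DEBUG
print(logging.getLevelName(policy.level_for(2)))  # WARNING
```

With the default of 2 this behaves the same as the retry_delay > 0 check, and setting warn_from_attempt=1 would restore today's behaviour.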

Workaround we're using

For now we demote only the literal "failed to connect to livekit, retrying in 0s"
record from WARNING to INFO via a logging.Filter on the livekit.agents
logger, leaving every escalating retry and the terminal failure at WARNING so a
genuine outage still alerts.
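
For anyone wanting the same stopgap, a filter along these lines should work. This is a minimal sketch rather than our exact production code: the class name DemoteImmediateReconnect is illustrative, and it assumes the record is emitted directly on the logger named livekit.agents, since filters attached to a logger are not applied to records created on its child loggers.

```python
import logging

NOISY_MESSAGE = "failed to connect to livekit, retrying in 0s"


class DemoteImmediateReconnect(logging.Filter):
    """Relabel the immediate-reconnect WARNING as INFO; never drop records."""

    def filter(self, record: logging.LogRecord) -> bool:
        if (
            record.levelno == logging.WARNING
            and record.getMessage() == NOISY_MESSAGE
        ):
            # Handlers compare record.levelno against their own level, so a
            # WARNING-level alerting handler will no longer see this record.
            record.levelno = logging.INFO
            record.levelname = logging.getLevelName(logging.INFO)
        return True  # keep every record, demoted or not


# Assumption: the worker logs through the "livekit.agents" logger itself.
logging.getLogger("livekit.agents").addFilter(DemoteImmediateReconnect())
```

The match is on the exact 0s message, so the 2s and later retries and the terminal failure pass through unchanged at WARNING.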
