Summary
Worker._run logs the first control-plane websocket reconnect at WARNING
with the message "retrying in 0s", even though that first immediate retry is expected churn
that almost always succeeds. In a long-running deployment this produces a steady
trickle of self-healing WARNINGs that trip warning-level alerting, with nothing
actionable behind them. It would be better to log the immediate first reconnect
below WARNING and only escalate to WARNING once a retry actually has to back
off (i.e. from the second attempt / non-zero delay onward).
Where
livekit/agents/worker.py, Worker._run (livekit-agents 1.5.17):
retry_delay = min(retry_count * 2, 10)
retry_count += 1
logger.warning(
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)
await asyncio.sleep(retry_delay)
and just after a successful connect:
ws = await self._http_session.ws_connect(...)
retry_count = 0
Why it's always 0s
Because retry_count is reset to 0 the moment a connection succeeds, the first
reconnect after any websocket drop computes retry_delay = min(0 * 2, 10) = 0s.
The long-lived control-plane websocket drops fairly often in cloud environments
(LB/proxy idle timeouts, edge nodes recycling, transient network blips). Each drop
emits:
failed to connect to livekit, retrying in 0s
at WARNING, then reconnects immediately and succeeds — so retry_count resets
to 0 and it never escalates. The WARNING is, in practice, "a websocket
blipped and we already recovered."
We see these a few times a day across worker replicas, every one recovering, no
real outage behind any of them.
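To make the arithmetic concrete, here is a small sketch of the delay schedule that formula produces. The helper name retry_delay_for is made up for illustration; only the min(retry_count * 2, 10) expression comes from Worker._run as quoted above.

```python
def retry_delay_for(retry_count: int) -> int:
    """Delay formula quoted above from Worker._run: min(retry_count * 2, 10)."""
    return min(retry_count * 2, 10)


# Delays for consecutive failed attempts, starting from a freshly reset counter.
schedule = [retry_delay_for(n) for n in range(7)]
print(schedule)  # [0, 2, 4, 6, 8, 10, 10]
```

Since a successful connect puts retry_count back at 0, a drop that recovers on the first retry only ever sees the leading 0 in that list.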
Suggested change
Treat the immediate first reconnect as expected and reserve WARNING for retries
that actually have to wait. The simplest fix needs no new config: log below
WARNING when retry_delay == 0 (the first reconnect after retry_count resets)
and at WARNING from the first backed-off attempt onward:
level = logging.WARNING if retry_delay > 0 else logging.DEBUG
logger.log(
    level,
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)
This keeps real, sustained connection problems loud (the escalating 2s/4s/.../10s
retries and the terminal "failed to connect to livekit after N attempts" message) while
removing the self-healing first-hop noise.
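As a rough standalone illustration of the effect, the sketch below replays the retry loop's bookkeeping for an outage of a given length and records which level each attempt would log at under the suggested rule. The function names reconnect_log_level and levels_for_outage are hypothetical and exist only for this example; the real loop lives in Worker._run.

```python
import logging


def reconnect_log_level(retry_delay: int) -> int:
    # Suggested rule: stay quiet for the immediate retry, warn once we back off.
    return logging.WARNING if retry_delay > 0 else logging.DEBUG


def levels_for_outage(failed_attempts: int) -> list[str]:
    """Level names logged for `failed_attempts` consecutive connection failures."""
    levels = []
    retry_count = 0  # freshly reset by the previous successful connect
    for _ in range(failed_attempts):
        retry_delay = min(retry_count * 2, 10)
        retry_count += 1
        levels.append(logging.getLevelName(reconnect_log_level(retry_delay)))
    return levels


print(levels_for_outage(1))  # ['DEBUG']
print(levels_for_outage(3))  # ['DEBUG', 'WARNING', 'WARNING']
```

A single blip stays below WARNING, while anything that fails twice in a row still surfaces at WARNING on the second attempt.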
If you'd rather make it configurable, the natural home is WorkerOptions /
AgentServer — the worker owns this control-plane connection — e.g. a
log_connection_retries toggle, or a minimum-attempt threshold below which
reconnects log at DEBUG. It would not belong at the session level: this
websocket is sessionless and lives in the worker's main process, so a per-session
option could never reach it. Either way the level fix above already covers the
common case without new API surface, so a flag is optional.
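If the threshold variant is preferred, it could look roughly like the sketch below. Everything here is hypothetical: RetryLogPolicy and warn_from_attempt are illustrative names, not existing livekit-agents API, and where such an option would actually be wired into WorkerOptions / AgentServer is left open.

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryLogPolicy:
    # Hypothetical option, not part of livekit-agents: attempts numbered below
    # this threshold log at DEBUG, attempts at or above it log at WARNING.
    warn_from_attempt: int = 2

    def level_for(self, attempt: int) -> int:
        """`attempt` is 1-based, i.e. retry_count after it has been incremented."""
        if attempt >= self.warn_from_attempt:
            return logging.WARNING
        return logging.DEBUG


policy = RetryLogPolicy()
print(logging.getLevelName(policy.level_for(1)))  # DEBUG
print(logging.getLevelName(policy.level_for(2)))  # WARNING
```

With the default of 2 this matches the level fix above; setting the threshold to 1 would restore today's behaviour.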
Workaround we're using
For now we demote only the literal "failed to connect to livekit, retrying in 0s"
record from WARNING to INFO via a logging.Filter on the livekit.agents
logger, leaving every escalating retry and the terminal failure at WARNING so a
genuine outage still alerts.
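For anyone wanting the same workaround, a minimal sketch of such a filter follows. The class name DemoteImmediateReconnect and the constant name are our own; the sketch assumes the record is emitted directly on the livekit.agents logger with exactly the message text quoted above, and it would silently stop matching if the wording changes upstream.

```python
import logging

# Exact message text produced when retry_delay is 0 (see the f-string above).
IMMEDIATE_RETRY_MESSAGE = "failed to connect to livekit, retrying in 0s"


class DemoteImmediateReconnect(logging.Filter):
    """Rewrite the 0s reconnect record from WARNING to INFO; pass all others."""

    def filter(self, record: logging.LogRecord) -> bool:
        if (
            record.levelno == logging.WARNING
            and record.getMessage() == IMMEDIATE_RETRY_MESSAGE
        ):
            record.levelno = logging.INFO
            record.levelname = logging.getLevelName(logging.INFO)
        return True  # never drop the record, only relabel it


# Logger filters only see records logged on this exact logger, not on children.
logging.getLogger("livekit.agents").addFilter(DemoteImmediateReconnect())
```

Logger-level filters run before handlers are called, so a handler whose level is WARNING will skip the demoted record, while one at INFO or lower still emits it.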