Worker logs first control-plane reconnect at WARNING ("retrying in 0s") for expected, self-healing churn #6108

### Summary

`Worker._run` logs the **first** control-plane websocket reconnect at `WARNING`
with `retrying in 0s`, even though that first immediate retry is expected churn
that almost always succeeds. In a long-running deployment this produces a steady
trickle of self-healing `WARNING`s that trip warning-level alerting, with nothing
actionable behind them. It would be better to log the immediate first reconnect
below `WARNING` and only escalate to `WARNING` once a retry actually has to back
off (i.e. from the second attempt / non-zero delay onward).

### Where

`livekit/agents/worker.py`, `Worker._run` (livekit-agents **1.5.17**):

```python
retry_delay = min(retry_count * 2, 10)
retry_count += 1

logger.warning(
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)
await asyncio.sleep(retry_delay)
```

and just after a successful connect:

```python
ws = await self._http_session.ws_connect(...)
retry_count = 0
```

### Why it's always `0s`

Because `retry_count` is reset to `0` the moment a connection succeeds, the first
reconnect after any websocket drop computes `retry_delay = min(0 * 2, 10) = 0s`.
The long-lived control-plane websocket drops fairly often in cloud environments
(LB/proxy idle timeouts, edge nodes recycling, transient network blips). Each drop
emits:

```
failed to connect to livekit, retrying in 0s
```

at `WARNING`, then reconnects immediately and succeeds, so `retry_count` resets
to `0` and the delay never escalates. The `WARNING` is, in practice, "a websocket
blipped and we already recovered."

We see these a few times a day across worker replicas. Every one recovers on its
own, and none has had a real outage behind it.
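
To make the arithmetic concrete, here is a minimal standalone sketch of the delay calculation. It is not the worker's code: `logged_delays` and the `"ok"` / `"fail"` event labels are hypothetical, and it assumes (as described above) that a drop of an established connection goes through the same retry path as a failed connect.

```python
def logged_delays(events: list[str]) -> list[int]:
    """Return the retry delay that would be logged for each event.

    Hypothetical event labels:
      "ok"   - a connect succeeded, then the websocket later dropped
      "fail" - a connect attempt failed outright
    """
    retry_count = 0
    delays = []
    for event in events:
        if event == "ok":
            # Mirrors `retry_count = 0` right after a successful connect.
            retry_count = 0
        # Both kinds of event end in the retry path, which uses the same
        # formula as the snippet quoted from Worker._run.
        delays.append(min(retry_count * 2, 10))
        retry_count += 1
    return delays


# Three isolated drops, each recovering on the first retry.
isolated_drops = logged_delays(["ok", "ok", "ok"])

# One drop followed by six consecutive failed connects.
sustained_outage = logged_delays(["ok"] + ["fail"] * 6)
```

Under these assumptions, isolated drops always log a delay of `0`, while only a sustained failure walks up through the non-zero delays to the `10`-second cap.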

### Suggested change

Treat the immediate first reconnect as expected and reserve `WARNING` for retries
that actually have to wait. The simplest fix needs no new config: log below
`WARNING` when `retry_delay == 0` (the first reconnect after `retry_count` resets)
and at `WARNING` from the first backed-off attempt onward:

```python
level = logging.WARNING if retry_delay > 0 else logging.DEBUG
logger.log(
    level,
    f"failed to connect to livekit, retrying in {retry_delay}s",
    extra={"retry_count": retry_count, "max_retry": self._max_retry, "error": e},
)
```

This keeps real, sustained connection problems loud (escalating `2s/4s/.../10s`
retries and the terminal `failed to connect to livekit after N attempts`) while
removing the self-healing first-hop noise.

If you'd rather make it configurable, the natural home is `WorkerOptions` /
`AgentServer`, since the worker owns this control-plane connection: for example, a
`log_connection_retries` toggle, or a minimum-attempt threshold below which
reconnects log at `DEBUG`. It would not belong at the session level: this
websocket is sessionless and lives in the worker's main process, so a per-session
option could never reach it. Either way, the level fix above already covers the
common case without new API surface, so a flag is optional.
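
For illustration only, the minimum-attempt threshold variant could look roughly like the sketch below. `reconnect_log_level` and `warn_from_attempt` are hypothetical names that do not exist in livekit-agents; the sketch assumes `attempt` is the value of `retry_count` after it has been incremented, so the immediate first reconnect is attempt `1`.

```python
import logging


def reconnect_log_level(attempt: int, warn_from_attempt: int = 2) -> int:
    """Pick the log level for a reconnect attempt.

    Hypothetical helper: attempts below `warn_from_attempt` log at DEBUG,
    everything from that attempt onward logs at WARNING.
    """
    if attempt >= warn_from_attempt:
        return logging.WARNING
    return logging.DEBUG


# With the default threshold, only the immediate first reconnect is quiet.
first = reconnect_log_level(1)
second = reconnect_log_level(2)
```

With the default of `2` this is equivalent to the `retry_delay > 0` check, since the second attempt is the first one with a non-zero delay.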

### Workaround we're using

For now we demote only the literal `failed to connect to livekit, retrying in 0s`
record from `WARNING` to `INFO` via a `logging.Filter` on the `livekit.agents`
logger, leaving every escalating retry and the terminal failure at `WARNING` so a
genuine outage still alerts.
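
A minimal sketch of that filter, assuming the message text matches the log line quoted above exactly. `DemoteImmediateReconnect` and `install_filter` are our own hypothetical names, not part of livekit-agents.

```python
import logging

# Exact text of the zero-delay record, as quoted earlier in this issue.
_ZERO_DELAY_MESSAGE = "failed to connect to livekit, retrying in 0s"


class DemoteImmediateReconnect(logging.Filter):
    """Relabel the zero-delay reconnect record from WARNING to INFO."""

    def filter(self, record: logging.LogRecord) -> bool:
        if (
            record.levelno == logging.WARNING
            and record.getMessage() == _ZERO_DELAY_MESSAGE
        ):
            record.levelno = logging.INFO
            record.levelname = logging.getLevelName(logging.INFO)
        # Never drop a record; this filter only changes its level.
        return True


def install_filter(logger_name: str = "livekit.agents") -> logging.Filter:
    """Attach the filter to the named logger and return it."""
    log_filter = DemoteImmediateReconnect()
    logging.getLogger(logger_name).addFilter(log_filter)
    return log_filter
```

Two caveats apply. A logger-level filter only sees records logged directly through that logger, not through its child loggers. The logger's own level check also runs before filters, so the demoted record still reaches the handlers, and a handler set to `WARNING` will then discard it.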

