Fix/1.9.6 prod observations #168

Open
anderslindho wants to merge 7 commits into master from fix/1.9.6-prod-observations

@anderslindho
Contributor

This MR fixes several issues seen in production at ESS when deploying recCeiver 1.9.6 (and CF 5.1.0) with cleanOnStart active and a freshly wiped DB.

  • Expected ~9 M channels; CF showed ~4.75 M total, ~1.75 M active
  • Active channel count climbed to ~2.1 M then dropped sharply to ~1.55 M within minutes of startup; never recovered
  • recceiver-feb (largest instance, ~2500 IOC network): known_iocs was about 100 during the incident; after a manual restart it climbed to the expected level, then crashed
  • recceiver-ps: dropped from 387 to ~30 known IOCs in ~90 s mid-incident; cause unconfirmed (possibly a container restart, but not verified)
  • 2 recceivers reported 0 IOCs; there may, however, have been no IOCs on those networks
  • the logs showed many channelCount 0 entries, apparently mostly for feb

From the Prometheus exporter:

# HELP recceiver_connections_active Active uploading IOC connections
# TYPE recceiver_connections_active gauge
recceiver_connections_active -1.0

Also in logs:

2026-06-03T12:55:27+0000 [-] INFO:recceiver.application status: connections active=-60328/20 queued=0

Note that this MR adds a recceiver-clean utility. We at ESS have decided to stop using cleanOn*: cleaning is not, and never was, within recCeiver's scope. It was a band-aid that we do not want to keep patching. We will instead run the utility as needed and try to integrate better mechanisms in a future CF version or in a potential CF replacement.

recvDone called isDone(active=True) to free the connection slot, but
never cleared self.active. connectionLost then called isDone(active=True)
again, causing a second decrement or waiter promotion per completed
upload. After N uploads NActive had drifted to -N, so maxActive throttling
was permanently disabled.

Fix: clear self.active in recvDone so connectionLost passes active=False.
Guard isDone against Wait.remove on a proto that is no longer waiting.
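
The double release described above can be reproduced in a few lines. This is a minimal sketch and not recCeiver's actual code: the `Throttle` and `Proto` classes, `acquire`, and `run_uploads` are hypothetical stand-ins; only the names `isDone`, `recvDone`, `connectionLost`, `NActive`, `maxActive`, and `Wait` come from the commit message.

```python
class Throttle:
    """Hypothetical stand-in for the connection limiter (not recCeiver's class)."""

    def __init__(self, max_active):
        self.maxActive = max_active
        self.NActive = 0
        self.Wait = []  # protocols queued for a slot

    def acquire(self, proto):
        if self.NActive < self.maxActive:
            self.NActive += 1
            proto.active = True
        else:
            self.Wait.append(proto)

    def isDone(self, proto, active):
        if active:
            if self.Wait:
                # Hand the freed slot straight to the next waiter.
                self.Wait.pop(0).active = True
            else:
                self.NActive -= 1
        elif proto in self.Wait:
            # Guard: only remove protocols that are actually queued.
            self.Wait.remove(proto)


class Proto:
    def __init__(self, throttle, fixed):
        self.throttle = throttle
        self.fixed = fixed
        self.active = False

    def recvDone(self):
        self.throttle.isDone(self, active=True)
        if self.fixed:
            self.active = False  # the fix: the slot has already been released

    def connectionLost(self):
        # Without the fix, self.active is still True here: second release.
        self.throttle.isDone(self, active=self.active)
        self.active = False


def run_uploads(n, fixed):
    """Run n complete uploads and return the final NActive."""
    throttle = Throttle(max_active=20)
    for _ in range(n):
        proto = Proto(throttle, fixed)
        throttle.acquire(proto)
        proto.recvDone()
        proto.connectionLost()
    return throttle.NActive
```

In this model `run_uploads(5, fixed=False)` ends at -5, and the same run with `fixed=True` ends at 0.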

When connection accounting is corrupted (NActive < 0), log a warning
and report zero rather than the raw negative value. Prevents alerting
rules like 'connections_active > connections_limit' from silently never
firing when the throttle has been bypassed.
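
A minimal sketch of the clamping described above, assuming a plain helper function sits between the counter and the exporter. The name `reported_active` and the logger name are illustrative and not taken from the MR.

```python
import logging

log = logging.getLogger("recceiver.sketch")  # hypothetical logger name


def reported_active(n_active):
    """Return the value to export for the active-connections gauge."""
    if n_active < 0:
        # Negative counts mean the accounting is corrupted; say so loudly
        # instead of exporting a value that looks like "no load".
        log.warning("connection accounting corrupted: NActive=%d", n_active)
        return 0
    return n_active
```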

Per-IOC locks let up to maxActive commits land in parallel. The
cleanOnStart sweep queried CF for active channels, then bulk-wrote
Inactive over all of them, racing against commits that had already
activated channels in the window between query and write.

Restores a single global DeferredLock to serialise all CF writes.
_ioc_channels (per-IOC channel set) is retained: without it a
disconnect extends records_to_delete with all known channels rather
than just the departing IOC's own.
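
The race and the effect of a single write lock can be modelled without Twisted. The sketch below uses `asyncio.Lock` purely as a stand-in for `DeferredLock`, and an in-memory dict as a stand-in for CF; `CFSketch`, `_NoLock`, and `demo` are hypothetical names, and only `_ioc_channels` is borrowed from the commit message.

```python
import asyncio


class _NoLock:
    """Async context manager that does nothing (models unserialised writes)."""

    async def __aenter__(self):
        return None

    async def __aexit__(self, *exc):
        return False


class CFSketch:
    """Hypothetical in-memory stand-in for the CF write path."""

    def __init__(self, serialise):
        self.status = {}         # channel name -> "Active" / "Inactive"
        self._ioc_channels = {}  # IOC id -> set of that IOC's own channels
        self.lock = asyncio.Lock() if serialise else _NoLock()

    async def commit(self, ioc, channels):
        async with self.lock:
            self._ioc_channels.setdefault(ioc, set()).update(channels)
            for name in channels:
                self.status[name] = "Active"
            await asyncio.sleep(0)  # simulated wait for the CF response

    async def clean_on_start(self):
        async with self.lock:
            stale = [n for n, s in self.status.items() if s == "Active"]
            await asyncio.sleep(0)  # the window between query and write
            for name in stale:
                self.status[name] = "Inactive"

    async def disconnect(self, ioc):
        async with self.lock:
            # Only the departing IOC's own channels are touched.
            for name in self._ioc_channels.pop(ioc, set()):
                self.status[name] = "Inactive"


async def demo(serialise):
    cf = CFSketch(serialise)
    cf.status = {"pv:a": "Active"}  # left Active by a previous run
    await asyncio.gather(cf.clean_on_start(), cf.commit("ioc1", ["pv:a"]))
    return cf.status["pv:a"]
```

With `serialise=False` the sweep overwrites the freshly committed channel and `pv:a` ends up Inactive; with `serialise=True` the commit waits for the sweep to finish and `pv:a` ends up Active.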

recceiver-clean provides a safe manual alternative to cleanOnStart for
sites that disable automatic sweeping. It marks all Active channels for
a given recceiver_id Inactive and supports --dry-run to preview the scope.

Usage: recceiver-clean -f recceiver.conf [--recceiver-id ID] [--dry-run]
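
For illustration, the usage line above could map onto a command-line parser roughly as follows. This is a sketch under assumptions, not the utility's implementation: `parse_args` and `clean` are hypothetical, the channel records are plain dicts, and the property names `recceiverID` and `pvStatus` are assumed.

```python
import argparse


def parse_args(argv):
    """Parse only the options that appear in the usage line."""
    parser = argparse.ArgumentParser(prog="recceiver-clean")
    parser.add_argument("-f", dest="config", required=True)
    parser.add_argument("--recceiver-id", dest="recceiver_id", default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(argv)


def clean(channels, recceiver_id, dry_run):
    """Mark one recceiver's Active channels Inactive; return affected names.

    channels is a list of dicts with 'name', 'recceiverID', and 'pvStatus'
    keys (an assumed shape). With dry_run=True nothing is modified.
    """
    targets = [
        c for c in channels
        if c["recceiverID"] == recceiver_id and c["pvStatus"] == "Active"
    ]
    if not dry_run:
        for c in targets:
            c["pvStatus"] = "Inactive"
    return [c["name"] for c in targets]
```

Returning the affected names in both modes lets a dry run report exactly what a real run would change.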
…channel_is_old

If the IOC that last owned a channel has departed between the state
update and the CF push, look it up with .get() and fall back to
_orphan_channel rather than raising KeyError and silently dropping
the channel from the write batch. Same guard applied to the alias
path in the same function.
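
The lookup-with-fallback pattern described above might look like the sketch below. The function `build_batch`, the record layout, and the `orphan_channel` callback signature are assumptions made for illustration; the real `_orphan_channel` may take different arguments.

```python
def build_batch(channel_owner, iocs, orphan_channel):
    """Build a CF write batch without dropping channels of departed IOCs.

    channel_owner: channel name -> id of the IOC that last owned it
    iocs:          IOC id -> info dict for currently known IOCs
    orphan_channel: callable(name) -> record for a channel with no owner
                    (hypothetical signature)
    """
    batch = []
    for name, ioc_id in channel_owner.items():
        ioc = iocs.get(ioc_id)  # None if the IOC departed in the meantime
        if ioc is None:
            batch.append(orphan_channel(name))
        else:
            batch.append(
                {"name": name, "hostName": ioc["hostname"], "pvStatus": "Active"}
            )
    return batch
```

Indexing with `iocs[ioc_id]` instead would raise KeyError on the first departed IOC and lose the remaining channels in the loop.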

The commit path updates self.iocs and channel_ioc_ids before the CF
push. If the push exhausts push_max_retries (_push_to_cf returns False),
in-memory state says the IOC is committed but CF was never written.
The divergence persists until the IOC reconnects.

On retry exhaustion for a connected transaction, evict the IOC from all
in-memory tracking structures. The next commit from that IOC is treated
as an initial upload and re-registers all channels in CF.
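
A minimal sketch of the eviction step, assuming simplified tracking structures. `ProcessorSketch`, `_evict`, and the injected `push` callable are hypothetical; `iocs`, `channel_ioc_ids`, and `_ioc_channels` mirror names from the commit messages but not necessarily their real shapes.

```python
class ProcessorSketch:
    """Hypothetical model of the commit path's in-memory bookkeeping."""

    def __init__(self, push):
        # push(ioc_id, channels) -> True on success, False once retries
        # are exhausted (stands in for _push_to_cf).
        self.push = push
        self.iocs = {}             # IOC id -> info
        self.channel_ioc_ids = {}  # channel name -> list of owning IOC ids
        self._ioc_channels = {}    # IOC id -> set of its channels

    def commit(self, ioc_id, channels):
        """Record the commit, push it, and report whether it was initial."""
        initial = ioc_id not in self.iocs
        self.iocs[ioc_id] = {"channelcount": len(channels)}
        self._ioc_channels[ioc_id] = set(channels)
        for name in channels:
            ids = self.channel_ioc_ids.setdefault(name, [])
            if ioc_id not in ids:
                ids.append(ioc_id)
        if not self.push(ioc_id, channels):
            # CF was never written: forget the IOC so state cannot diverge.
            self._evict(ioc_id)
        return initial

    def _evict(self, ioc_id):
        self.iocs.pop(ioc_id, None)
        for name in self._ioc_channels.pop(ioc_id, set()):
            ids = self.channel_ioc_ids.get(name, [])
            if ioc_id in ids:
                ids.remove(ioc_id)
            if not ids:
                self.channel_ioc_ids.pop(name, None)
```

After a failed push the next commit from the same IOC sees no prior state, so it is handled as an initial upload.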
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026

Quality Gate passed

Issues
3 New issues
5 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Comment thread server/tests/unit/cf/test_processor.py Dismissed