Fix/1.9.6 prod observations#168
Open
anderslindho wants to merge 7 commits into
recvDone called isDone(active=True) to free the connection slot, but never cleared self.active. connectionLost then called isDone(active=True) again, causing a second decrement or waiter promotion per completed upload. After N uploads NActive had drifted to -N and maxActive throttling was permanently disabled. Fix: clear self.active in recvDone so connectionLost passes active=False. Also guard isDone against calling Wait.remove on a proto that is no longer waiting.
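The double release described above can be illustrated with a minimal sketch. The `Throttle` and `Proto` classes and their snake_case methods are hypothetical stand-ins, not recCeiver's actual classes; only the idea (clear the flag on the first release, guard the waiter removal) is taken from the commit message.

```python
class Throttle:
    """Hypothetical connection limiter; stands in for the real accounting."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.n_active = 0
        self.waiting = []

    def acquire(self, proto):
        if self.n_active < self.max_active:
            self.n_active += 1
            proto.active = True
        else:
            self.waiting.append(proto)

    def is_done(self, proto, active):
        if not active:
            # Guard: only remove protos that are actually still queued.
            if proto in self.waiting:
                self.waiting.remove(proto)
            return
        if self.waiting:
            # Hand the slot to the next waiter; the count is unchanged.
            self.waiting.pop(0).active = True
        else:
            self.n_active -= 1


class Proto:
    """Hypothetical protocol object holding at most one slot."""

    def __init__(self, throttle):
        self.throttle = throttle
        self.active = False

    def recv_done(self):
        # Clear the flag before releasing, so a later connection_lost
        # cannot release the same slot a second time.
        active, self.active = self.active, False
        self.throttle.is_done(self, active)

    def connection_lost(self):
        self.throttle.is_done(self, self.active)
```

With the flag cleared in `recv_done`, each upload releases its slot exactly once, so the counter returns to zero no matter how many uploads complete.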
When connection accounting is corrupted (NActive < 0), log a warning and report zero rather than the raw negative value. Prevents alerting rules like 'connections_active > connections_limit' from silently never firing when the throttle has been bypassed.
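A minimal sketch of that clamping, assuming a hypothetical helper `reported_active` and logger name (neither is taken from the recCeiver source):

```python
import logging

# Hypothetical logger name, used only for this illustration.
log = logging.getLogger("recceiver.sketch.metrics")


def reported_active(n_active):
    """Return the value to export for the active-connection gauge."""
    if n_active < 0:
        # A negative count means the accounting is corrupted; make that
        # visible in the logs and export zero instead of the raw value.
        log.warning(
            "connection accounting corrupted: NActive=%d, reporting 0", n_active
        )
        return 0
    return n_active
```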
Per-IOC locks let up to maxActive commits land in parallel. The cleanOnStart sweep queried CF for active channels, then bulk-wrote Inactive over all of them — racing against commits that had already activated channels in the window between query and write. Restores a single global DeferredLock to serialise all CF writes. _ioc_channels (per-IOC channel set) is retained: without it a disconnect extends records_to_delete with all known channels rather than just the departing IOC's own.
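The effect of serialising the sweep and the commits can be sketched as follows. This is an illustration only: `asyncio.Lock` stands in for Twisted's `DeferredLock`, the dictionary `cf` stands in for ChannelFinder, and the function names are hypothetical.

```python
import asyncio


async def sweep(cf, lock):
    """Mark every channel that is Active at query time as Inactive."""
    async with lock:
        stale = [name for name, status in cf.items() if status == "Active"]
        await asyncio.sleep(0)  # simulated gap between query and bulk write
        for name in stale:
            cf[name] = "Inactive"


async def commit(cf, lock, names):
    """Activate the channels of a freshly connected IOC."""
    async with lock:
        for name in names:
            cf[name] = "Active"


async def main():
    cf = {"pv:1": "Active"}  # stale entry left over from a previous run
    lock = asyncio.Lock()    # one global lock shared by all writers
    # The sweep starts first and holds the lock across query and write,
    # so the commit cannot slip into the window between the two.
    await asyncio.gather(sweep(cf, lock), commit(cf, lock, ["pv:1"]))
    return cf
```

In this sketch the commit waits for the sweep to finish and then re-activates `pv:1`. Without the shared lock, the commit could run during the simulated gap and its activation would be overwritten by the sweep's bulk write.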
Adds recceiver-clean, a safe manual alternative to cleanOnStart for sites that disable automatic sweeping. It marks all Active channels for a given recceiver_id Inactive and supports --dry-run to preview the scope.
Usage: recceiver-clean -f recceiver.conf [--recceiver-id ID] [--dry-run]
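As a rough sketch of how such a tool might be structured: the flags below mirror the usage line, but the parser, the `mark_inactive` helper, and the dictionary keys `name`, `recceiverID` and `pvStatus` are assumptions made for illustration, and the in-memory list stands in for a real ChannelFinder query.

```python
import argparse


def build_parser():
    """Parser mirroring the usage line; the real tool may differ."""
    parser = argparse.ArgumentParser(prog="recceiver-clean")
    parser.add_argument("-f", dest="config", required=True,
                        help="path to recceiver.conf")
    parser.add_argument("--recceiver-id", default=None,
                        help="override the recceiver id from the config")
    parser.add_argument("--dry-run", action="store_true",
                        help="only report which channels would change")
    return parser


def mark_inactive(channels, recceiver_id, dry_run):
    """Return the names in scope; modify them only when not a dry run."""
    targets = [
        c for c in channels
        if c["recceiverID"] == recceiver_id and c["pvStatus"] == "Active"
    ]
    if not dry_run:
        for c in targets:
            c["pvStatus"] = "Inactive"
    return [c["name"] for c in targets]
```

Returning the affected names in both modes lets a dry run print exactly the scope that a real run would touch.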
…channel_is_old If the IOC that last owned a channel has departed between the state update and the CF push, look it up with .get() and fall back to _orphan_channel rather than raising KeyError and silently dropping the channel from the write batch. Same guard applied to the alias path in the same function.
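The lookup pattern can be sketched as below. The function `build_update`, the record layout, and the `orphan_channel` callable are hypothetical; the actual behaviour of `_orphan_channel` is not shown in this PR text.

```python
def build_update(channel, last_ioc_id, iocs, orphan_channel):
    """Build the CF write for a channel whose last owner may have departed."""
    ioc_info = iocs.get(last_ioc_id)  # None instead of KeyError if it is gone
    if ioc_info is None:
        # Fall back to the orphan handling so the channel stays in the batch.
        return orphan_channel(channel)
    return {"name": channel, "owner": ioc_info["owner"], "status": "Active"}
```

Using `.get()` keeps the channel in the write batch either way, rather than losing it to an exception raised partway through building the batch.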
The commit path updates self.iocs and channel_ioc_ids before the CF push. If the push exhausts push_max_retries (_push_to_cf returns False), in-memory state says the IOC is committed but CF was never written. The divergence persists until the IOC reconnects. On retry exhaustion for a connected transaction, evict the IOC from all in-memory tracking structures. The next commit from that IOC is treated as an initial upload and re-registers all channels in CF.
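A minimal sketch of the eviction, assuming in-memory state shaped like the two structures named above. The `commit` function, the `state` dictionary layout, and the `push_to_cf` callable are hypothetical simplifications, not the processor's real interface.

```python
def commit(state, ioc_id, channels, push_to_cf):
    """Record an IOC's channels, push to CF, and evict the IOC on failure."""
    # In-memory state is updated before the push, as in the commit path.
    state["iocs"][ioc_id] = {"channels": set(channels)}
    for ch in channels:
        ids = state["channel_ioc_ids"].setdefault(ch, [])
        if ioc_id not in ids:
            ids.append(ioc_id)

    if push_to_cf(channels):  # False models exhausted retries
        return True

    # Push failed: forget the IOC everywhere so that its next commit is
    # treated as an initial upload and re-registers all channels.
    state["iocs"].pop(ioc_id, None)
    for ch in list(state["channel_ioc_ids"]):
        ids = state["channel_ioc_ids"][ch]
        if ioc_id in ids:
            ids.remove(ioc_id)
        if not ids:
            del state["channel_ioc_ids"][ch]
    return False
```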

This MR fixes a number of issues seen in production at ESS when deploying recCeiver 1.9.6 (and CF 5.1.0) with cleanOnStart active and a freshly wiped DB.
known_iocs: ca. 100 during the incident; after a manual restart it climbed to the expected level, then crashed.
From the Prometheus exporter:
Also in logs:
Note that this MR adds a recceiver-clean utility. This is because we at ESS have decided to not use cleanOn* anymore - this is not, and never was, recCeiver's scope. It was a band-aid which we do not want to attempt fixing any further. We will instead use the utility as needed, and try to integrate better mechanisms in a future CF version or any potential CF replacement.