Fix on_up() destroying healthy pool after replace-with-same-IP #879
bitpathfinder wants to merge 1 commit
Pull request overview
Addresses SCYLLADB-833 where Cluster.on_up() can be invoked with a stale Host instance after a replace-with-same-IP, causing the driver to tear down a healthy pool and briefly fail queries.
Changes:
- Add early-return guards in `Cluster.on_up()` to skip handling when the `Host` reference is stale or when a healthy pool already exists.
- Add unit tests covering stale-host and healthy-pool scenarios to prevent regressions.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cassandra/cluster.py | Adds on_up() guard clauses to avoid tearing down an already-healthy pool when handling stale host references. |
| tests/unit/test_cluster.py | Adds unit tests validating the new on_up() early-return behavior for stale hosts and pre-existing healthy pools. |
Blocking: the healthy-pool early return at python-driver/cassandra/cluster.py (line 1944 in c9b8e5e) exits the on_up() handler as soon as any session has a live pool, but on_up() still needs to do per-session reconciliation. In a multi-session cluster, later sessions never get their remove/rebuild or update_created_pools() pass, so a stale UP event can leave the driver half-up.
The new tests only cover a single mock session at python-driver/tests/unit/test_cluster.py (line 784 in c9b8e5e) […] on_up() bookkeeping running.
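The multi-session concern above can be illustrated with a toy model. This is a minimal sketch, not the driver's code: the dict-based pool representation and the helper names `has_live_pool` and `sessions_needing_rebuild` are hypothetical, chosen only to show how an "any session" check differs from a per-session check.

```python
def has_live_pool(pools, host_key):
    """True if this session's pool map holds a non-shutdown pool for host_key."""
    pool = pools.get(host_key)
    return pool is not None and not pool["is_shutdown"]


def sessions_needing_rebuild(host_key, session_pools):
    """Return the indices of sessions that lack a live pool for host_key."""
    return [
        index
        for index, pools in enumerate(session_pools)
        if not has_live_pool(pools, host_key)
    ]


# Two sessions: the first already has a live pool, the second has none.
session_pools = [
    {"10.0.0.1": {"is_shutdown": False}},
    {},
]

# An "any" check would treat the host as fully handled...
any_live = any(has_live_pool(pools, "10.0.0.1") for pools in session_pools)
# ...while an "all" check, or a per-session pass, still sees work to do.
all_live = all(has_live_pool(pools, "10.0.0.1") for pools in session_pools)
pending = sessions_needing_rebuild("10.0.0.1", session_pools)
```

In this model `any_live` is true while `all_live` is false, and `pending` names the second session, which is the case a single-mock-session test cannot exercise.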
When a node is replaced with the same IP, the driver receives both TOPOLOGY_CHANGE NEW_NODE and STATUS_CHANGE UP events. The NEW_NODE handler runs first, replacing the old host and establishing a new pool. The STATUS_CHANGE UP handler fires later with a stale reference to the old host object. Because Host.__eq__/__hash__ are endpoint-based, the stale on_up() tears down the new host's pool, causing a brief window where queries fail with NoHostAvailable.

Add two guards at the top of on_up():

1. If the host has been replaced in metadata (different object, same endpoint, new host already up), skip processing.
2. If all sessions already have a healthy (non-shutdown) pool for this host, call set_up() and skip the teardown/rebuild cycle.

Both guards reset _currently_handling_node_up under host.lock and use host.set_up() (which resets the conviction policy), consistent with the existing cleanup paths.

Refs: SCYLLADB-833
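The two guards described above could be sketched roughly as follows. This is an illustrative model under stated assumptions, not the code in cassandra/cluster.py: `ToyHost`, `ToyPool`, `should_skip_on_up`, and the `hosts_by_endpoint` and `session_pools` arguments are hypothetical stand-ins for the driver's host, pool, metadata, and session structures.

```python
import threading


class ToyHost:
    """Stand-in for a host whose equality and hash use only the endpoint."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.is_up = False
        self.lock = threading.RLock()
        self._currently_handling_node_up = False

    def __eq__(self, other):
        return isinstance(other, ToyHost) and self.endpoint == other.endpoint

    def __hash__(self):
        return hash(self.endpoint)

    def set_up(self):
        self.is_up = True


class ToyPool:
    def __init__(self, is_shutdown=False):
        self.is_shutdown = is_shutdown


def _finish_early(host):
    """Shared cleanup: clear the in-progress flag under the lock, mark host up."""
    with host.lock:
        host._currently_handling_node_up = False
    host.set_up()


def should_skip_on_up(host, hosts_by_endpoint, session_pools):
    """Return True when the teardown/rebuild cycle can be skipped.

    hosts_by_endpoint: endpoint -> current ToyHost (assumed metadata view)
    session_pools: one dict per session mapping ToyHost -> ToyPool
    """
    current = hosts_by_endpoint.get(host.endpoint)

    # Guard 1: a different object now owns this endpoint and is already up.
    if current is not None and current is not host and current.is_up:
        _finish_early(host)
        return True

    # Guard 2: every session already has a non-shutdown pool for this host.
    pools = [per_session.get(host) for per_session in session_pools]
    if pools and all(p is not None and not p.is_shutdown for p in pools):
        _finish_early(host)
        return True

    return False
```

Guard 1 compares by identity (`is not`) rather than equality, since in this model, as in the description above, two host objects for the same endpoint compare equal.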
@bitpathfinder - have all review comments been answered? Can this be re-reviewed?
Summary
When a node is replaced with the same IP, the driver receives both `TOPOLOGY_CHANGE NEW_NODE` and `STATUS_CHANGE UP` events. The `NEW_NODE` handler runs first, replacing the old host and establishing a new pool via `on_add`. The `STATUS_CHANGE UP` handler fires later (after a random 0-2s delay) with a stale reference to the old host object. Because `Host.__eq__`/`__hash__` are endpoint-based, the stale `on_up()` tears down the new host's pool, causing a brief window where queries fail with `NoHostAvailable`.

Fix

Two early-return guards added at the top of `Cluster.on_up()`:

[…] `on_add`), mark the host as up and skip the teardown/rebuild cycle.

Fixes: SCYLLADB-833
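To see why endpoint-based equality lets a stale object reach the new host's pool, consider the small model below. It is a sketch under assumptions: `EndpointKeyedHost`, its `generation` label, and the plain dict used as a pool map are hypothetical and stand in for the driver's real structures.

```python
class EndpointKeyedHost:
    """Toy host: equality and hash depend only on the endpoint."""

    def __init__(self, endpoint, generation):
        self.endpoint = endpoint
        self.generation = generation  # illustrative label only

    def __eq__(self, other):
        return (
            isinstance(other, EndpointKeyedHost)
            and self.endpoint == other.endpoint
        )

    def __hash__(self):
        return hash(self.endpoint)


old_host = EndpointKeyedHost("10.0.0.1", generation="old")
new_host = EndpointKeyedHost("10.0.0.1", generation="new")

# The pool map is keyed by the replacement host object.
pools = {new_host: "pool for replacement node"}

# A handler holding only the stale object still matches that key,
# so removing "its" pool removes the replacement's pool instead.
removed = pools.pop(old_host, None)
```

After the `pop`, `removed` holds the replacement node's pool and `pools` is empty, even though the handler only ever referenced `old_host`. Distinguishing the two objects requires an identity check (`old_host is new_host`), not `==`.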