Fix on_up() destroying healthy pool after replace-with-same-IP by bitpathfinder · Pull Request #879 · scylladb/python-driver

bitpathfinder · 2026-05-07T16:30:24Z

Summary

When a node is replaced with the same IP, the driver receives both TOPOLOGY_CHANGE NEW_NODE and STATUS_CHANGE UP events. The NEW_NODE handler runs first, replacing the old host and establishing a new pool via on_add. The STATUS_CHANGE UP handler fires later (after a random 0-2s delay) with a stale reference to the old host object. Because Host.__eq__/__hash__ are endpoint-based, the stale on_up() tears down the new host's pool, causing a brief window where queries fail with NoHostAvailable.
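The endpoint-based equality described above can be illustrated with a minimal sketch. The `Endpoint` and `Host` classes here are simplified, hypothetical stand-ins and not the driver's own classes; the only property taken from the description is that equality and hashing depend on the endpoint alone.

```python
class Endpoint:
    """Hypothetical stand-in for a node endpoint (address + port)."""

    def __init__(self, address, port=9042):
        self.address = address
        self.port = port

    def __eq__(self, other):
        return (isinstance(other, Endpoint)
                and (self.address, self.port) == (other.address, other.port))

    def __hash__(self):
        return hash((self.address, self.port))


class Host:
    """Simplified host whose equality and hash are endpoint-based."""

    def __init__(self, endpoint, host_id):
        self.endpoint = endpoint
        self.host_id = host_id

    def __eq__(self, other):
        return isinstance(other, Host) and self.endpoint == other.endpoint

    def __hash__(self):
        return hash(self.endpoint)


# The replaced node and its replacement share an IP but are distinct objects.
old_host = Host(Endpoint("10.0.0.5"), host_id="old-id")
new_host = Host(Endpoint("10.0.0.5"), host_id="new-id")

# A pool map keyed by host, populated for the NEW host only.
pools = {new_host: "pool-for-new-host"}

# A handler holding the stale object still finds (and could tear down)
# the new host's pool, because dict lookup uses __eq__/__hash__.
stale_lookup = pools.get(old_host)
is_same_object = old_host is new_host
```

Only an identity comparison (`is`) distinguishes the two objects here, which is why a stale-host check would have to compare identity rather than equality.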

Fix

Two early-return guards added at the top of Cluster.on_up():

Stale host check: If the host object has been replaced in metadata (different identity, same endpoint) and the new host is already up, skip processing.
Healthy pool check: If a non-shutdown pool already exists for this host (established by on_add), mark the host as up and skip the teardown/rebuild cycle.
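The two guards can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `should_skip_on_up`, the simplified `Host` and `Pool` classes, and the returned reason strings are hypothetical names introduced here.

```python
class Host:
    """Simplified host; is_up is None until a status is known."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.is_up = None

    def set_up(self):
        self.is_up = True


class Pool:
    """Simplified connection pool with a shutdown flag."""

    def __init__(self, is_shutdown=False):
        self.is_shutdown = is_shutdown


def should_skip_on_up(host, current_host, pool):
    """Return a reason to skip on_up processing, or None to proceed.

    host         -- the object the UP event was scheduled with
    current_host -- the object currently registered in metadata
    pool         -- the existing pool for this host, if any
    """
    # Guard 1 (stale host): metadata holds a different object for the
    # same endpoint, and that replacement is already marked up.
    if current_host is not None and current_host is not host and current_host.is_up:
        return "stale-host"

    # Guard 2 (healthy pool): a non-shutdown pool already exists, so
    # mark the host up and avoid the teardown/rebuild cycle.
    if pool is not None and not pool.is_shutdown:
        host.set_up()
        return "healthy-pool"

    return None


old = Host("10.0.0.5")
new = Host("10.0.0.5")
new.set_up()

r_stale = should_skip_on_up(old, new, Pool())
r_healthy = should_skip_on_up(new, new, Pool())
r_shutdown = should_skip_on_up(new, new, Pool(is_shutdown=True))
r_no_pool = should_skip_on_up(new, new, None)
```

In this sketch a shut-down or missing pool falls through to normal processing, so a genuinely restarted node would still get its pool rebuilt.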

Fixes: SCYLLADB-833

Copilot

Pull request overview

Addresses SCYLLADB-833 where Cluster.on_up() can be invoked with a stale Host instance after a replace-with-same-IP, causing the driver to tear down a healthy pool and briefly fail queries.

Changes:

Add early-return guards in Cluster.on_up() to skip handling when the Host reference is stale or when a healthy pool already exists.
Add unit tests covering stale-host and healthy-pool scenarios to prevent regressions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
cassandra/cluster.py	Adds `on_up()` guard clauses to avoid tearing down an already-healthy pool when handling stale host references.
tests/unit/test_cluster.py	Adds unit tests validating the new `on_up()` early-return behavior for stale hosts and pre-existing healthy pools.


dkropachev · 2026-05-19T18:09:48Z

Blocking: the healthy-pool early return at python-driver/cassandra/cluster.py, line 1944 in c9b8e5e (`for session in tuple(self.sessions):`), is too broad. It exits the entire on_up() handler as soon as any session has a live pool, but on_up() still needs to do per-session reconciliation. In a multi-session cluster, later sessions never get their remove/rebuild or update_created_pools() pass, so a stale UP event can leave the driver half-up.

The new tests only cover a single mock session at python-driver/tests/unit/test_cluster.py, line 784 in c9b8e5e (`cluster = self._make_cluster(sessions={mock_session})`), so this case stays untested. I'd narrow the skip to the session that already has the healthy pool, or keep the rest of on_up() bookkeeping running.
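One way to read the suggestion to narrow the skip to individual sessions is sketched below. The `Session` class, `reconcile_sessions`, and `rebuild_pool` are hypothetical names for illustration; the real driver's session and pool handling is more involved.

```python
class Pool:
    """Simplified connection pool with a shutdown flag."""

    def __init__(self, is_shutdown=False):
        self.is_shutdown = is_shutdown


class Session:
    """Hypothetical session holding at most one pool per host key."""

    def __init__(self, name, pools=None):
        self.name = name
        self.pools = dict(pools or {})
        self.rebuilt = []  # records hosts whose pool was rebuilt

    def has_healthy_pool(self, host):
        pool = self.pools.get(host)
        return pool is not None and not pool.is_shutdown

    def rebuild_pool(self, host):
        self.pools[host] = Pool()
        self.rebuilt.append(host)


def reconcile_sessions(sessions, host):
    """Skip only sessions that already have a healthy pool for host.

    Every other session still goes through its rebuild step, so one
    healthy session cannot short-circuit the rest.
    """
    skipped = []
    for session in tuple(sessions):
        if session.has_healthy_pool(host):
            skipped.append(session.name)
            continue
        session.rebuild_pool(host)
    return skipped


host = "10.0.0.5"
s1 = Session("s1", {host: Pool()})  # pool already established
s2 = Session("s2")                  # no pool yet
skipped = reconcile_sessions([s1, s2], host)
```

A two-session setup like this one is also the shape of test the comment says is missing: the first session is left alone while the second still gets its pool.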

When a node is replaced with the same IP, the driver receives both TOPOLOGY_CHANGE NEW_NODE and STATUS_CHANGE UP events. The NEW_NODE handler runs first, replacing the old host and establishing a new pool. The STATUS_CHANGE UP handler fires later with a stale reference to the old host object. Because Host.__eq__/__hash__ are endpoint-based, the stale on_up() tears down the new host's pool, causing a brief window where queries fail with NoHostAvailable.

Add two guards at the top of on_up():

1. If the host has been replaced in metadata (different object, same endpoint, new host already up), skip processing.
2. If all sessions already have a healthy (non-shutdown) pool for this host, call set_up() and skip the teardown/rebuild cycle.

Both guards reset _currently_handling_node_up under host.lock and use host.set_up() (which resets the conviction policy), consistent with the existing cleanup paths.

Refs: SCYLLADB-833
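The "all sessions" variant of the healthy-pool guard, together with the flag reset under the host lock, might look roughly like this. It is a sketch under assumptions: `all_sessions_healthy` and `finish_without_rebuild` are hypothetical helper names, the classes are simplified stand-ins, and treating an empty session list as "not healthy" is a choice made here rather than something the commit message states.

```python
import threading


class Host:
    """Simplified host with the lock and in-progress flag named above."""

    def __init__(self):
        self.lock = threading.RLock()
        self.is_up = False
        self._currently_handling_node_up = True  # set when on_up starts

    def set_up(self):
        self.is_up = True


class Pool:
    def __init__(self, is_shutdown=False):
        self.is_shutdown = is_shutdown


class Session:
    def __init__(self, pools=None):
        self.pools = dict(pools or {})


def all_sessions_healthy(sessions, host_key):
    """True only if every session has a non-shutdown pool for host_key."""
    sessions = tuple(sessions)
    # all() over an empty sequence is True, so require at least one session.
    return bool(sessions) and all(
        host_key in s.pools and not s.pools[host_key].is_shutdown
        for s in sessions
    )


def finish_without_rebuild(host):
    """Mark the host up and clear the in-progress flag under its lock."""
    host.set_up()
    with host.lock:
        host._currently_handling_node_up = False


host = Host()
sessions = [Session({"10.0.0.5": Pool()}), Session({"10.0.0.5": Pool()})]
mixed = [Session({"10.0.0.5": Pool()}), Session()]

if all_sessions_healthy(sessions, "10.0.0.5"):
    finish_without_rebuild(host)

mixed_result = all_sessions_healthy(mixed, "10.0.0.5")
empty_result = all_sessions_healthy([], "10.0.0.5")
```

With this shape, a cluster where only some sessions have a pool does not take the early return, which is the multi-session case raised in the review.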

mykaul · 2026-06-01T11:25:38Z

@bitpathfinder - have all review comments been replied to? Can it be re-reviewed?

bitpathfinder requested review from Copilot and dkropachev and removed request for dkropachev May 7, 2026 16:56

github-actions Bot added P1 symptom/ci_stability labels May 7, 2026

Copilot started reviewing on behalf of bitpathfinder May 7, 2026 16:57

bitpathfinder changed the title ~~Fix on_up() destroying healthy pool after replace-with-same-IP (SCYLLADB-833)~~ Fix on_up() destroying healthy pool after replace-with-same-IP May 7, 2026

bitpathfinder force-pushed the fix/scylladb-833-on-up-stale-host branch from c1f7d6b to 582e86d Compare May 7, 2026 17:00

Copilot AI reviewed May 7, 2026


bitpathfinder force-pushed the fix/scylladb-833-on-up-stale-host branch from 582e86d to c9b8e5e Compare May 7, 2026 17:05

bitpathfinder self-assigned this May 7, 2026

bitpathfinder marked this pull request as ready for review May 7, 2026 17:16

bitpathfinder marked this pull request as draft May 7, 2026 17:54

bitpathfinder marked this pull request as ready for review May 11, 2026 11:05

bitpathfinder requested review from Lorak-mmk and dkropachev May 11, 2026 11:26

bitpathfinder force-pushed the fix/scylladb-833-on-up-stale-host branch from c9b8e5e to 3df9e28 Compare May 22, 2026 14:33

bitpathfinder wants to merge 1 commit into scylladb:master from bitpathfinder:fix/scylladb-833-on-up-stale-host

4 participants
