
Export Partition - release the part lock when the query is cancelled#1593

Open
arthurpassos wants to merge 4 commits into antalya-26.1 from fix_query_cancelled_not_releasing_the_lock

Conversation

@arthurpassos
Collaborator

@arthurpassos arthurpassos commented Mar 27, 2026

During export partition, parts are locked by the replicas performing the exports. This PR introduces a change that releases these locks when an export task is cancelled; previously, the lock was not released. We did not catch this error earlier because the only cancellation cases we tested were KILL EXPORT PARTITION and DROP TABLE. In those cases the entire task is cancelled, so it does not matter whether a replica releases its lock.

But a query can also be cancelled with SYSTEM STOP MOVES, which is a local operation: only the issuing replica stops. In that case the lock must be released so other replicas can continue the export.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  1. During an export partition operation, release the part lock when the export task has been cancelled.
  2. Do not run the export scheduler while moves are stopped.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Mar 27, 2026

Workflow [PR], commit [022d159]

@arthurpassos arthurpassos added the port-antalya (PRs to be ported to all new Antalya releases) and antalya-26.1 labels Mar 27, 2026
@arthurpassos
Collaborator Author

@codex review


@ianton-ru ianton-ru Mar 30, 2026


Nit: maybe move this log record below the early exit, to avoid logging "will now increment" followed by "oh, sorry, will not"? I think the second line alone would be enough to understand what is going on.

Collaborator Author


I removed the "will now increment counters" part.

ianton-ru
ianton-ru previously approved these changes Mar 30, 2026
@Selfeer
Collaborator

Selfeer commented Mar 30, 2026

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1593 (Export Partition - release part lock when query is cancelled):

Confirmed defects

Medium: Cancellation path ignores lock-release failure and can leave export stuck

Impact: A cancelled part export can remain locked by the same replica if ZooKeeper remove fails transiently; subsequent scheduler cycles skip the part as "locked", so export may remain pending until ZooKeeper session expiry/manual intervention.

Anchor: src/Storages/MergeTree/ExportPartitionTaskScheduler.cpp / ExportPartitionTaskScheduler::handlePartExportFailure.

Trigger: QUERY_WAS_CANCELLED while lock exists, combined with transient ZooKeeper failure (or version mismatch race) during tryRemove.

Affected transition: part task cancelled -> cancellation handler -> lock release -> part becomes schedulable.

Why defect: The new cancellation branch performs a best-effort tryRemove and returns unconditionally, without checking/remediating failure. This violates the lock-release invariant required for resumed scheduling after SYSTEM START MOVES.

Affected subsystem and blast radius: Replicated MergeTree export-partition scheduler state in ZooKeeper (exports/<key>/locks/<part>); impacts recovery/progress of pending partition exports on affected replica.

Smallest logical reproduction steps: Start EXPORT PARTITION, issue SYSTEM STOP MOVES, force cancellation callback, inject ZooKeeper remove failure for exports/<key>/locks/<part>, then SYSTEM START MOVES; observe part remains skipped due to stale lock.

Logical fault-injection mapping: Inject fault category "coordination write failure on cancellation transition" at zk->tryRemove(...locks/part...) in cancellation branch.

Fix direction (short): Check tryRemove result and handle failure explicitly (retry/backoff or fallback cleanup path), with observability and a re-drive mechanism.

Regression test direction (short): Add integration/fault-injection test that forces tryRemove failure on cancellation and verifies lock cleanup eventually occurs and export transitions to COMPLETED.

Code evidence:

if (exception->code() == ErrorCodes::QUERY_WAS_CANCELLED)
{
    zk->tryRemove(export_path / "locks" / part_name, locked_by_stat.version);
    LOG_INFO(storage.log, "ExportPartition scheduler task: Part {} export was cancelled, skipping error handling", part_name);
    return;
}

Coverage summary

Scope reviewed: Full PR delta for src/Storages/MergeTree/ExportPartitionTaskScheduler.cpp and tests/integration/test_export_replicated_mt_partition_to_object_storage/test.py, including cancellation/failure transitions, ZooKeeper lock lifecycle, scheduler gating under stopped moves, and resumed-export behavior.

Categories failed: Cancellation-transition durability under coordination failures (lock release failure handling).

Categories passed: Scheduler stop/start gating (available move executors path), lock ownership checks before mutation, non-cancellation failure retry/update flow, cancellation regression test intent for normal-path unlock/resume, concurrency review of lock-versioned remove path (no new deadlock/data-race confirmed).

Assumptions/limits: Static deep audit only (no runtime execution/CI); fault-injection outcomes were reasoned from code paths and ZooKeeper API contracts.

@Selfeer
Collaborator

Selfeer commented Mar 30, 2026

Added a test that covers this for export partition: https://github.com/Altinity/clickhouse-regression/blob/6ca30d8082b91c4629f29d57cc917083d5b70f23/s3/tests/export_partition/network.py#L483

It is passing with this build locally:

Passing

✔ [ OK ] '/s3/minio/export tests/export partition/network/export resumes after stop start moves' (52s 109ms)
✔ [ OK ] '/s3/minio/export tests/export partition/network' (52s 116ms)
✔ [ OK ] '/s3/minio/export tests/export partition' (1m 36s)
✔ [ OK ] '/s3/minio/export tests' (1m 36s)
✔ [ OK ] '/s3/minio' (2m 25s)
✔ [ OK ] '/s3' (2m 25s)

Running as:

./regression.py --clickhouse docker://altinityinfra/clickhouse-server:1593-26.1.6.20001.altinityantalya --only "/s3/minio/export tests/export partition/network/*" --clickhouse-version 26.1.6.20001 -l test.log

@Selfeer
Collaborator

Selfeer commented Mar 30, 2026

PR #1593 CI Triage — Export Partition: release the part lock when the query is cancelled

Field Value
PR #1593 (fix_query_cancelled_not_releasing_the_lock → antalya-26.1)
Author @arthurpassos
Commit 4fef9d7e7c353eac608c4520d0208647e789c536
Workflow Run 23748902706
CI Report ci_run_report.html
Date 2026-03-30
Labels port-antalya, antalya-26.1, antalya-26.1.6.20001
Merge Status MERGEABLE

PR Description

Bug fix for export partition: releases the part lock when an export task is cancelled (e.g. via SYSTEM STOP MOVES). Previously, the lock was not released on cancellation, preventing other replicas from continuing the export. Also gates the scheduler from running when moves are stopped.

Changed files (2):

  • src/Storages/MergeTree/ExportPartitionTaskScheduler.cpp
  • tests/integration/test_export_replicated_mt_partition_to_object_storage/test.py

Summary

Category Count Details
PR-caused regression 0
Pre-existing flaky 3 03608_export_merge_tree_part_filename_pattern, 04003_cast_nullable_read_in_order_explain, test_partition_timezone
Pre-existing known (tracked) 4 Swarms node failure tests (#1601)
Pre-existing known (XFail) 31 FULL OUTER JOIN swarm joins (ClickHouse#89996)
Docker CVE scan 1 CVE-2026-2673 (High) in alpine image

Verdict: No failures are caused by this PR. All CI failures are pre-existing or infrastructure-related.

PR-targeted regression suites all passed: s3_export_partition (x86_64 + aarch64), s3_export_part (x86_64 + aarch64), iceberg (x86_64 + aarch64), parquet (x86_64 + aarch64) — all green.


CI Jobs Status

Job Name Status Failure Type
Grype Scan (alpine image) ❌ failure Docker CVE — CVE-2026-2673 (High)
Integration tests (amd_asan, db disk, old analyzer, 3/6) ❌ failure Pre-existing flaky — Docker port collision
Regression aarch64 swarms ❌ failure Pre-existing — known 26.1 issue (#1601)
Regression release swarms ❌ failure Pre-existing — known 26.1 issue (#1601)
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) ❌ failure Pre-existing flaky
Stateless tests (amd_debug, distributed plan, s3 storage, parallel) ❌ failure Pre-existing flaky
Regression aarch64 s3_export_partition success PR-targeted suite
Regression release s3_export_partition success PR-targeted suite
Regression aarch64 s3_export_part success PR-targeted suite
Regression release s3_export_part success PR-targeted suite
All other jobs (50+) ✅ success

Detailed Failure Analysis

1. 03608_export_merge_tree_part_filename_pattern — Pre-existing Flaky

2. 04003_cast_nullable_read_in_order_explain — Pre-existing Flaky

3. test_database_iceberg/test_partition_timezone.py::test_partition_timezone — Infrastructure / Flaky

4. Regression Swarms — Pre-existing Known Issue

Failing tests (both x86_64 and aarch64):

  • /swarms/feature/node failure/check restart clickhouse on swarm node — Error (ExpectTimeoutError 600s)
  • /swarms/feature/node failure/initiator out of disk space — Fail (Database datalakecatalog_db_... does not exist)

Overall swarms stats: 1520 scenarios (1487 ok, 1 failed, 1 errored, 31 xfail)

5. Grype Scan — Docker CVE


PR-Targeted Regression Suites

All regression suites specifically relevant to the export partition feature passed on both architectures:

Suite x86_64 aarch64
s3_export_partition ✅ 4 features, 7 scenarios — all OK (5m 18s) ✅ 4 features, 7 scenarios — all OK (4m 57s)
s3_export_part ✅ 17 features, 87 scenarios — all OK (1h 45m) ✅ 17 features, 87 scenarios — all OK (1h 44m)
iceberg_1 ✅ 42 features, 487 scenarios (1h 6m) ✅ 42 features, 487 scenarios (1h 24m)
iceberg_2 ✅ 29 features, 1576 scenarios (1h 21m) ✅ 29 features, 1576 scenarios (1h 41m)
parquet ✅ 39 features (57m) ✅ 39 features (1h 15m)

Additionally, the PR author confirmed local pass of the export partition network test (/s3/minio/export tests/export partition/network/export resumes after stop start moves).


Recommendations

  1. Merge-ready from CI perspective — No PR-caused regressions detected. All targeted regression suites pass.
  2. Pre-existing issues to track separately:
  3. Audit note from @Selfeer: An AI audit identified a medium-severity concern about cancellation path ignoring lock-release failure if ZooKeeper tryRemove fails transiently. This is a code-quality consideration, not a CI regression.


Labels

antalya-26.1, antalya-26.1.6.20001, port-antalya (PRs to be ported to all new Antalya releases), verified-with-issue (Verified by QA and issue(s) found)
