AIP-103: Adding periodic task state garbage collection and retention support #66463
amoghrajesh wants to merge 9 commits into
Conversation
jason810496
left a comment
Would it be better to introduce batching / pagination for the task state garbage collection?
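For reference, batching here could look roughly like the sketch below. All names in it, including the surrogate `id` column, the batch size, and the session handling, are assumptions for illustration; the PR's own `_delete_batched` helper may differ.

```python
# Minimal sketch of a batched delete: remove matching rows in fixed-size chunks so a
# single huge DELETE does not hold locks or bloat one transaction. The "id" column,
# batch size, and session handling are assumptions, not the PR's actual code.
from sqlalchemy import delete, select


def _delete_batched(session, condition, batch_size: int = 5000) -> int:
    total = 0
    while True:
        # Grab one batch of primary keys matching the retention condition.
        ids = session.scalars(
            select(TaskStateModel.id).where(condition).limit(batch_size)
        ).all()
        if not ids:
            break
        session.execute(delete(TaskStateModel).where(TaskStateModel.id.in_(ids)))
        session.commit()
        total += len(ids)
    return total
```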
```python
            break
    return total


deleted = _delete_batched(TaskStateModel.expires_at < now)
```
I wonder if it’d be a good idea to calculate the actual expiration on the fly instead. If I’m understanding correctly, this currently relies on the `expires_at` column being correctly updated whenever `updated_at` is updated (if the former is not set explicitly). This seems a bit fragile.
`expires_at` is set once at write time in every `set()` call and is never updated independently of the row — there's no dependency on `updated_at` being in sync with it. If you call `set()` again on the same key, the upsert recalculates and overwrites both `updated_at` and `expires_at` together atomically.
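As a rough illustration of that write-time behaviour (not the PR's actual code; the helper name, the Postgres-style upsert, and the way the retention window is passed in are all assumptions here):

```python
# Hypothetical sketch: expires_at is computed at write time, together with updated_at,
# and both are overwritten on conflict. Not the PR's actual implementation.
from datetime import timedelta

from airflow.utils import timezone
from sqlalchemy.dialects.postgresql import insert as pg_insert


def _upsert_state(session, key_columns: dict, value, default_retention_days: int):
    now = timezone.utcnow()
    # NULL means "keep forever"; otherwise updated_at and expires_at move together.
    expires_at = now + timedelta(days=default_retention_days) if default_retention_days > 0 else None
    stmt = (
        pg_insert(TaskStateModel)
        .values(**key_columns, value=value, updated_at=now, expires_at=expires_at)
        .on_conflict_do_update(
            index_elements=list(key_columns),
            set_={"value": value, "updated_at": now, "expires_at": expires_at},
        )
    )
    session.execute(stmt)
```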
One legitimate edge case you may be pointing at: if a user starts with `default_retention_days=0`, then later raises it to 30 days, those old NULL rows won't be picked up by the current `WHERE expires_at < now()` pass. We can add a second pass `WHERE expires_at IS NULL AND updated_at < now - default_retention_days` for that case. How does that sound?
What do you mean by a second pass? Where would this happen? (In the abstract it sounds like a plan; it’s similar to how the next run needs to be recalculated when you change the dag schedule definition.)
A second pass would be something like `WHERE expires_at IS NULL AND updated_at < now - default_retention_days`, to catch rows that were written when `default_retention_days=0` but the config was later raised.
Something like:

```python
# Pass 1: code right now
deleted_expired = _delete_batched(TaskStateModel.expires_at < now)

# Pass 2: rows with NULL expires_at that are stale under the current global default
if default_retention_days > 0:
    cutoff = now - timedelta(days=default_retention_days)
    deleted_stale = _delete_batched(
        TaskStateModel.expires_at.is_(None) & (TaskStateModel.updated_at < cutoff)
    )
```

It would run in the same `airflow state-store cleanup` command.
But on further thought, I don't think it's needed. `expires_at=NULL` is an explicit signal — either `default_retention_days=0` was set, or `retention_days=0` was passed at write time. Both mean "keep this row forever." Retroactively deleting them on a config change would violate what was promised at write time.
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
| f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}" | ||
| ) | ||
| else: | ||
| print("Custom backend configured — cannot preview rows.") |
Or should we make `_summary_dry_run` part of the base state backend? If it's not implemented, then we should show this message.
Not really, no. This command makes sense for MetadataStateBackend only; the CLI is mostly used for that cleanup.
closes: #66459
What?
Task state rows live as long as their parent DAG run. In deployments that don't run `airflow db cleanup` — or where task state should expire sooner than the DAG run — rows accumulate indefinitely. This PR adds an explicit retention mechanism independent of DAG run cleanup. To perform effective cleanup, the following is needed:
- `task_state` rows older than N days
- `asset_active` entry is deleted, but `asset_state` rows stay behind silently

Proposed change
- `expires_at` column on `task_state`: `updated_at` alone can't distinguish a 7 day key from a 30 day key. NULL means fall back to the global `default_retention_days`; set means delete after this timestamp regardless of `updated_at`. Setting `default_retention_days = 0` disables time-based cleanup entirely (`expires_at` cleanup still runs).
- `BaseStateBackend.cleanup()` no-op default — custom backends override this to implement their own retention policy. The backend reads `[state_store] default_retention_days` from config itself since the AIP says "the backend is responsible for enforcing the retention policy."
- `[state_store]`: `default_retention_days = 30` (task_state only — does not affect asset_state) and `clear_on_success = False`.
- `MetastoreStateBackend.cleanup()` runs two passes for task_state: rows past the `updated_at + default_retention_days` cutoff, and rows with `expires_at < now()` (see the sketch after this list).
- `airflow state-store cleanup` CLI command — calls `get_state_backend().cleanup()`. Operators schedule this via cron or a maintenance DAG. Supports `--dry-run`.
- `_update_asset_orphanage()` — runs in the same pass as asset deregistration, which is when the orphans are created. This is the right home since it is an internal consistency operation, not a user-facing data lifecycle decision.
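For orientation, a rough sketch of what those two passes could look like, reusing the `_delete_batched` helper from the diff above. The function body, the config access, and the restriction of the global-default pass to rows with NULL `expires_at` are assumptions here, not the PR's actual implementation:

```python
# Illustrative sketch only: approximates the two cleanup passes described above.
# TaskStateModel and _delete_batched are the PR's names; everything else is assumed.
from datetime import timedelta

from airflow.configuration import conf
from airflow.utils import timezone


def cleanup() -> int:
    """Delete expired task_state rows and return how many were removed."""
    now = timezone.utcnow()
    default_retention_days = conf.getint("state_store", "default_retention_days", fallback=30)

    # Pass 1: rows with an explicit expiry that has passed.
    total = _delete_batched(TaskStateModel.expires_at < now)

    # Pass 2: rows relying on the global default (expires_at is NULL) that are stale.
    if default_retention_days > 0:
        cutoff = now - timedelta(days=default_retention_days)
        total += _delete_batched(
            TaskStateModel.expires_at.is_(None) & (TaskStateModel.updated_at < cutoff)
        )
    return total
```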
Why a CLI command instead of the scheduler?
Running cleanup as a periodic scheduler task was considered, but it raises performance concerns for the scheduler, since cleanup does not come without a time cost.
A dedicated CLI keeps the separation clean; operators can schedule it wherever it makes sense for their deployment.
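As a concrete (hypothetical) example of the maintenance-DAG option, something along these lines could invoke the new command on a schedule. The DAG id, schedule, and import paths are illustrative and may differ by Airflow version:

```python
# Hypothetical maintenance DAG that runs the new CLI command once a day.
# Import paths shown are for Airflow 3; adjust for the version in use.
from datetime import datetime

from airflow.providers.standard.operators.bash import BashOperator
from airflow.sdk import DAG

with DAG(
    dag_id="state_store_cleanup",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="state_store_cleanup",
        bash_command="airflow state-store cleanup",
    )
```

A plain cron entry invoking the same command works just as well for deployments that prefer to keep maintenance outside Airflow.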
User implications / backcompat
New config options under `[state_store]` with safe defaults — no action needed to maintain existing behaviour. The `expires_at` column is nullable; existing rows get `NULL` (global default retention applies).
Testing
Test setup
Ran a DAG with a single task instance and pushed 3 task states for it.
Global Retention test
Run this query:
Run the state store cleanup:
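The screenshots for these steps are not reproduced here. For reference, a check along the following lines (the session helper usage and the model import are assumptions) counts how many `task_state` rows are past their `expires_at`, before and after the cleanup run:

```python
# Hypothetical verification snippet: count task_state rows and how many are past
# their expires_at. TaskStateModel is the PR's model; the import path is omitted.
from sqlalchemy import func, select

from airflow.utils import timezone
from airflow.utils.session import create_session

with create_session() as session:
    now = timezone.utcnow()
    total = session.scalar(select(func.count()).select_from(TaskStateModel))
    expired = session.scalar(
        select(func.count()).select_from(TaskStateModel).where(TaskStateModel.expires_at < now)
    )
    print(f"task_state rows: {total}, past expires_at: {expired}")
```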
What's next
- `clear_on_success` hook: Clear task state on TI success #66460
- `task_state.set(retention_days=N)` API to populate `expires_at` at write time: Add ability for Per task state key retention at operator level #66461

Was generative AI tooling used to co-author this PR?