
AIP-103: Adding periodic task state garbage collection and retention support#66463

Open
amoghrajesh wants to merge 9 commits into
apache:mainfrom
astronomer:aip-103-4-garbage-collection-and-cleanup

Conversation

@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented May 6, 2026

closes: #66459

What?

Task state rows live as long as their parent DAG run. In deployments that don't run airflow db cleanup — or where task state should expire sooner than the DAG run — rows accumulate indefinitely. This PR adds an explicit retention mechanism independent of DAG run cleanup. Effective cleanup needs the following:

  1. Time-based garbage collection: delete task_state rows older than N days
  2. Early expiry: a per-key override for short-lived keys like job IDs, which tasks can set per state row
  3. Asset state orphan cleanup: when an asset is removed from all DAGs, its asset_active entry is deleted, but its asset_state rows silently stay behind

Proposed change

  • expires_at column on task_state - updated_at alone can't distinguish a 7 day key from a 30 day key. NULL means fall back to the global default_retention_days; set means delete after this timestamp regardless of updated_at. Setting default_retention_days = 0 disables time-based cleanup entirely (expires_at cleanup still runs).
  • BaseStateBackend.cleanup() no-op default — custom backends override this to implement their own retention policy. The backend reads [state_store] default_retention_days from config itself since the AIP says "the backend is responsible for enforcing the retention policy."
  • New config options under [state_store]: default_retention_days = 30 (task_state only — does not affect asset_state) and clear_on_success = False.
  • MetastoreStateBackend.cleanup() runs two passes for task_state: rows past updated_at + default_retention_days cutoff, and rows with expires_at < now().
  • airflow state-store cleanup CLI command — calls get_state_backend().cleanup(). Operators schedule this via cron or a maintenance DAG. Supports --dry-run.
  • Asset state orphan cleanup moved into the scheduler's _update_asset_orphanage() — runs in the same pass as asset deregistration, which is when the orphans are created. This is the right home since it is an internal consistency operation, not a user-facing data lifecycle decision.
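The retention semantics in the bullets above can be sketched with a toy backend. This is illustrative only: InMemoryStateBackend and its row layout are invented here as a stand-in for the PR's MetastoreStateBackend; only the no-op BaseStateBackend.cleanup() default and the two-pass/`default_retention_days = 0` behaviour come from the description above.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


class BaseStateBackend:
    """Sketch of the contract: cleanup() is a no-op unless a backend overrides it."""

    def cleanup(self, dry_run: bool = False) -> int:
        return 0


class InMemoryStateBackend(BaseStateBackend):
    """Toy stand-in for the metastore backend's two cleanup passes."""

    def __init__(self, default_retention_days: int) -> None:
        self.default_retention_days = default_retention_days
        # key -> {"updated_at": datetime, "expires_at": datetime | None}
        self.rows: dict[str, dict] = {}

    def cleanup(self, dry_run: bool = False) -> int:
        now = datetime.now(timezone.utc)
        # Pass 1: rows whose explicit expires_at has passed (always runs).
        doomed = {k for k, r in self.rows.items()
                  if r["expires_at"] is not None and r["expires_at"] < now}
        # Pass 2: NULL expires_at falls back to the global default;
        # default_retention_days = 0 disables this pass entirely.
        if self.default_retention_days > 0:
            cutoff = now - timedelta(days=self.default_retention_days)
            doomed |= {k for k, r in self.rows.items()
                       if r["expires_at"] is None and r["updated_at"] < cutoff}
        if not dry_run:
            for k in doomed:
                del self.rows[k]
        return len(doomed)
```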

Why a CLI command instead of the scheduler?

Running cleanup as a scheduler periodic task was considered, but it raised performance concerns for the scheduler, since cleanup does not come without a time cost.

A dedicated CLI keeps the separation clean; operators can schedule it wherever it makes sense for their deployment.
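For example, a deployment might drive it from cron. The crontab line below is illustrative; only the airflow state-store cleanup command and its --dry-run flag come from this PR.

```
# Preview first, without deleting anything:
#   airflow state-store cleanup --dry-run
# Nightly cleanup at 03:15 (assumes `airflow` is on the cron user's PATH):
15 3 * * * airflow state-store cleanup
```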

User implications / backcompat

New config options under [state_store] with safe defaults — no action needed to maintain existing behaviour. The expires_at column is nullable; existing rows get NULL (global default retention applies).
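For reference, the new section in airflow.cfg would look like the fragment below (option names and defaults as listed in this PR; the comments are a paraphrase of the description above, not the PR's actual config template text).

```
[state_store]
# Retention for task_state rows whose expires_at is NULL; 0 disables
# time-based cleanup (expires_at-based cleanup still runs).
# Applies to task_state only, not asset_state.
default_retention_days = 30
clear_on_success = False
```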

Testing

Test setup

Ran a DAG with a single task instance and pushed 3 task states for it


Global Retention test

Run this query:

UPDATE task_state SET expires_at = '2026-04-06 00:00:00+00:00' WHERE key = 'job_id_2';

Run the state store cleanup:

[Breeze:3.10.20] root@c8ddefd92caa:/opt/airflow$ airflow state-store cleanup
2026-05-08T06:34:15.808202Z [info     ] setup plugin alembic.autogenerate.schemas [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.808303Z [info     ] setup plugin alembic.autogenerate.tables [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.808361Z [info     ] setup plugin alembic.autogenerate.types [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.808403Z [info     ] setup plugin alembic.autogenerate.constraints [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.808438Z [info     ] setup plugin alembic.autogenerate.defaults [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.808480Z [info     ] setup plugin alembic.autogenerate.comments [alembic.runtime.plugins] loc=plugins.py:37
2026-05-08T06:34:15.862323Z [info     ] Running state store cleanup    [airflow.cli.commands.state_store_command] loc=state_store_command.py:49
2026-05-08T06:34:16.100725Z [info     ] Deleted expired task_state rows [airflow.state.metastore] loc=metastore.py:304 rows_deleted=1

What's next


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@boring-cyborg boring-cyborg Bot added area:ConfigTemplates area:db-migrations PRs with DB migration area:Scheduler including HA (high availability) scheduler labels May 6, 2026
@amoghrajesh amoghrajesh self-assigned this May 6, 2026
@amoghrajesh amoghrajesh moved this from Backlog to In progress in AIP-103: Task State Management May 6, 2026
@amoghrajesh amoghrajesh added this to the Airflow 3.3.0 milestone May 6, 2026
Member

@jason810496 jason810496 left a comment


Would it be better to introduce batching / pagination for the task state garbage collection?
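For illustration, a batched delete loop along these lines keeps each transaction short and bounded. This is a sqlite3 stand-in with an invented single-column schema; the PR's real implementation runs against the Airflow metastore via SQLAlchemy.

```python
import sqlite3


def delete_batched(conn: sqlite3.Connection, cutoff: str, batch_size: int = 500) -> int:
    """Delete expired rows in bounded batches so each transaction stays short."""
    total = 0
    while True:
        # Pick at most batch_size victims, then delete exactly those keys.
        rows = conn.execute(
            "SELECT key FROM task_state WHERE expires_at < ? LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            break
        keys = [r[0] for r in rows]
        placeholders = ",".join("?" * len(keys))
        conn.execute(f"DELETE FROM task_state WHERE key IN ({placeholders})", keys)
        conn.commit()  # release locks between batches
        total += len(keys)
    return total
```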

@amoghrajesh amoghrajesh force-pushed the aip-103-4-garbage-collection-and-cleanup branch from 082d92d to 7dc826d Compare May 7, 2026 12:24
@amoghrajesh amoghrajesh force-pushed the aip-103-4-garbage-collection-and-cleanup branch from 7dc826d to b644ce6 Compare May 7, 2026 12:29
@amoghrajesh amoghrajesh requested review from Lee-W and ashb May 8, 2026 06:39
@amoghrajesh amoghrajesh requested a review from jason810496 May 8, 2026 06:39
@amoghrajesh amoghrajesh added the full tests needed We need to run full set of tests for this PR to merge label May 8, 2026
@amoghrajesh amoghrajesh closed this May 8, 2026
@github-project-automation github-project-automation Bot moved this from In progress to Done in AIP-103: Task State Management May 8, 2026
@amoghrajesh amoghrajesh reopened this May 8, 2026
break
return total

deleted = _delete_batched(TaskStateModel.expires_at < now)
Member


I wonder if it’d be a good idea if the actual expiration is calculated on the fly instead. If I’m understanding correctly, this currently relies on the expires_at column being correctly updated whenever updated_at is updated (if the former is not set explicitly). This seems a bit fragile.

Contributor Author


expires_at is set once at write time in every set() call and is never updated independently of the row — there's no dependency on updated_at being in sync with it. If you call set() again on the same key, the upsert recalculates and overwrites both updated_at and expires_at together atomically.

One legitimate edge case you may be pointing at: if a user starts with default_retention_days=0, then later raises it to 30 days, those old NULL rows won't be picked up by the current `WHERE expires_at < now()` pass. We can add a second pass, `WHERE expires_at IS NULL AND updated_at < now - default_retention_days`, for that case. How does that sound?

Member


What do you mean by a second pass? Where would this happen? (In the abstract it sounds like a plan; it’s similar to how the next run needs to be recalculated when you change the DAG schedule definition.)

Contributor Author


A second pass would be something like WHERE expires_at IS NULL AND updated_at < now - default_retention_days, to catch rows that were written when default_retention_days=0 but the config was later raised.

Something like:

# Pass 1: code right now
deleted_expired = _delete_batched(TaskStateModel.expires_at < now)

# Pass 2: rows with NULL expires_at that are stale under the current global default
if default_retention_days > 0:
    cutoff = now - timedelta(days=default_retention_days)
    deleted_stale = _delete_batched(
        TaskStateModel.expires_at.is_(None) & (TaskStateModel.updated_at < cutoff)
    )

It would run in the same airflow state-store cleanup command.

But on thinking more, I do not think that it is needed. expires_at=NULL is an explicit signal — either default_retention_days=0 was set, or retention_days=0 was passed at write time. Both mean "keep this row forever." Retroactively deleting them on a config change would violate what was promised at write time.

@amoghrajesh amoghrajesh force-pushed the aip-103-4-garbage-collection-and-cleanup branch from 28ea4fd to f52ce27 Compare May 11, 2026 08:22
@amoghrajesh amoghrajesh requested review from Lee-W and uranusjr May 12, 2026 06:28
f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}"
)
else:
print("Custom backend configured — cannot preview rows.")
Member


Or should we make `_summary_dry_run` part of the base state backend? If it's not implemented, then we should show this message.

Contributor Author


Not really, no: this command makes sense for MetastoreStateBackend only, and the CLI is mostly used for that cleanup.


Labels

area:ConfigTemplates area:db-migrations PRs with DB migration area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Add periodic task state GC and expires_at retention support

5 participants