Fix memory leak in LocalExecutor caused by unreleased file descriptor locks #65121

Open
wjddn279 wants to merge 4 commits into apache:main from wjddn279:fix-memory-leak-in-log_file_descriptor

Conversation

@wjddn279
Contributor

@wjddn279 wjddn279 commented Apr 13, 2026

Problem?

We conducted an investigation to resolve the continuous memory growth in the Local Executor's forked processes, following the previous memory spike fix.

Using memray to profile the Local Executor's forked processes, we confirmed that memory was steadily increasing. Specifically, logging-related memory was growing, and this was traced to the process of writing to local files.
memray-flamegraph-output-2026-04-11 13:17:18.516888.html

Cause?

In the flamegraph, the frames for full_path.touch(new_file_permissions) and conf.get("logging", "file_task_handler_new_file_permissions", fallback="0o664") showed that basic string objects were not being garbage collected, which led us to suspect unreleased references.

Upon examining self._lock = _get_lock_for_file(self._file) in the memory-increasing area, I found that file descriptors were being cached as locks in a dictionary with no corresponding release mechanism. This prevented the file descriptors from being properly released, and consequently, the parent Process objects also retained references and were never garbage collected.
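
The retention pattern described above can be reproduced in isolation. The following is a minimal sketch, not structlog's actual code: LOCK_CACHE, get_lock_for_file, and FakeLogFile are hypothetical stand-ins for a module-level lock cache keyed by file object. It shows that closing the file does not free it while the cache still holds it as a key, and that removing the key does.

```python
import gc
import io
import threading
import weakref

# Hypothetical stand-in for a module-level lock cache keyed by file object.
LOCK_CACHE: dict = {}


def get_lock_for_file(file):
    """Return the cached lock for `file`, creating and caching one if needed."""
    lock = LOCK_CACHE.get(file)
    if lock is None:
        lock = threading.Lock()
        LOCK_CACHE[file] = lock
    return lock


class FakeLogFile(io.StringIO):
    """In-memory stand-in for an opened task log file."""


def open_log_and_close():
    f = FakeLogFile()
    get_lock_for_file(f)   # the cache now holds a strong reference to f
    f.close()              # closing does not remove the cache entry
    return weakref.ref(f)  # lets us observe whether f is ever freed


ref = open_log_and_close()
gc.collect()
still_alive_before = ref() is not None  # the dict key keeps the object alive

# Explicitly drop the cache entry, as a release mechanism would.
LOCK_CACHE.pop(ref(), None)
gc.collect()
released_after = ref() is None
```

In a long-lived process that opens one log file per task, each entry left in such a cache also keeps alive whatever the file object references, which matches the steady growth seen in the profile.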

Solution

After adding code to manually release these references, we re-ran the profiling and confirmed that the related objects were being properly freed.
memray-flamegraph-output-2026-04-12 13:14:12.556546.html

PS

The remaining parsed = [sys.intern(str(x)) for x in rel.split(sep) if x and x != '.'] issue will be addressed in a separate PR.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)


@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch from 6f0717f to ad37cd0 Compare April 13, 2026 08:17
@eladkal eladkal added this to the Airflow 3.2.1 milestone Apr 13, 2026
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Apr 13, 2026
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
@potiuk potiuk marked this pull request as draft April 23, 2026 00:43
@potiuk

This comment was marked as outdated.

@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch from ad37cd0 to 5a533f4 Compare April 23, 2026 04:18
@wjddn279 wjddn279 marked this pull request as ready for review April 23, 2026 04:26
@potiuk
Member

potiuk commented Apr 27, 2026

@wjddn279, there is 1 unresolved review thread on this PR from @eladkal. Could you either push a fix or reply in the thread explaining why the feedback doesn't apply? Once you believe the feedback is addressed, mark the thread as resolved so the reviewer isn't re-pinged needlessly. Thanks!


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch 2 times, most recently from cb82a2e to 2215c99 Compare April 30, 2026 06:23
@wjddn279 wjddn279 requested a review from potiuk as a code owner April 30, 2026 06:23
@ashb ashb changed the title Fix memory leak in forked worker caused by unreleased file descriptor locks Fix memory leak in LocalExecutor caused by unreleased file descriptor locks Apr 30, 2026
@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Apr 30, 2026
@potiuk
Member

potiuk commented Apr 30, 2026

@wjddn279 Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.

This behaviour was wrong and it's been fixed in #65916

Sorry for the noise @wjddn279 !

@uranusjr
Member

Weird we’d need to do this. Can this be considered a bug in structlog?

@wjddn279
Contributor Author

wjddn279 commented May 1, 2026

I'm currently checking with the structlog maintainers, but I'm not sure whether I'll get a response

@eladkal
Contributor

eladkal commented May 1, 2026

I'm currently checking with the structlog maintainers, but I'm not sure whether I'll get a response

Can you share the issue opened reporting the bug?

@wjddn279
Contributor Author

wjddn279 commented May 1, 2026

@eladkal

Not an issue, but a PR thread I've opened:

hynek/structlog#806 (comment)

@ashb
Member

ashb commented May 1, 2026

I think I've got a better fix hynek/structlog#807

@wjddn279
Contributor Author

wjddn279 commented May 3, 2026

@ashb

The PR has been merged! I checked and the release schedule including that version doesn't seem to be confirmed yet. How about we use this PR for now and remove it later when we bump the version after the release?

@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch 2 times, most recently from 0348613 to 30aaf78 Compare May 6, 2026 09:06
@eladkal
Contributor

eladkal commented May 6, 2026

The PR has been merged! I checked and the release schedule including that version doesn't seem to be confirmed yet. How about we use this PR for now and remove it later when we bump the version after the release?

Generally speaking, we can backport/vendor the fix, right? And in parallel have a draft PR that removes the backport and updates the library version, which we will merge when it's released.
But let's see what @ashb thinks about that.

@ashb
Member

ashb commented May 6, 2026

Hynek is often prompt about releasing new versions, but if we don't want to wait, then backporting this is a one-line change. As long as we apply it "safely" (so that it doesn't crash if something goes wrong), sure.

@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch 2 times, most recently from b27db45 to 0a08921 Compare May 7, 2026 02:43
@wjddn279
Contributor Author

@eladkal @ashb

Sounds good. It would be great to get this into 3.2.2. Once that's in, I'll open a draft PR right away to remove this workaround and bump the upper bound on structlog.

@potiuk potiuk added the backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch label May 12, 2026
Member

@potiuk potiuk left a comment

LGTM with one minor docstring nit. The fix is a targeted workaround for a real production memory leak that the author traced with memray, and the prior round of review asks from @kaxil are both addressed in shared/logging/src/airflow_shared/logging/structlog.py:806-812:

  • WRITE_LOCKS.pop(log_file_descriptor, None) — no KeyError shadowing exceptions in the finally clause that calls it.
  • try: from structlog._output import WRITE_LOCKS … except ImportError: WRITE_LOCKS = None — degrades gracefully if a future structlog version moves or removes the internal.
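
The two properties listed above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: release_write_lock and close_log_file are hypothetical names, and the locks parameter exists only so the behaviour can be exercised without structlog installed. The import target structlog._output.WRITE_LOCKS is the private attribute named in the review.

```python
import io
import threading

try:
    # Private structlog attribute named in the review; it may move or vanish.
    from structlog._output import WRITE_LOCKS
except ImportError:
    WRITE_LOCKS = None  # degrade gracefully: cleanup becomes a no-op


def release_write_lock(file_obj, locks=WRITE_LOCKS):
    """Hypothetical helper: drop the cached lock entry for file_obj, if any."""
    if isinstance(locks, dict):
        # pop with a default never raises KeyError, so it cannot mask
        # an exception already propagating through a finally clause.
        locks.pop(file_obj, None)


def close_log_file(file_obj, locks=WRITE_LOCKS):
    """Hypothetical call site: close the file, then always release its lock."""
    try:
        file_obj.close()
    finally:
        release_write_lock(file_obj, locks)


# Usage with a stand-in cache so the example does not depend on structlog.
fake_locks = {}
f = io.StringIO()
fake_locks[f] = threading.Lock()
close_log_file(f, fake_locks)
close_log_file(f, fake_locks)  # second call is harmless: nothing left to pop
```

Passing None as the cache (the fallback value when the import fails) makes release_write_lock do nothing, which is the graceful degradation the review describes.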

Both supervisor sites that close() a log_file_descriptor (task-sdk/.../supervisor.py:2262 and task-sdk/.../callback_supervisor.py:381) now call the helper. I grepped for any third caller and didn't find one — coverage looks complete.

@ashb's earlier concern about "papering over the problem" is valid as a long-term position, but the author's reply on 2026-04-23 makes the situation clear: every Python-side close/delete approach tried leaves the reference held inside structlog's global dict, and the upstream fix (hynek/structlog#806) hasn't merged. The memray flamegraph in the PR description shows the leak is real and accumulates across LocalExecutor-forked processes. Workaround-with-graceful-fallback feels like the right call.

Smaller observation

  • shared/logging/src/airflow_shared/logging/structlog.py:806-812: clear_structlog_shared_lock has no docstring. Given that the function exists specifically to touch a private upstream attribute, the why belongs in the source rather than only the PR body. A two-line docstring noting (a) structlog caches file→lock without release, and (b) tracking issue https://github.com/hynek/structlog/pull/806, would help the next maintainer who finds this in a year. Not blocking the merge, but worth tacking on.

    Optional follow-up: move the try: from structlog._output import WRITE_LOCKS out of the function body to module scope so it's not re-attempted on every call. Pure micro-perf and module-load-order hygiene, not a defect.

Approving — thanks for the long debugging session that produced this.


This review was drafted by an AI-assisted tool and confirmed by an Apache Airflow maintainer. The maintainer approving this PR has read the findings and signed off. If something feels off, please reply on the PR and a maintainer will follow up.

More on how Apache Airflow handles maintainer review: contributing-docs/05_pull_requests.rst.

@wjddn279 wjddn279 force-pushed the fix-memory-leak-in-log_file_descriptor branch from 0a08921 to a16f701 Compare May 12, 2026 01:08
except ImportError:
    WRITE_LOCKS = None  # type: ignore[assignment]

if WRITE_LOCKS is not None and isinstance(WRITE_LOCKS, dict):
Contributor Author

In structlog 26.1.0, WRITE_LOCKS was changed to a weakref.WeakKeyDictionary, which removes keys automatically. To avoid interfering with that behavior, this guard only deletes the entry directly when WRITE_LOCKS is still a regular dict.
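
The difference between the two container types can be seen in a small standalone sketch. Handle and drop_entry_if_plain_dict are hypothetical names used only for illustration. A weakref.WeakKeyDictionary is not a dict subclass, so the isinstance check skips it, and its entry disappears on its own once the key object is no longer referenced.

```python
import gc
import threading
import weakref


class Handle:
    """Hypothetical stand-in for a log file object (weak-referenceable)."""


def drop_entry_if_plain_dict(locks, key):
    """Remove key only when locks is a regular dict; report whether we acted."""
    if isinstance(locks, dict):
        locks.pop(key, None)
        return True
    return False


# Weak mapping: the guard leaves it alone and the entry cleans itself up.
weak_locks = weakref.WeakKeyDictionary()
h = Handle()
weak_locks[h] = threading.Lock()
skipped = not drop_entry_if_plain_dict(weak_locks, h)
size_while_alive = len(weak_locks)
del h
gc.collect()
size_after_gc = len(weak_locks)

# Plain dict: the entry must be removed explicitly.
plain_locks = {}
h2 = Handle()
plain_locks[h2] = threading.Lock()
removed = drop_entry_if_plain_dict(plain_locks, h2)
```

With this guard the same cleanup call is safe on both sides of the upstream change: it acts on the older plain dict and stays out of the way of the weak mapping.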

Labels

area:task-sdk backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch ready for maintainer review Set after triaging when all criteria pass. type:bug-fix Changelog: Bug Fixes


7 participants