Fix memory leak in LocalExecutor caused by unreleased file descriptor locks#65121
Fix memory leak in LocalExecutor caused by unreleased file descriptor locks#65121wjddn279 wants to merge 4 commits into
Conversation
6f0717f to
ad37cd0
Compare
This comment was marked as outdated.
This comment was marked as outdated.
ad37cd0 to
5a533f4
Compare
|
@wjddn279 — There are 1 unresolved review thread on this PR from @eladkal. Could you either push a fix or reply in each thread explaining why the feedback doesn't apply? Once you believe the feedback is addressed, mark the thread as resolved so the reviewer isn't re-pinged needlessly. Thanks! Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
cb82a2e to
2215c99
Compare
This behaviour was wrong and it's been fixed in #65916 Sorry for the noise @wjddn279 ! |
|
Weird we’d need to do this. Can this be considered a bug in structlog? |
|
I'm currently checking with the structlog maintainers, but I'm not sure whether I'll get a response |
Can you share the issue opened reporting the bug? |
|
not issue but pr thread i've opended |
|
I think I've got a better fix hynek/structlog#807 |
|
The PR has been merged! I checked and the release schedule including that version doesn't seem to be confirmed yet. How about we use this PR for now and remove it later when we bump the version after the release? |
0348613 to
30aaf78
Compare
Generally speaking, we can backport/vendor the fix right? and in parallal have a draft PR that removes the backport with updating the libary version that we will merge when it's released. |
|
Hynek is often prompt about releasing new versions, but if we don't want to wait, then backporting this is a one line change -- as long as we apply it "safely" (so the it doesn't crash if something goes wrong, sure. |
b27db45 to
0a08921
Compare
potiuk
left a comment
There was a problem hiding this comment.
LGTM with one minor docstring nit. The fix is a targeted workaround for a real production memory leak that the author traced with memray, and the prior round of review asks from @kaxil are both addressed in shared/logging/src/airflow_shared/logging/structlog.py:806-812:
WRITE_LOCKS.pop(log_file_descriptor, None)— noKeyErrorshadowing exceptions in thefinallyclause that calls it.try: from structlog._output import WRITE_LOCKS … except ImportError: WRITE_LOCKS = None— degrades gracefully if a future structlog version moves or removes the internal.
Both supervisor sites that close() a log_file_descriptor (task-sdk/.../supervisor.py:2262 and task-sdk/.../callback_supervisor.py:381) now call the helper. I grepped for any third caller and didn't find one — coverage looks complete.
@ashb's earlier concern about "papering over the problem" is valid as a long-term position, but the author's reply on 2026-04-23 makes the situation clear: every Python-side close/delete approach tried leaves the reference held inside structlog's global dict, and the upstream fix (hynek/structlog#806) hasn't merged. The memray flamegraph in the PR description shows the leak is real and accumulates across LocalExecutor-forked processes. Workaround-with-graceful-fallback feels like the right call.
Smaller observation
-
shared/logging/src/airflow_shared/logging/structlog.py:806-812—clear_structlog_shared_lockhas no docstring. Given that the function exists specifically to touch a private upstream attribute, the why belongs in the source rather than only the PR body. A two-line docstring noting (a) structlog caches file→lock without release, and (b) tracking issuehttps://github.com/hynek/structlog/pull/806, would help the next maintainer who finds this in a year. Not blocking the merge, but worth tacking on.Optional follow-up: move the
try: from structlog._output import WRITE_LOCKSout of the function body to module scope so it's not re-attempted on every call. Pure micro-perf and module-load-order hygiene, not a defect.
Approving — thanks for the long debugging session that produced this.
This review was drafted by an AI-assisted tool and confirmed by an Apache Airflow maintainer. The maintainer approving this PR has read the findings and signed off. If something feels off, please reply on the PR and a maintainer will follow up.
More on how Apache Airflow handles maintainer review: contributing-docs/05_pull_requests.rst.
0a08921 to
a16f701
Compare
| except ImportError: | ||
| WRITE_LOCKS = None # type: ignore[assignment] | ||
|
|
||
| if WRITE_LOCKS is not None and isinstance(WRITE_LOCKS, dict): |
There was a problem hiding this comment.
In structlog 26.1.0, WRITE_LOCKS was changed to a weakref.WeakKeyDictionary, which removes keys automatically. To avoid interfering with that behavior, this guard only deletes the entry directly when WRITE_LOCKS is still a regular dict.
Problem?
We conducted an investigation to resolve the continuous memory growth in the Local Executor's forked processes, following the previous memory spike fix.
Using memray to profile the Local Executor's forked processes, we confirmed that memory was steadily increasing. Specifically, logging-related memory was growing, and this was traced to the process of writing to local files.
memray-flamegraph-output-2026-04-11 13:17:18.516888.html
Cause?
Looking at,
full_path.touch(new_file_permissions)conf.get("logging", "file_task_handler_new_file_permissions", fallback="0o664")in flamegraph, I noticed that basic string objects were not being properly garbage collected, which led us to suspect unreleased references.Upon examining
self._lock = _get_lock_for_file(self._file)in the memory-increasing area, I found that file descriptors were being cached as locks in a dictionary with no corresponding release mechanism. This prevented the file descriptors from being properly released, and consequently, the parent Process objects also retained references and were never garbage collected.Solution
After adding code to manually release these references, we re-ran the profiling and confirmed that the related objects were being properly freed.
memray-flamegraph-output-2026-04-12 13:14:12.556546.html
PS
The remained
parsed = [sys.intern(str(x)) for x in rel.split(sep) if x and x != '.']issue will be addressed in a separate PR.Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.