fix serialize_template_field handling callable value in dict#63871
fix serialize_template_field handling callable value in dict#63871wjddn279 wants to merge 3 commits into
Conversation
4a10da9 to
6f6156f
Compare
There was a problem hiding this comment.
Pull request overview
Fixes non-deterministic DAG serialization when templated dict fields contain callable values (e.g., lambdas), preventing serialized output from changing on every run due to function object addresses.
Changes:
- Sanitizes callable values during recursive dict sorting in
serialize_template_field. - Refactors callable stringification into a helper for consistent formatting.
- Adds unit tests intended to validate stable serialization when dicts contain callables.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| airflow-core/src/airflow/serialization/helpers.py | Makes callable values in dicts serialize to a stable string during recursive sorting. |
| airflow-core/tests/unit/serialization/test_helpers.py | Adds regression tests around dicts containing callables to prevent unstable serialization. |
7c4b179 to
8eb05f8
Compare
e339121 to
6771c87
Compare
6771c87 to
aebce69
Compare
ephraimbuddy
left a comment
There was a problem hiding this comment.
Thanks — root-cause analysis is right and the <callable …> token is the correct deterministic fix. One blocker, plus a few nits.
Blocker: sorted(obj.items()) regresses mixed-key dicts.
The new dict branch in not is_jsonable(...) calls sort_and_make_static_dict_recursively, which does sorted(obj.items()) at helpers.py:73. A non-jsonable dict with mixed key types like {1: "a", "b": lambda: None} previously fell through to str(template_field) and serialized fine; now it raises TypeError: '<' not supported between instances of 'str' and 'int', which propagates (the try only catches AttributeError) and breaks DAG serialization.
>>> serialize_template_field({1: "a", "b": lambda: None}, "op_kwargs")
TypeError: '<' not supported between instances of 'str' and 'int'The pre-existing sort_dict_recursively (now line 107) has the same bug today for jsonable mixed-key dicts — {1: "a", "b": "c"} already raises on main. One fix can cover both call sites:
sorted(obj.items(), key=lambda kv: (type(kv[0]).__name__, repr(kv[0])))This keeps the determinism the sort_* helpers exist for, without crashing on heterogeneous keys. A try/except TypeError fallback to unsorted iteration would also work but loses the consistency guarantee. Please add a regression test covering the mixed-key + callable case.
4d13583 to
6b6ecbd
Compare
amoghrajesh
left a comment
There was a problem hiding this comment.
Sorry for the delayed review, some comments.
|
I've written logic that handles those issues all at once. (with #65705). Could you take another look? |
|
ping @ephraimbuddy ! |
This now regressed well from earlier. |
5640041 to
6f46715
Compare
| pytest.param({}, {}, id="empty_dict"), | ||
| pytest.param((), [], id="empty_tuple"), | ||
| pytest.param(set(), "set()", id="empty_set"), | ||
| pytest.param(set(), [], id="empty_set"), |
There was a problem hiding this comment.
Update tests to reflect logic change: set and frozenset are now converted to list instead of being string-cast.
related https://github.com/apache/airflow/pull/63871/changes#r3212530026
| pytest.param({"foo": "bar"}, {"foo": "bar"}, id="dict"), | ||
| pytest.param(("foo", "bar"), ["foo", "bar"], id="tuple"), | ||
| pytest.param({"foo"}, "{'foo'}", id="set"), | ||
| pytest.param({"foo"}, ["foo"], id="set"), |
There was a problem hiding this comment.
| pytest.param( | ||
| {"my_tup": (1, 2), "my_set": {1, 2, 3}}, | ||
| {"my_tup": [1, 2], "my_set": "{1, 2, 3}"}, | ||
| {"my_tup": [1, 2], "my_set": [1, 2, 3]}, |
There was a problem hiding this comment.
| if isinstance(obj, (list, tuple)): | ||
| return [serialize_object(item) for item in obj] | ||
|
|
||
| if isinstance(obj, (set, frozenset)): |
There was a problem hiding this comment.
Previously set/frozenset values were converted via str(), which leaked memory addresses for custom objects and destabilized the DAG hash on every parse. The new logic recursively serializes each element and returns a sorted list, producing deterministic, JSON-encodable output.
ephraimbuddy
left a comment
There was a problem hiding this comment.
We also need to align this with the same function in task runner
6f46715 to
1c08c9c
Compare
1c08c9c to
8a4112c
Compare
| # Serialize keys/values first so each key is a string and the output is hash-stable, | ||
| # then sort by the serialized key to prevents hash inconsistencies when dict ordering varies. | ||
| serialized_pairs = [(normalize_dict_key(k), serialize_object(v)) for k, v in obj.items()] | ||
| return dict(sorted(serialized_pairs, key=lambda kv: kv[0])) |
There was a problem hiding this comment.
Since keys are now fixed as strings by normalize_dict_key, the logic that sorted by key type alongside the key value has been removed. Sorting is now performed solely by the key value.
closed: #63334, #65705, #65674
Cause
The serialization result of the DAG is as follows.
When serializing a dict containing lambda function definitions, the object's id value is reflected as-is, causing version changes on every serialization. The root cause is that when a dict contains callable objects, is_jsonable returns False, and since the dict itself is not callable, it falls through to that code path.
Solution
I added logic to filter out this case and convert callable objects consistently within the existing dict handling (sorting) logic. I have confirmed locally that this resolves the issue.
Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.