[Fix](pyudf) Convert nested map value correctly by linrrzqqq · Pull Request #63907 · apache/doris

linrrzqqq · 2026-05-29T19:40:46Z

Problem Summary:

Fix Python UDF nested complex type conversion when MAP appears inside ARRAY, STRUCT, or vectorized inputs.

Previously, Python UDF argument conversion relied mostly on PyArrow's default conversions (Scalar.as_py(), Array.to_pylist(), Array.to_pandas()). Those APIs convert a top-level Arrow MAP into Python-friendly values in some paths, but they expose nested MAP values as lists of (key, value) tuples. For example, ARRAY<MAP<STRING, INT>> could arrive in Python as [[('a', 1)]] instead of [{'a': 1}], so user UDF code saw nested maps as lists rather than dicts.

This PR introduces a recursive Arrow-value conversion helper and applies it consistently across Python UDF argument conversion paths. The helper manually reconstructs Python values according to the Arrow type:

MAP -> dict
LIST / LARGE_LIST -> list
STRUCT -> dict
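
The mapping above can be sketched roughly as follows. This is not the PR's actual helper: the tuple-based type descriptions (("map", key_type, item_type), ("list", element_type), ("struct", {name: type})) and the name convert_value are hypothetical stand-ins for pyarrow.DataType inspection, chosen so the sketch runs without PyArrow installed. It assumes nested MAP values arrive as lists of (key, value) tuples, as described above.

```python
# Hypothetical type descriptions used in place of pyarrow.DataType:
#   ("map", key_type, item_type), ("list", element_type),
#   ("struct", {field_name: field_type}), or ("int",) / ("string",) for primitives.

def convert_value(value, typ):
    """Recursively rebuild a Python value so that every MAP becomes a dict."""
    if value is None:
        return None
    kind = typ[0]
    if kind == "map":
        _, key_type, item_type = typ
        # A MAP may already be a dict, or a list of (key, value) tuples.
        pairs = value.items() if isinstance(value, dict) else value
        return {
            convert_value(k, key_type): convert_value(v, item_type)
            for k, v in pairs
        }
    if kind == "list":
        return [convert_value(v, typ[1]) for v in value]
    if kind == "struct":
        return {
            name: convert_value(value.get(name), field_type)
            for name, field_type in typ[1].items()
        }
    # Primitive values are returned unchanged.
    return value


# ARRAY<MAP<STRING, ARRAY<INT>>>, matching the example in this description.
nested_type = ("list", ("map", ("string",), ("list", ("int",))))
raw = [[("a", [1, 2]), ("b", [3])], [("c", [4, 5, 6])]]
print(convert_value(raw, nested_type))
```

Carrying the type alongside the value is what lets the sketch distinguish a MAP (list of pairs) from an ordinary ARRAY whose elements happen to be two-element tuples.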

Before:

CREATE FUNCTION py_deep_nested_debug(ARRAY<MAP<STRING, ARRAY<INT>>>)
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
def evaluate(arr):
    if arr is None:
        return 'None'
    return 'outer_type={}, outer_repr={}'.format(type(arr).__name__, repr(arr))
$$;

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])             |
+-------------------------------------------------------------------------------+
| outer_type=list, outer_repr=[[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]] |
+-------------------------------------------------------------------------------+

Now:

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])       |
+-------------------------------------------------------------------------+
| outer_type=list, outer_repr=[{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}] |
+-------------------------------------------------------------------------+

hello-stephen · 2026-05-29T19:40:52Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

linrrzqqq · 2026-05-29T19:51:02Z

/review

github-actions

Summary opinion: Request changes. The PR fixes the intended nested MAP representation problem, but the current implementation also adds per-row Python conversion to every vectorized UDF argument, including primitive pandas/list arguments that do not need normalization, and the new regression test has deterministic-output/debuggability issues that should be fixed before merge.

Critical checkpoint conclusions:

Goal/test: The goal is clear and the new test covers scalar, vector-list, vector-series, and mixed vector/scalar nested MAP cases, but one expected-result query is not ordered.
Scope/focus: The feature change is small, but the conversion is applied more broadly than necessary and affects all vectorized UDF inputs.
Concurrency/lifecycle/config/compatibility/persistence/data writes: No new concurrency, lifecycle management, configs, storage format, persistence, or transaction behavior found in this PR.
Parallel paths: The scalar and vectorized Python UDF argument paths were both updated, but the vectorized primitive path now pays the nested conversion cost unnecessarily.
Error handling/observability: Existing exception propagation/logging remains unchanged; no new observability requirement found.
Test coverage/results: Coverage is relevant, but the new regression test violates Doris ordering and cleanup standards.
User focus: No additional user-provided review focus was specified.
Local verification: I inspected the diff and review context; I could not run a PyArrow reproduction because pyarrow is not installed in this runner environment.

github-actions · 2026-05-29T19:57:16Z

        elif vec_type == VectorType.PANDAS_SERIES:
-            return arrow_array.to_pandas()
+            return arrow_array.to_pandas().apply(
+                lambda value: convert_arrow_value_to_python(value, arrow_array.type)


This now runs a Python-level recursive conversion for every element of every vectorized UDF argument, even when arrow_array.type is a primitive type that cannot contain a nested MAP. For pd.Series this replaces the previous arrow_array.to_pandas() fast path with .apply(lambda ...) over the whole column, so existing vectorized Python UDFs on primitive columns regress even though they do not need this fix. Please gate the recursive conversion to Arrow types that can actually contain nested values needing normalization, and keep the old direct to_pandas()/to_pylist() path for primitive/non-nested inputs.
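
One possible shape for the gating requested above, as a minimal sketch rather than the PR's code: it reuses hypothetical tuple-based type descriptions in place of pyarrow.DataType, and the names contains_map and convert_column are illustrative. With real Arrow types the equivalent checks would presumably use the pyarrow.types predicates (is_map, is_list, is_large_list, is_struct).

```python
# Hypothetical type descriptions standing in for pyarrow.DataType:
#   ("map", key_type, item_type), ("list", element_type),
#   ("struct", {field_name: field_type}), or ("int",) / ("string",) for primitives.

def contains_map(typ):
    """Return True if the type is a MAP or has a MAP anywhere inside it."""
    kind = typ[0]
    if kind == "map":
        return True
    if kind == "list":
        return contains_map(typ[1])
    if kind == "struct":
        return any(contains_map(field_type) for field_type in typ[1].values())
    return False


def convert_column(values, typ, convert):
    """Run the per-element conversion only for types that can hold a MAP."""
    if not contains_map(typ):
        # Fast path: primitive and map-free columns are passed through untouched.
        return values
    return [convert(value, typ) for value in values]


calls = []

def counting_convert(value, typ):
    # Stand-in for the recursive helper; records each invocation.
    calls.append(value)
    return value

ints = [1, 2, 3]
same = convert_column(ints, ("int",), counting_convert)
maps = convert_column([[("a", 1)]], ("map", ("string",), ("int",)), counting_convert)
print(same is ints, len(calls))
```

The type check runs once per column rather than once per row, so primitive inputs keep their original cost while nested inputs still get normalized.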

github-actions · 2026-05-29T19:57:16Z

+        """
+
+        qt_scalar_constant_nested_complex """
+            SELECT py_nested_complex_scalar(


This UNION ALL query has no ORDER BY and uses qt_ rather than order_qt_, so the three expected rows are not guaranteed to be returned in the order recorded in the .out file. Doris regression tests require deterministic output; please either use order_qt_scalar_constant_nested_complex or add an explicit ordering column around the union.

github-actions · 2026-05-29T19:57:16Z

+        """
+    } finally {
+        try_sql("DROP TABLE IF EXISTS test_pythonudf_nested_complex_type;")
+    }


The regression-test standard for Doris is to drop tables before use, not after, so the environment remains available for debugging when a test fails. This test already drops the table at the beginning; please remove the finally cleanup that drops it again after execution.

[Fix](pyudf) Convert nested map value correctly

641b6ff

github-actions Bot requested changes May 29, 2026

linrrzqqq wants to merge 1 commit into apache:master from linrrzqqq:pyudf-nested-map
