
[Fix](pyudf) Convert nested map value correctly #63907

Open
linrrzqqq wants to merge 1 commit into apache:master from linrrzqqq:pyudf-nested-map


@linrrzqqq
Collaborator

Problem Summary:

Fix Python UDF nested complex type conversion when MAP appears inside ARRAY, STRUCT, or vectorized inputs.

Previously, Python UDF argument conversion mostly relied on PyArrow's default conversions (Scalar.as_py(), Array.to_pylist(), Array.to_pandas()). Those APIs convert a top-level Arrow MAP into Python-friendly values in some paths, but they expose nested MAP values as lists of (key, value) tuples. For example, ARRAY<MAP<STRING, INT>> could arrive in Python as [[('a', 1)]] instead of [{'a': 1}], so user UDF code saw nested maps as lists instead of dicts.
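To make the user-facing impact concrete, here is a minimal sketch using only the two value shapes quoted above. The `lookup` function is a hypothetical UDF body, not code from this PR.

```python
# Value shapes quoted in the description for ARRAY<MAP<STRING, INT>>.
as_received = [[('a', 1)]]   # nested MAP exposed as a list of (key, value) tuples
as_expected = [{'a': 1}]     # shape a UDF author would normally expect


def lookup(rows, key):
    """Hypothetical UDF body: read `key` from the first map in the array."""
    return rows[0][key]


try:
    lookup(as_received, 'a')
    failed = False
except TypeError:
    # A list cannot be indexed with a string key, so dict-style access breaks.
    failed = True

value = lookup(as_expected, 'a')
```

With the list-of-tuples shape the dict-style access raises `TypeError`; with the dict shape it returns the stored value.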

This PR introduces a recursive Arrow-value conversion helper and applies it consistently across Python UDF argument conversion paths. The helper manually reconstructs Python values according to the Arrow type:

  • MAP -> dict
  • LIST / LARGE_LIST -> list
  • STRUCT -> dict
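The mapping above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the PR's implementation: the `Primitive`, `ListOf`, `MapOf`, and `StructOf` classes are hypothetical stand-ins for Arrow data types so the example runs without PyArrow, and `convert_value` is a made-up name (the PR's helper is `convert_arrow_value_to_python`, whose body is not shown here).

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins for Arrow data types (NOT the pyarrow API).
@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class ListOf:
    item: Any


@dataclass(frozen=True)
class MapOf:
    key: Any
    value: Any


@dataclass(frozen=True)
class StructOf:
    fields: tuple  # tuple of (field_name, field_type) pairs


def convert_value(value, dtype):
    """Rebuild a Python value by walking the type tree alongside the data."""
    if value is None:
        return None
    if isinstance(dtype, MapOf):
        # Assumed input shape: a list of (key, value) tuples.
        return {
            convert_value(k, dtype.key): convert_value(v, dtype.value)
            for k, v in value
        }
    if isinstance(dtype, ListOf):
        return [convert_value(v, dtype.item) for v in value]
    if isinstance(dtype, StructOf):
        # Assumed input shape: a dict keyed by field name.
        return {name: convert_value(value[name], ftype) for name, ftype in dtype.fields}
    return value  # primitives pass through unchanged


# ARRAY<MAP<STRING, ARRAY<INT>>>, as in the example that follows.
dtype = ListOf(MapOf(Primitive("string"), ListOf(Primitive("int"))))
raw = [[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]]
converted = convert_value(raw, dtype)
```

Driving the recursion from the declared type rather than from the runtime value matters because a list of 2-tuples is ambiguous on its own: it could be a MAP's entries or a genuine array of pairs.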

Before:

CREATE FUNCTION py_deep_nested_debug(ARRAY<MAP<STRING, ARRAY<INT>>>)
RETURNS STRING
PROPERTIES (
    "type" = "PYTHON_UDF",
    "symbol" = "evaluate",
    "runtime_version" = "3.12.11",
    "always_nullable" = "true"
)
AS $$
def evaluate(arr):
    if arr is None:
        return 'None'
    return 'outer_type={}, outer_repr={}'.format(type(arr).__name__, repr(arr))
$$;

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])             |
+-------------------------------------------------------------------------------+
| outer_type=list, outer_repr=[[('a', [1, 2]), ('b', [3])], [('c', [4, 5, 6])]] |
+-------------------------------------------------------------------------------+

Now:

SELECT py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}]);
+-------------------------------------------------------------------------+
| py_deep_nested_debug([{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}])       |
+-------------------------------------------------------------------------+
| outer_type=list, outer_repr=[{'a': [1, 2], 'b': [3]}, {'c': [4, 5, 6]}] |
+-------------------------------------------------------------------------+

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions (bot) left a comment


Summary opinion: Request changes. The PR fixes the intended nested MAP representation problem, but the current implementation also adds per-row Python conversion to every vectorized UDF argument, including primitive pandas/list arguments that do not need normalization, and the new regression test has deterministic-output/debuggability issues that should be fixed before merge.

Critical checkpoint conclusions:

  • Goal/test: The goal is clear and the new test covers scalar, vector-list, vector-series, and mixed vector/scalar nested MAP cases, but one expected-result query is not ordered.
  • Scope/focus: The feature change is small, but the conversion is applied more broadly than necessary and affects all vectorized UDF inputs.
  • Concurrency/lifecycle/config/compatibility/persistence/data writes: No new concurrency, lifecycle management, configs, storage format, persistence, or transaction behavior found in this PR.
  • Parallel paths: The scalar and vectorized Python UDF argument paths were both updated, but the vectorized primitive path now pays the nested conversion cost unnecessarily.
  • Error handling/observability: Existing exception propagation/logging remains unchanged; no new observability requirement found.
  • Test coverage/results: Coverage is relevant, but the new regression test violates Doris ordering and cleanup standards.
  • User focus: No additional user-provided review focus was specified.
  • Local verification: I inspected the diff and review context; I could not run a PyArrow reproduction because pyarrow is not installed in this runner environment.

  elif vec_type == VectorType.PANDAS_SERIES:
-     return arrow_array.to_pandas()
+     return arrow_array.to_pandas().apply(
+         lambda value: convert_arrow_value_to_python(value, arrow_array.type)

This now runs a Python-level recursive conversion for every element of every vectorized UDF argument, even when arrow_array.type is a primitive type that cannot contain a nested MAP. For pd.Series this replaces the previous arrow_array.to_pandas() fast path with .apply(lambda ...) over the whole column, so existing vectorized Python UDFs on primitive columns regress even though they do not need this fix. Please gate the recursive conversion to Arrow types that can actually contain nested values needing normalization, and keep the old direct to_pandas()/to_pylist() path for primitive/non-nested inputs.
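One way to express the requested gate is a type-level predicate that is checked once per column, before any per-element work. The sketch below is an assumption-laden illustration: the type classes are hypothetical stand-ins for Arrow types (not the pyarrow API), and `contains_map` and `convert_column` are made-up names rather than functions from this PR.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins for Arrow data types (NOT the pyarrow API).
@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class ListOf:
    item: Any


@dataclass(frozen=True)
class MapOf:
    key: Any
    value: Any


@dataclass(frozen=True)
class StructOf:
    fields: tuple  # tuple of (field_name, field_type) pairs


def contains_map(dtype) -> bool:
    """Return True if a MAP appears anywhere in the type tree."""
    if isinstance(dtype, MapOf):
        return True
    if isinstance(dtype, ListOf):
        return contains_map(dtype.item)
    if isinstance(dtype, StructOf):
        return any(contains_map(ftype) for _, ftype in dtype.fields)
    return False


def convert_column(values, dtype, convert):
    """Apply the per-element `convert` callback only when the type needs it."""
    if not contains_map(dtype):
        return values  # fast path: hand back the bulk-converted column untouched
    return [convert(v, dtype) for v in values]


# A primitive column never reaches the per-element callback.
calls = []
ints = [1, 2, 3]
same = convert_column(ints, Primitive("int"), lambda v, t: calls.append(v))
```

Because the check depends only on the column type, its cost is independent of the number of rows, which keeps the original bulk conversion path for primitive inputs.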

"""

qt_scalar_constant_nested_complex """
SELECT py_nested_complex_scalar(

This UNION ALL query has no ORDER BY and uses qt_ rather than order_qt_, so the three expected rows are not guaranteed to be returned in the order recorded in the .out file. Doris regression tests require deterministic output; please either use order_qt_scalar_constant_nested_complex or add an explicit ordering column around the union.
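The underlying issue can be illustrated with a small order-insensitive comparison. This is a conceptual sketch only: `rows_match_unordered` is a hypothetical helper, the row values are invented, and nothing here reflects how the Doris regression framework is actually implemented.

```python
def rows_match_unordered(actual, expected):
    """Compare two result sets while ignoring row order."""
    # Sorting by repr gives a stable order even if rows mix None and strings.
    return sorted(actual, key=repr) == sorted(expected, key=repr)


# Invented rows standing in for the three branches of a UNION ALL.
expected = [("row_a",), ("row_b",), ("row_c",)]
actual = [("row_c",), ("row_a",), ("row_b",)]  # same rows, different arrival order

strict_match = actual == expected
unordered_match = rows_match_unordered(actual, expected)
```

A strict positional comparison fails whenever the engine returns the branches in a different order, while the sorted comparison passes, which is why an unordered query needs either an explicit ORDER BY or an ordering-aware check.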

"""
} finally {
try_sql("DROP TABLE IF EXISTS test_pythonudf_nested_complex_type;")
}

The regression-test standard for Doris is to drop tables before use, not after, so the environment remains available for debugging when a test fails. This test already drops the table at the beginning; please remove the finally cleanup that drops it again after execution.
