[Fix](pyudf) Convert nested map value correctly #63907
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
/review
Summary opinion: Request changes. The PR fixes the intended nested MAP representation problem, but the current implementation also adds per-row Python conversion to every vectorized UDF argument, including primitive pandas/list arguments that do not need normalization, and the new regression test has deterministic-output/debuggability issues that should be fixed before merge.
Critical checkpoint conclusions:
- Goal/test: The goal is clear and the new test covers scalar, vector-list, vector-series, and mixed vector/scalar nested MAP cases, but one expected-result query is not ordered.
- Scope/focus: The feature change is small, but the conversion is applied more broadly than necessary and affects all vectorized UDF inputs.
- Concurrency/lifecycle/config/compatibility/persistence/data writes: No new concurrency, lifecycle management, configs, storage format, persistence, or transaction behavior found in this PR.
- Parallel paths: The scalar and vectorized Python UDF argument paths were both updated, but the vectorized primitive path now pays the nested conversion cost unnecessarily.
- Error handling/observability: Existing exception propagation/logging remains unchanged; no new observability requirement found.
- Test coverage/results: Coverage is relevant, but the new regression test violates Doris ordering and cleanup standards.
- User focus: No additional user-provided review focus was specified.
- Local verification: I inspected the diff and review context; I could not run a PyArrow reproduction because pyarrow is not installed in this runner environment.
      elif vec_type == VectorType.PANDAS_SERIES:
-         return arrow_array.to_pandas()
+         return arrow_array.to_pandas().apply(
+             lambda value: convert_arrow_value_to_python(value, arrow_array.type)
This now runs a Python-level recursive conversion for every element of every vectorized UDF argument, even when arrow_array.type is a primitive type that cannot contain a nested MAP. For pd.Series this replaces the previous arrow_array.to_pandas() fast path with .apply(lambda ...) over the whole column, so existing vectorized Python UDFs on primitive columns regress even though they do not need this fix. Please gate the recursive conversion to Arrow types that can actually contain nested values needing normalization, and keep the old direct to_pandas()/to_pylist() path for primitive/non-nested inputs.
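One way the requested gating could look is sketched below. It is not the PR's code: to stay runnable without PyArrow, Arrow types are modelled with hypothetical descriptors (plain strings for primitives, tuples for LIST, MAP, and STRUCT), and `contains_map`, `convert_column`, `bulk_convert`, and `per_value_convert` are illustrative names. The sketch assumes that any type containing a MAP anywhere needs normalization and that every other type can keep the bulk path.

```python
# Hypothetical type descriptors standing in for Arrow data types:
#   "int64"                          primitive
#   ("list", elem_type)              LIST / LARGE_LIST
#   ("map", key_type, item_type)     MAP
#   ("struct", {name: field_type})   STRUCT


def contains_map(dtype):
    """Return True if a MAP appears anywhere in the type tree."""
    if isinstance(dtype, str):
        return False  # primitives can never hold a nested MAP
    kind = dtype[0]
    if kind == "map":
        return True
    if kind == "list":
        return contains_map(dtype[1])
    if kind == "struct":
        return any(contains_map(t) for t in dtype[1].values())
    raise ValueError(f"unknown type descriptor: {dtype!r}")


def convert_column(values, dtype, bulk_convert, per_value_convert):
    """Use the per-value path only when the type can contain a MAP."""
    if contains_map(dtype):
        return [per_value_convert(v, dtype) for v in values]
    # Fast path: one bulk call, no Python-level work per element.
    return bulk_convert(values)
```

The type check runs once per column rather than once per row, so primitive inputs pay only a constant cost before falling through to the bulk conversion.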
"""

qt_scalar_constant_nested_complex """
SELECT py_nested_complex_scalar(
This UNION ALL query has no ORDER BY and uses qt_ rather than order_qt_, so the three expected rows are not guaranteed to be returned in the order recorded in the .out file. Doris regression tests require deterministic output; please either use order_qt_scalar_constant_nested_complex or add an explicit ordering column around the union.
"""
} finally {
    try_sql("DROP TABLE IF EXISTS test_pythonudf_nested_complex_type;")
}
The regression-test standard for Doris is to drop tables before use, not after, so the environment remains available for debugging when a test fails. This test already drops the table at the beginning; please remove the finally cleanup that drops it again after execution.
Problem Summary:
Fix Python UDF nested complex type conversion when MAP appears inside ARRAY, STRUCT, or vectorized inputs.

Previously, Python UDF argument conversion mostly relied on PyArrow's default conversions (Scalar.as_py(), Array.to_pylist(), Array.to_pandas()). Those APIs convert a top-level Arrow MAP into Python-friendly values in some paths, but nested MAP values are exposed as list-of-tuples. For example, ARRAY<MAP<STRING, INT>> could arrive in Python as [[('a', 1)]] instead of [{'a': 1}]. This made user UDF code see nested maps as list instead of dict.

This PR introduces a recursive Arrow-value conversion helper and applies it consistently across Python UDF argument conversion paths. The helper manually reconstructs Python values according to the Arrow type:
- MAP -> dict
- LIST/LARGE_LIST -> list
- STRUCT -> dict

before
now:
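Separately from the PR's own before/now output, the type mapping listed in this summary can be illustrated with a minimal sketch. This is not the helper from the PR: `convert_value` and the tuple-based type descriptors are hypothetical stand-ins so the example runs without PyArrow, and the input is assumed to be the already-materialized Python value in which nested maps appear as list-of-tuples.

```python
# Hypothetical type descriptors standing in for Arrow data types:
#   "int64"                          primitive
#   ("list", elem_type)              LIST / LARGE_LIST
#   ("map", key_type, item_type)     MAP
#   ("struct", {name: field_type})   STRUCT


def convert_value(value, dtype):
    """Recursively rebuild a Python value so every MAP becomes a dict."""
    if value is None or isinstance(dtype, str):
        return value  # nulls and primitives pass through unchanged
    kind = dtype[0]
    if kind == "list":
        return [convert_value(v, dtype[1]) for v in value]
    if kind == "map":
        item_type = dtype[2]
        # Accept list-of-tuples (nested case) or an existing dict.
        pairs = value.items() if isinstance(value, dict) else value
        # Keys are assumed to be hashable primitives.
        return {k: convert_value(v, item_type) for k, v in pairs}
    if kind == "struct":
        return {
            name: convert_value(value.get(name), field_type)
            for name, field_type in dtype[1].items()
        }
    raise ValueError(f"unknown type descriptor: {dtype!r}")


# ARRAY<MAP<STRING, INT>> from the example in the summary.
array_of_map = ("list", ("map", "string", "int32"))
print(convert_value([[("a", 1)]], array_of_map))  # [{'a': 1}]
```

Driving the recursion by the declared type rather than by the shape of the value avoids mistaking a genuine list of two-element tuples for a map.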