[Enhancement](udf) Do not check file when inline code exists #63906

Open
linrrzqqq wants to merge 1 commit into apache:master from linrrzqqq:pyudf-prefer-inline-code
@linrrzqqq
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Before this change, the FE parsed and validated the FILE property of a CREATE FUNCTION statement even when inline code was present. At execution time, however, the inline code takes precedence, so FILE is never used, even if it is valid. When inline code is present, the FE can therefore skip the FILE check when creating the function.
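
The analysis-time behavior described here can be modeled in a few lines. This is an illustrative Python sketch, not the FE's Java code: `analyze_create_function`, `check_file`, `unreachable`, and the dictionary layout are hypothetical names chosen for the example, and the error raised when neither source is given is an assumption of the toy model.

```python
from typing import Callable, Optional


def analyze_create_function(
    properties: dict,
    inline_code: Optional[str],
    check_file: Callable[[str], str],
) -> dict:
    """Toy model of CREATE FUNCTION analysis (hypothetical, not Doris code).

    check_file stands in for whatever resolves the FILE URL and computes its
    checksum; it is expected to raise if the file cannot be read.
    """
    analyzed = {"symbol": properties.get("symbol"), "function_code": inline_code}
    has_inline = bool(inline_code and inline_code.strip())
    file_url = properties.get("file")

    if has_inline:
        # Inline code is what gets executed, so FILE is neither resolved nor checksummed.
        analyzed["checksum"] = None
    elif file_url:
        analyzed["checksum"] = check_file(file_url)
    else:
        # Assumption for this toy model only.
        raise ValueError("either inline code or a FILE property is required")
    return analyzed


def unreachable(url: str) -> str:
    # Simulates a FILE URL that cannot be fetched.
    raise IOError(f"cannot compute checksum for {url}")


props = {"file": "http://127.0.0.1:12345/non_existent.zip", "symbol": "evaluate"}
result = analyze_create_function(props, "def evaluate(x):\n    return x + 100\n", unreachable)
print(result["checksum"])  # None: the unreachable FILE was never touched
```

Calling the same function with `inline_code=None` would invoke `check_file` and surface the IO error, which corresponds to the case where FILE is the only source.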

before

DROP FUNCTION IF EXISTS py_inline_file_udf(INT);
CREATE FUNCTION py_inline_file_udf(INT)
RETURNS INT
PROPERTIES (
  "type"="PYTHON_UDF",
  "file"="http://127.0.0.1:12345/non_existent.zip",
  "symbol"="evaluate",
  "runtime_version"="3.12.11",
  "always_nullable"="true"
)
AS $$
def evaluate(x):
    if x is None:
        return None
    return x + 100
$$;

SELECT py_inline_file_udf(val) FROM t_repro ORDER BY id;
-- errCode = 2, detailMessage = cannot to compute object's checksum.

now

Doris> SELECT py_inline_file_udf(val) FROM t_repro ORDER BY id;
+-------------------------+
| py_inline_file_udf(val) |
+-------------------------+
|                     110 |
|                     120 |
|                     130 |
+-------------------------+
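
The inline body is plain Python, so it can be sanity-checked outside Doris. The sketch below reuses the `evaluate` function from the CREATE FUNCTION statement; the inputs 10, 20, 30 are an assumption inferred from the result set, since the contents of `t_repro` are not shown in this PR.

```python
def evaluate(x):
    # Same body as the inline code in the CREATE FUNCTION statement.
    if x is None:
        return None
    return x + 100


# Assumed values of t_repro.val, inferred from the output 110, 120, 130.
assumed_vals = [10, 20, 30]
results = [evaluate(v) for v in assumed_vals]
print(results)         # [110, 120, 130]
print(evaluate(None))  # None
```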

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Collaborator Author

run buildall

@linrrzqqq
Collaborator Author

/review

Contributor

@github-actions (bot) left a comment

Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal and proof: the PR makes inline Python code take precedence over FILE validation/loading, and the added regression case covers inline code with a missing FILE path.
  • Scope: the implementation is small and focused in CREATE FUNCTION analysis plus one regression assertion.
  • Concurrency: no new shared mutable state or locking behavior is introduced.
  • Lifecycle/static initialization: no new lifecycle-managed objects or static initialization dependencies are introduced.
  • Configuration: no new configuration item is added.
  • Compatibility: no storage format or protocol field is changed; existing function_code and location fields are used. BE already prefers inline code before module location.
  • Parallel paths: scalar/UDAF/UDTF Python creation paths all receive the same analyzed functionCode behavior through analyzeCommon and their existing setters.
  • Conditional checks: the new skip condition is specific to inline code plus non-RPC file validation and matches the BE load precedence.
  • Tests: regression output was added for the inline-over-file case. I did not run the regression suite in this runner.
  • Observability: no new observability is needed for this analysis-time behavior.
  • Transaction/persistence/data writes: no transaction or data visibility path is affected; function metadata remains persisted through the existing Function fields.
  • FE/BE variables: no new FE-to-BE variable is added; existing function_code, hdfs_location, and checksum behavior remains consistent with BE selection logic.
  • Performance: avoids unnecessary URL resolution and checksum IO when inline Python code is authoritative.
  • User focus: no additional user-provided focus points were present.
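
The compatibility and conditional-check points rest on the BE preferring inline code over the module location. A minimal Python model of that precedence follows; `select_udf_source` and `UdfSource` are hypothetical names that do not come from the Doris code base, and the error for a function with neither source is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UdfSource:
    kind: str     # "inline" or "module"
    payload: str  # source text for inline code, location for a module


def select_udf_source(function_code: Optional[str], location: Optional[str]) -> UdfSource:
    """Pick what to load, inline code first (hypothetical model of the precedence)."""
    if function_code and function_code.strip():
        return UdfSource("inline", function_code)
    if location:
        return UdfSource("module", location)
    # Assumption for this sketch only.
    raise ValueError("function has neither inline code nor a module location")


missing_zip = "http://127.0.0.1:12345/non_existent.zip"
both = select_udf_source("def evaluate(x):\n    return x + 100\n", missing_zip)
file_only = select_udf_source(None, missing_zip)
print(both.kind)       # inline
print(file_only.kind)  # module
```

Because the location is never consulted in the first case, validating it at creation time adds IO without changing which code runs.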

@hello-stephen
Contributor

TPC-H: Total hot run time: 30777 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8f0d728d687d8e7209df94c7ca5d835052545dba, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17736	3971	3940	3940
q2	
q3	10760	1352	845	845
q4	4687	474	346	346
q5	7531	2199	2114	2114
q6	308	174	135	135
q7	946	792	635	635
q8	9362	1619	1589	1589
q9	6359	4947	4891	4891
q10	6429	2237	1866	1866
q11	431	274	245	245
q12	685	432	292	292
q13	18199	3380	2722	2722
q14	274	256	235	235
q15	
q16	816	777	703	703
q17	994	955	908	908
q18	6692	5700	5493	5493
q19	1282	1292	1024	1024
q20	528	400	255	255
q21	5639	2674	2243	2243
q22	423	354	296	296
Total cold run time: 100081 ms
Total hot run time: 30777 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4336	4274	4241	4241
q2	q3	4491	4922	4302	4302
q4	2066	2189	1381	1381
q5	4368	4270	4290	4270
q6	224	171	127	127
q7	2095	1916	1704	1704
q8	2506	2128	2079	2079
q9	8077	7936	7915	7915
q10	4808	4787	4292	4292
q11	616	504	386	386
q12	725	743	525	525
q13	3370	3653	2965	2965
q14	300	298	279	279
q15	q16	719	728	649	649
q17	1380	1304	1329	1304
q18	7811	7313	6812	6812
q19	1094	1053	1097	1053
q20	2208	2175	1967	1967
q21	5190	4482	4385	4385
q22	522	458	414	414
Total cold run time: 56906 ms
Total hot run time: 51050 ms
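
The totals in this report can be reproduced from the per-query rows. Judging from the numbers, the columns are query name, cold run, two hot runs, and the smaller of the two hot runs; that reading is an inference from the data, not something the bot states. A short Python check using the Round 1 rows (q2 and q15 carry no values and are left out):

```python
ROUND_1 = """
q1 17736 3971 3940 3940
q3 10760 1352 845 845
q4 4687 474 346 346
q5 7531 2199 2114 2114
q6 308 174 135 135
q7 946 792 635 635
q8 9362 1619 1589 1589
q9 6359 4947 4891 4891
q10 6429 2237 1866 1866
q11 431 274 245 245
q12 685 432 292 292
q13 18199 3380 2722 2722
q14 274 256 235 235
q16 816 777 703 703
q17 994 955 908 908
q18 6692 5700 5493 5493
q19 1282 1292 1024 1024
q20 528 400 255 255
q21 5639 2674 2243 2243
q22 423 354 296 296
"""

rows = []
for line in ROUND_1.strip().splitlines():
    name, cold, hot1, hot2, best = line.split()
    rows.append((name, int(cold), int(hot1), int(hot2), int(best)))

# In every row the last column equals the faster of the two hot runs.
last_column_is_min = all(best == min(hot1, hot2) for _, _, hot1, hot2, best in rows)

total_cold = sum(r[1] for r in rows)  # sum of the first timing column
total_hot = sum(r[4] for r in rows)   # sum of the last timing column
print(last_column_is_min)  # True
print(total_cold)          # 100081
print(total_hot)           # 30777
```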

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171979 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8f0d728d687d8e7209df94c7ca5d835052545dba, data reload: false

query5	4337	662	536	536
query6	344	220	193	193
query7	4243	575	320	320
query8	315	244	226	226
query9	8811	3985	3996	3985
query10	459	352	330	330
query11	5751	2395	2232	2232
query12	188	136	125	125
query13	1372	621	444	444
query14	6133	5485	5166	5166
query14_1	4500	4468	4475	4468
query15	220	208	186	186
query16	1055	449	418	418
query17	1131	735	584	584
query18	2517	476	345	345
query19	210	200	163	163
query20	142	131	124	124
query21	221	133	118	118
query22	13602	13525	13347	13347
query23	17257	16548	16098	16098
query23_1	16391	16236	16339	16236
query24	7493	1769	1332	1332
query24_1	1312	1289	1325	1289
query25	536	480	421	421
query26	1297	316	177	177
query27	2699	590	331	331
query28	4478	2045	2002	2002
query29	999	625	507	507
query30	310	240	200	200
query31	1138	1076	944	944
query32	88	80	74	74
query33	543	354	319	319
query34	1163	1127	653	653
query35	761	775	708	708
query36	1355	1406	1261	1261
query37	156	104	95	95
query38	3187	3171	3039	3039
query39	933	907	884	884
query39_1	881	875	892	875
query40	219	145	123	123
query41	75	72	61	61
query42	109	109	110	109
query43	325	342	287	287
query44	
query45	211	205	199	199
query46	1069	1181	721	721
query47	2314	2389	2274	2274
query48	397	419	294	294
query49	625	513	393	393
query50	997	332	256	256
query51	4439	4411	4345	4345
query52	108	109	93	93
query53	263	284	206	206
query54	324	273	253	253
query55	95	91	86	86
query56	306	321	319	319
query57	1479	1450	1382	1382
query58	302	275	269	269
query59	1718	1745	1541	1541
query60	328	332	309	309
query61	165	166	168	166
query62	709	668	597	597
query63	253	204	209	204
query64	2400	799	674	674
query65	
query66	2104	496	373	373
query67	29751	29677	29620	29620
query68	
query69	485	344	309	309
query70	1027	966	998	966
query71	309	283	275	275
query72	3176	2636	2462	2462
query73	845	745	438	438
query74	5099	4960	4784	4784
query75	2681	2604	2260	2260
query76	2312	1148	770	770
query77	415	407	338	338
query78	12462	12431	11912	11912
query79	1454	1006	762	762
query80	669	532	449	449
query81	456	280	240	240
query82	1405	157	121	121
query83	357	283	253	253
query84	256	139	112	112
query85	889	558	459	459
query86	394	325	343	325
query87	3446	3399	3228	3228
query88	3572	2717	2708	2708
query89	441	394	341	341
query90	1982	178	179	178
query91	212	171	135	135
query92	78	76	69	69
query93	1563	1541	920	920
query94	534	358	302	302
query95	655	387	355	355
query96	1027	786	363	363
query97	2734	2703	2662	2662
query98	237	227	224	224
query99	1173	1156	993	993
Total cold run time: 254953 ms
Total hot run time: 171979 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/1) 🎉
Increment coverage report
Complete coverage report
