[improve](streaming-job) support cdc_client JVM opts and adopt externally-managed cdc_client #63898

Open
JNSimba wants to merge 3 commits into apache:master from JNSimba:cdc-client-adopt-external

@JNSimba (Member) commented May 29, 2026

What

Two improvements to the BE-managed cdc_client lifecycle:

1. cdc_client_java_opts BE config + hardcoded OOM safety net

New be.conf entry to pass extra JVM options to the BE-forked cdc_client process. The value is whitespace-tokenized and inserted into the child argv before -jar.

-XX:+ExitOnOutOfMemoryError is hardcoded in BE and placed after the user opts, so the JVM's last-wins handling of repeated flags prevents cdc_client_java_opts from accidentally disabling it. This guarantees that every BE-forked cdc_client exits on OOM, so BE can detect the dead child and re-fork. Previously the JVM survived OOM in an unresponsive state, and BE kept reporting "CDC client X unresponsive" without restarting it.

Existing clusters get the OOM flag automatically by picking up the new BE binary; no be.conf edit required.
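
The argv assembly described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: tokenize_java_opts and build_cdc_client_argv are hypothetical names, the tokenizer splits on whitespace only (no quoting support is assumed), and the java binary and jar path are placeholders supplied by the caller.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a be.conf value on runs of whitespace.
// Quoted arguments containing spaces are NOT supported in this sketch.
std::vector<std::string> tokenize_java_opts(const std::string& opts) {
    std::istringstream in(opts);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) {
        tokens.push_back(tok);
    }
    return tokens;
}

// Hypothetical helper: java <user opts> -XX:+ExitOnOutOfMemoryError -jar <jar>.
// The hardcoded flag goes after the user opts so a conflicting user flag
// earlier on the command line does not take precedence.
std::vector<std::string> build_cdc_client_argv(const std::string& java_bin,
                                               const std::string& user_opts,
                                               const std::string& jar_path) {
    std::vector<std::string> argv;
    argv.push_back(java_bin);
    for (const auto& tok : tokenize_java_opts(user_opts)) {
        argv.push_back(tok);
    }
    argv.push_back("-XX:+ExitOnOutOfMemoryError");
    argv.push_back("-jar");
    argv.push_back(jar_path);
    return argv;
}
```

With user opts such as "-Xmx2g -XX:-ExitOnOutOfMemoryError", the hardcoded flag still lands after both tokens and immediately before -jar.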

The startup uses execv instead of execlp to support a variable-length argv. All heap-backed argv and path construction is done before fork(), and between fork() and execv() the child performs only async-signal-safe operations (open, dup2, close, execv, _exit). This avoids deadlocking on libc/libstdc++ locks inherited from BE worker threads.
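
A minimal sketch of that fork/exec discipline, assuming a POSIX host. spawn_prebuilt is a hypothetical helper, not the function in cdc_client_mgr.cpp, and the log redirection target is a placeholder; error handling is reduced to the essentials.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: run args[0] with stdout/stderr appended to log_path.
// Returns the child's pid in the parent, or -1 if fork() fails.
pid_t spawn_prebuilt(const std::vector<std::string>& args, const std::string& log_path) {
    // Build the char* view BEFORE fork(): no heap allocation happens after it.
    std::vector<char*> argv;
    argv.reserve(args.size() + 1);
    for (const auto& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);
    const char* log = log_path.c_str();

    pid_t pid = fork();
    if (pid == 0) {
        // Child: only async-signal-safe calls from here until execv().
        int fd = open(log, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd >= 0) {
            dup2(fd, STDOUT_FILENO);
            dup2(fd, STDERR_FILENO);
            if (fd > STDERR_FILENO) {
                close(fd);
            }
        }
        execv(argv[0], argv.data());
        _exit(127);  // reached only if execv() failed
    }
    return pid;
}
```

execv does not search PATH, so args[0] has to be a resolvable path to the executable in this sketch.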

cdc_client_java_opts is registered as DEFINE_String (immutable), consistent with cdc_client_port — admins change it via be.conf + BE restart. This prevents a data race between admin set_config() writes and start_cdc_client() reads.

2. Adopt externally-managed cdc_client

start_cdc_client() now probes 127.0.0.1:cdc_client_port/actuator/health before forking. If a healthy cdc_client is already listening (e.g. one started manually for debug / hotfix), BE adopts it and skips fork instead of fork-looping against a port it cannot bind. Edge cases:

  • Forked child binds the port, runs normally: unchanged (BE manages it).
  • BE-forked child died and user manually started a replacement on the same port: next RPC adopts the external instance.
  • User stops their external cdc_client: next RPC's probe fails, BE falls back to fork.
  • fork() returns success and health passes but the new child has already exited (port held by an external process answering health): treated as adoption rather than masking the dead PID as "Start success".
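
The edge cases above amount to a small decision procedure, sketched here with injected callbacks. StartOutcome, StartHooks and start_or_adopt are hypothetical names introduced for illustration; retries, timeouts and the actual HTTP probe are deliberately left out.

```cpp
#include <cassert>
#include <functional>

enum class StartOutcome { kAdoptedExternal, kManagedChild, kFailed };

// Hypothetical hooks standing in for the real operations.
struct StartHooks {
    std::function<bool()> probe_health;  // did the /actuator/health probe succeed?
    std::function<bool()> fork_child;    // did fork() + exec succeed?
    std::function<bool()> child_alive;   // is the forked PID still running?
};

StartOutcome start_or_adopt(const StartHooks& h) {
    // A healthy listener already exists: adopt it and skip fork.
    if (h.probe_health()) {
        return StartOutcome::kAdoptedExternal;
    }
    if (!h.fork_child()) {
        return StartOutcome::kFailed;
    }
    if (!h.probe_health()) {
        return StartOutcome::kFailed;
    }
    // Health passes, but if our child is gone the port belongs to someone else.
    return h.child_alive() ? StartOutcome::kManagedChild : StartOutcome::kAdoptedExternal;
}
```

The last line corresponds to the fourth edge case: a passing health check is attributed to the forked child only if that child is still alive.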

An atomic, edge-triggered _adopted_external flag throttles the "Adopting external cdc client" log so that each mode transition prints exactly once.
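
The edge-triggered behaviour can be sketched with std::atomic<bool>::exchange, which returns the previous value. AdoptionFlag and its method names are hypothetical; only the exchange(true) idiom appears in the PR's diff excerpt.

```cpp
#include <atomic>
#include <cassert>

class AdoptionFlag {
public:
    // True only on the managed -> adopted transition; the caller logs once.
    bool enter_adopted() { return !_adopted_external.exchange(true); }
    // True only on the adopted -> managed transition.
    bool enter_managed() { return _adopted_external.exchange(false); }
    bool adopted() const { return _adopted_external.load(); }

private:
    std::atomic<bool> _adopted_external{false};
};
```

Repeated calls in the same mode return false, so a caller that logs only when the call returns true emits one line per transition regardless of how many RPCs arrive.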

Tests

  • Existing cdc_client_mgr_test.cpp cases unchanged (all new lifecycle logic lives behind #ifndef BE_TEST).
  • Two new tests covering the _adopted_external flag default value and setter/getter round-trip.
  • Real adoption / probe / fallback path is not yet covered by unit tests because check_cdc_client_health is compiled out under BE_TEST. A follow-up PR will add a test seam (function-pointer indirection or local HTTP fixture) to exercise these paths.
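
One possible shape for the function-pointer seam mentioned in the last bullet, as a sketch only. HealthProbeFn, default_health_probe and CdcClientStarter are hypothetical names; the follow-up PR may choose a different design.

```cpp
#include <cassert>

// Hypothetical probe signature: returns true if a healthy client answers on port.
using HealthProbeFn = bool (*)(int port);

// Placeholder for the production probe; real code would issue the HTTP health check.
inline bool default_health_probe(int /*port*/) {
    return false;
}

class CdcClientStarter {
public:
    explicit CdcClientStarter(HealthProbeFn probe = &default_health_probe) : _probe(probe) {}

    // Returns true if an already-healthy client should be adopted.
    bool should_adopt(int port) const { return _probe(port); }

private:
    HealthProbeFn _probe;
};
```

A test can then pass a captureless lambda as the probe and drive the adoption and fallback branches without a real HTTP endpoint.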

Test plan

  • Unit: cdc_client_mgr_test
  • Manual: kill the BE-forked cdc_client, then start one by hand (nohup java -jar cdc-client.jar ...) on the same host; verify that BE adopts it without fork-looping (be.INFO shows a one-time "Adopting external cdc client on port 9096").
  • Manual: trigger OOM in cdc_client; verify JVM exits and BE forks a healthy replacement.

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba requested a review from Copilot on May 29, 2026, 08:05
@JNSimba (Member, Author) commented May 29, 2026

/review

@JNSimba (Member, Author) commented May 29, 2026

run buildall

Copilot AI (Contributor) left a comment

Pull request overview

This PR improves BE-managed cdc_client startup by adding configurable JVM options and allowing BE to adopt a healthy externally managed cdc_client already bound to the configured port.

Changes:

  • Adds cdc_client_java_opts config and passes tokenized JVM options before -jar.
  • Adds external health-probe adoption logic with _adopted_external state tracking.
  • Adds limited tests for the new adoption flag accessor behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File Description
conf/be.conf Adds shipped default JVM option for BE-forked cdc_client.
be/src/common/config.h Declares the new mutable JVM options config.
be/src/common/config.cpp Defines the new JVM options config.
be/src/runtime/cdc_client_mgr.h Adds adoption state and test-only accessors.
be/src/runtime/cdc_client_mgr.cpp Implements external adoption probing and variable-length Java argv construction.
be/test/runtime/cdc_client_mgr_test.cpp Adds tests for _adopted_external default and setter/getter behavior.


Comment thread be/src/common/config.cpp Outdated
Comment thread be/src/runtime/cdc_client_mgr.cpp
github-actions Bot (Contributor) left a comment

Requesting changes for one BE runtime lifecycle issue.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to add CDC client JVM options and adopt externally managed CDC clients. The high-level goal is clear, but the production-only option parsing/adoption paths are not covered by meaningful tests because the new logic is mostly behind #ifndef BE_TEST.
  • Scope: The change is focused, but the argv construction moved non-trivial work into the post-fork child path.
  • Concurrency: start_cdc_client() is serialized by _start_mutex, but BE is a multi-threaded process; after fork() only async-signal-safe operations are safe before exec. The new C++ allocations/parsing in the child violate that constraint.
  • Lifecycle/static init: No new cross-TU static initialization dependency found. The child process lifecycle is affected by the issue commented inline.
  • Configuration: cdc_client_java_opts is added and read at startup; runtime changes only affect subsequent forks, which is reasonable for JVM startup options.
  • Compatibility/protocol: No incompatible FE/BE protocol or storage-format change found.
  • Parallel paths: CDC client startup is centralized here; no separate startup path found needing the same option handling.
  • Conditional checks: The external adoption check is intentional and logged; no duplicate prior review thread exists.
  • Test coverage/results: The added tests only exercise the test accessor/default flag behavior, not the real external adoption or argv-building behavior.
  • Observability: Adoption and managed-start transitions have INFO logs; no additional metrics appear necessary for this small lifecycle change.
  • Transaction/persistence/data writes: No direct transaction or persistence changes. CDC requests can lead to downstream writes, but this PR only changes process startup.
  • FE/BE variables: No new thrift/protobuf variable passing.
  • Performance: No hot-path CPU issue found; the blocking issue is post-fork safety.
  • User focus: No additional user-provided review focus was supplied.

Comment thread be/src/runtime/cdc_client_mgr.cpp Outdated
@JNSimba (Member, Author) commented May 29, 2026

run buildall

@JNSimba (Member, Author) commented May 29, 2026

/review

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31600 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 289ae1528024e894f68e1b1144e360f1e382c9ca, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17898	4009	4009	4009
q2	
q3	10763	1432	810	810
q4	4681	482	356	356
q5	7607	2316	2178	2178
q6	250	176	136	136
q7	959	796	638	638
q8	9364	1699	1591	1591
q9	5155	4984	4931	4931
q10	6401	2194	1860	1860
q11	450	269	243	243
q12	695	423	295	295
q13	18222	3478	2804	2804
q14	266	262	238	238
q15	
q16	821	779	712	712
q17	1011	931	964	931
q18	6939	5722	5523	5523
q19	1237	1240	1068	1068
q20	687	470	293	293
q21	6178	2914	2671	2671
q22	586	361	313	313
Total cold run time: 100170 ms
Total hot run time: 31600 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4769	4758	4710	4710
q2	
q3	4992	5372	4660	4660
q4	2118	2198	1387	1387
q5	4982	4729	4708	4708
q6	253	193	129	129
q7	1945	1745	1574	1574
q8	2474	2128	2126	2126
q9	7964	7475	7402	7402
q10	4778	4722	4248	4248
q11	539	385	353	353
q12	731	751	580	580
q13	3019	3369	2830	2830
q14	281	272	251	251
q15	
q16	684	704	603	603
q17	1292	1260	1253	1253
q18	7315	6705	6806	6705
q19	1156	1116	1125	1116
q20	2216	2230	1953	1953
q21	5311	4593	4431	4431
q22	532	466	407	407
Total cold run time: 57351 ms
Total hot run time: 51426 ms

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines +164 to +169
    if (check_cdc_client_health(1, 0, adopt_response).ok()) {
        if (!_adopted_external.exchange(true)) {
            LOG(INFO) << "Adopting external cdc client on port "
                      << doris::config::cdc_client_port;
        }
        return Status::OK();
Comment thread be/src/common/config.cpp Outdated

DEFINE_Int32(cdc_client_port, "9096");

DEFINE_mString(cdc_client_java_opts, "");

github-actions Bot (Contributor) left a comment

I found one additional concurrency issue beyond the existing inline threads. The existing comments already cover the external-adoption validation risk, missing production-path adoption tests, and prior config/default discussion, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal and tests: the PR aims to add CDC client JVM opts and external process adoption. The JVM arg construction is now moved before fork and the OOM flag is hardcoded, but the external adoption behavior remains insufficiently tested as already noted in existing threads.
  • Scope: the change is mostly localized to CDC client startup/configuration.
  • Concurrency: start_cdc_client() is serialized by _start_mutex, but the new mutable string config can be changed concurrently through BE config APIs and is read without the required mutable-string config lock.
  • Lifecycle: CDC child fork/exec lifecycle is improved by prebuilding argv before fork. Existing review threads still cover adoption lifecycle risks around accepting an arbitrary healthy localhost service.
  • Configuration: a new mutable config is added; because it is mutable, runtime update semantics and thread-safe reads must be handled.
  • Compatibility and parallel paths: no storage/protocol compatibility issue found.
  • Tests: added tests only cover the test accessor state, not the production adoption probe/fallback path, as already raised.
  • Observability: adoption/start logs are present; no additional observability issue found beyond the validation/test concerns.
  • Transactions/persistence/data writes: not applicable.
  • Performance: no meaningful additional performance issue found.

User focus points: no additional user-provided review focus was supplied.


// Pre-build everything the child needs before fork(): heap allocation after
// fork() in a multi-threaded process can deadlock on inherited libc locks.
std::vector<std::string> argv_storage;

cdc_client_java_opts is registered with DEFINE_mString, so it can be changed at runtime through set_config(). String config updates are protected by config::get_mutable_string_config_lock() when assigning the underlying std::string, but this read copies the same string without that lock. A concurrent CDC request that reaches start_cdc_client() while an operator updates cdc_client_java_opts can race on the std::string, which is undefined behavior. Please either make this startup-only option immutable (DEFINE_String) or copy it while holding *config::get_mutable_string_config_lock() before parsing it.
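
The second option in the comment above (copy under the lock, then parse the copy) might look like the sketch below. The globals are stand-ins with assumed shapes, not the Doris declarations: g_mutable_string_config_lock plays the role of *config::get_mutable_string_config_lock() and g_cdc_client_java_opts the role of config::cdc_client_java_opts.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Stand-ins for the real config symbols (assumed shapes, for illustration only).
static std::mutex g_mutable_string_config_lock;
static std::string g_cdc_client_java_opts;

// Writer side: what a runtime config update is assumed to do.
void set_java_opts(const std::string& value) {
    std::lock_guard<std::mutex> guard(g_mutable_string_config_lock);
    g_cdc_client_java_opts = value;
}

// Reader side: take a private copy under the lock; parsing then works on the
// copy and cannot race with a concurrent assignment.
std::string snapshot_java_opts() {
    std::lock_guard<std::mutex> guard(g_mutable_string_config_lock);
    return g_cdc_client_java_opts;
}
```

Making the option immutable, the first alternative named in the comment, removes the need for the lock entirely because the string is never reassigned after startup.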

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 172567 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 289ae1528024e894f68e1b1144e360f1e382c9ca, data reload: false

query5	4331	672	527	527
query6	351	245	209	209
query7	4290	561	331	331
query8	327	234	232	232
query9	8842	4076	4035	4035
query10	457	360	309	309
query11	5790	2604	2343	2343
query12	180	129	127	127
query13	1274	607	448	448
query14	6052	5480	5165	5165
query14_1	4464	4515	4485	4485
query15	215	206	195	195
query16	1009	455	464	455
query17	965	753	632	632
query18	2466	516	379	379
query19	223	217	172	172
query20	141	153	135	135
query21	218	146	114	114
query22	13588	13649	13358	13358
query23	17404	16444	16157	16157
query23_1	16269	16306	16431	16306
query24	7567	1782	1338	1338
query24_1	1303	1343	1300	1300
query25	554	491	419	419
query26	1309	325	180	180
query27	2672	560	350	350
query28	4432	2016	2028	2016
query29	1000	618	505	505
query30	309	241	201	201
query31	1118	1082	965	965
query32	91	77	71	71
query33	556	365	308	308
query34	1179	1150	658	658
query35	788	807	702	702
query36	1437	1379	1305	1305
query37	162	105	95	95
query38	3196	3165	3135	3135
query39	922	911	907	907
query39_1	871	889	885	885
query40	232	149	128	128
query41	68	61	66	61
query42	113	109	108	108
query43	331	332	287	287
query44	
query45	223	205	196	196
query46	1091	1238	738	738
query47	2410	2391	2293	2293
query48	426	406	309	309
query49	631	502	403	403
query50	970	355	261	261
query51	4470	4365	4244	4244
query52	105	108	95	95
query53	258	281	210	210
query54	316	275	257	257
query55	92	92	85	85
query56	307	323	296	296
query57	1452	1414	1347	1347
query58	319	282	275	275
query59	1577	1654	1486	1486
query60	325	326	342	326
query61	162	157	181	157
query62	707	665	588	588
query63	239	195	202	195
query64	2423	855	659	659
query65	
query66	1727	481	362	362
query67	29912	29821	29623	29623
query68	
query69	471	355	296	296
query70	1020	1066	1020	1020
query71	312	273	269	269
query72	2974	2786	2467	2467
query73	876	822	476	476
query74	5096	4947	4800	4800
query75	2698	2636	2294	2294
query76	2297	1168	743	743
query77	409	416	344	344
query78	12386	12463	11953	11953
query79	1266	1039	769	769
query80	579	533	487	487
query81	449	283	254	254
query82	243	162	120	120
query83	278	274	252	252
query84	282	137	107	107
query85	868	540	462	462
query86	366	327	355	327
query87	3388	3385	3257	3257
query88	3597	2724	2717	2717
query89	428	393	348	348
query90	2179	189	180	180
query91	179	173	140	140
query92	82	77	76	76
query93	1490	1461	940	940
query94	528	352	316	316
query95	692	387	349	349
query96	1124	851	369	369
query97	2716	2737	2597	2597
query98	239	239	246	239
query99	1166	1157	1035	1035
Total cold run time: 252788 ms
Total hot run time: 172567 ms

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31124 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 95a415ea37740c5009c7af5ce4d8ca641c81d442, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17615	3966	3980	3966
q2	
q3	10812	1358	818	818
q4	4691	468	346	346
q5	7574	2223	2127	2127
q6	327	177	137	137
q7	957	764	642	642
q8	9393	1833	1604	1604
q9	6919	4946	4904	4904
q10	6422	2199	1869	1869
q11	429	275	249	249
q12	695	422	300	300
q13	18205	3367	2833	2833
q14	260	261	235	235
q15	
q16	819	777	708	708
q17	1011	981	894	894
q18	6745	5663	5647	5647
q19	1194	1249	1048	1048
q20	504	426	261	261
q21	5732	2593	2235	2235
q22	427	349	301	301
Total cold run time: 100731 ms
Total hot run time: 31124 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4312	4244	4258	4244
q2	
q3	4546	4902	4347	4347
q4	2086	2196	1393	1393
q5	4404	4258	4310	4258
q6	235	261	245	245
q7	2103	1880	1659	1659
q8	2499	2139	2111	2111
q9	8010	7836	7824	7824
q10	4780	4732	4309	4309
q11	761	419	382	382
q12	741	760	531	531
q13	3334	3600	2958	2958
q14	291	317	281	281
q15	
q16	710	724	617	617
q17	1346	1322	1338	1322
q18	7806	7380	6799	6799
q19	1152	1090	1137	1090
q20	2210	2223	1940	1940
q21	5244	4551	4460	4460
q22	515	451	401	401
Total cold run time: 57085 ms
Total hot run time: 51171 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 172750 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 95a415ea37740c5009c7af5ce4d8ca641c81d442, data reload: false

query5	4318	688	533	533
query6	346	226	203	203
query7	4222	558	296	296
query8	329	239	223	223
query9	8794	4124	4151	4124
query10	466	360	318	318
query11	5812	2502	2268	2268
query12	182	129	131	129
query13	1279	635	453	453
query14	6089	5485	5178	5178
query14_1	4527	4480	4490	4480
query15	218	210	188	188
query16	1032	501	449	449
query17	1169	748	612	612
query18	2727	506	372	372
query19	224	214	174	174
query20	140	134	139	134
query21	225	144	130	130
query22	13597	13540	13320	13320
query23	17293	16521	16141	16141
query23_1	16421	16368	16397	16368
query24	7454	1797	1350	1350
query24_1	1374	1363	1351	1351
query25	609	510	459	459
query26	1313	329	185	185
query27	2692	559	359	359
query28	4429	2064	2057	2057
query29	1041	636	518	518
query30	305	243	202	202
query31	1122	1081	982	982
query32	96	78	77	77
query33	567	373	316	316
query34	1210	1181	648	648
query35	765	801	714	714
query36	1443	1398	1220	1220
query37	152	103	90	90
query38	3202	3174	3094	3094
query39	921	926	906	906
query39_1	869	861	885	861
query40	227	151	127	127
query41	65	64	63	63
query42	111	114	108	108
query43	340	342	291	291
query44	
query45	218	203	198	198
query46	1087	1177	786	786
query47	2425	2381	2283	2283
query48	421	428	299	299
query49	654	500	393	393
query50	1038	342	257	257
query51	4437	4371	4284	4284
query52	106	109	95	95
query53	258	289	219	219
query54	339	298	271	271
query55	101	92	101	92
query56	315	323	307	307
query57	1444	1487	1439	1439
query58	299	282	282	282
query59	1733	1767	1506	1506
query60	359	335	324	324
query61	165	158	149	149
query62	697	660	588	588
query63	259	203	209	203
query64	2422	807	650	650
query65	
query66	1678	485	355	355
query67	29836	29681	29548	29548
query68	
query69	467	351	307	307
query70	1035	1000	989	989
query71	305	314	269	269
query72	3106	2696	2456	2456
query73	896	758	416	416
query74	5110	5011	4836	4836
query75	2674	2610	2259	2259
query76	2310	1153	817	817
query77	413	430	347	347
query78	12438	12477	11948	11948
query79	1454	1014	795	795
query80	797	541	462	462
query81	471	292	245	245
query82	1366	162	120	120
query83	351	276	255	255
query84	263	140	111	111
query85	915	553	457	457
query86	440	334	341	334
query87	3424	3420	3234	3234
query88	3673	2798	2764	2764
query89	459	390	344	344
query90	1854	188	189	188
query91	181	170	145	145
query92	82	82	83	82
query93	1514	1428	930	930
query94	603	362	327	327
query95	672	392	424	392
query96	1108	797	356	356
query97	2733	2785	2573	2573
query98	240	226	231	226
query99	1176	1155	1036	1036
Total cold run time: 255439 ms
Total hot run time: 172750 ms

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.99% (20999/38893)
Line Coverage 37.54% (199047/530267)
Region Coverage 33.81% (155963/461263)
Branch Coverage 34.80% (67891/195080)

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 84.00% (21/25) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.99% (20999/38893)
Line Coverage 37.54% (199060/530289)
Region Coverage 33.79% (155874/461266)
Branch Coverage 34.80% (67896/195084)
