fix: log uncaught exceptions from CliRunner main so JobManager pod logs capture them by velo · Pull Request #342 · DataSQRL/flink-sql-runner

velo · 2026-05-22T15:28:06Z

Problem

When the SQL job submitted via CliRunner fails at planning/submission time (e.g. an invalid EXECUTE STATEMENT SET BEGIN ... END), the JobManager pod restarts in a loop with no errors or warnings in kubectl logs. The only place the actual exception is visible is kubectl describe pod, where the Flink Kubernetes Operator (which polls the Flink REST API) surfaces it as a Kubernetes Event.

This leaves log scrapers (Fluent Bit, Loki, Datadog, etc.) blind to the failure: they see only the restarting JobManager (JM), with no stack trace.

Fix

Wrap CliRunner.main() in a try { ... } catch (Throwable t) that calls log.error("flink-sql-runner failed", t) and rethrows. The rethrow preserves Flink's existing exception reporting through PackagedProgram (so the operator's REST-based status path still works), while the explicit log.error puts the full stack trace on stdout where log scrapers can pick it up.
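
A minimal sketch of the log-and-rethrow shape described above, not the actual CliRunner code. CliRunnerSketch, Job, and runLogged are hypothetical names, and java.util.logging stands in for the project's logger so the example compiles without dependencies (its default console handler writes to stderr rather than stdout).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for CliRunner; only the log-and-rethrow shape mirrors the PR.
final class CliRunnerSketch {

    // Assumption: java.util.logging replaces the real logger to keep the sketch dependency-free.
    private static final Logger LOG = Logger.getLogger(CliRunnerSketch.class.getName());

    // Hypothetical abstraction over the work the entry point performs.
    @FunctionalInterface
    interface Job {
        void run() throws Exception;
    }

    // Runs the job; on any failure, logs the full stack trace and rethrows the same exception.
    static void runLogged(Job job) throws Exception {
        try {
            job.run();
        } catch (Throwable t) {
            // The stack trace reaches the log before the exception propagates to the caller.
            LOG.log(Level.SEVERE, "flink-sql-runner failed", t);
            // Precise rethrow: the caller receives the original exception, unchanged.
            throw t;
        }
    }
}
```

Catching Throwable rather than Exception means errors such as NoClassDefFoundError are logged as well. Because the same object is rethrown, whatever invoked the entry point sees the failure exactly as it did before the change.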

Why not `System.exit(1)` + `LogManager.shutdown()`?

In mode: standalone, the JobManager JVM keeps running after main() returns or throws: Flink reports the failure via REST and the operator tears the pod down later. Forcing System.exit would kill the JM hard and bypass the REST exception path. log4j2's default Console appender uses immediateFlush=true, so the log.error line is on stdout before the exception is rethrown, and no explicit flush is needed.
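
The ordering argument can be illustrated with a small self-contained sketch. It assumes the caller invokes the entry point reflectively, which is an assumption about how PackagedProgram behaves rather than something stated in this PR. RethrowOrderSketch, entryPoint, invokeEntryPoint, and EVENTS are hypothetical names, and a plain list stands in for real log output.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from flink-sql-runner or Flink.
final class RethrowOrderSketch {

    // Ordered trace of what happened; stands in for real log output.
    static final List<String> EVENTS = new ArrayList<>();

    // Plays the role of the program's main(): record the failure, then rethrow it.
    public static void entryPoint(String[] args) {
        try {
            throw new IllegalStateException("invalid plan");
        } catch (RuntimeException e) {
            EVENTS.add("logged: " + e.getMessage());
            throw e; // no System.exit here, so the calling JVM keeps running
        }
    }

    // Plays the role of a framework that calls the entry point reflectively.
    static Throwable invokeEntryPoint() {
        try {
            Method m = RethrowOrderSketch.class.getMethod("entryPoint", String[].class);
            m.invoke(null, (Object) new String[0]);
            return null; // entry point completed normally
        } catch (InvocationTargetException e) {
            // The caller still receives the original failure as the cause.
            EVENTS.add("caller saw: " + e.getCause().getMessage());
            return e.getCause();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflection failed", e);
        }
    }
}
```

In this sketch the "logged" event is always recorded before the caller observes the exception, and the caller remains free to report it through its own channel. A System.exit call inside the catch block would end the JVM before invokeEntryPoint could do so.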

Test plan

CI builds and existing tests pass
Deploy to a test cluster and trigger a deliberately broken SQL plan; verify the stack trace now appears in kubectl logs of the JobManager pod, not just kubectl describe pod

velo added 2 commits May 22, 2026 12:27

fix: log uncaught exceptions from CliRunner main so JobManager pod logs capture them

9c63406

Signed-off-by: Marvin Froeder <marvin@datasqrl.com>

fix: narrow CliRunner main try/catch to wrap only run() and use explicit SqlRunner type

d3fb1ff

Signed-off-by: Marvin Froeder <marvin@datasqrl.com>

velo requested a review from ferenc-csaky May 22, 2026 15:38

velo enabled auto-merge (squash) May 22, 2026 15:38

add test case, make comment more concise

ca846ff

ferenc-csaky disabled auto-merge May 22, 2026 17:12

ferenc-csaky enabled auto-merge (squash) May 22, 2026 17:13

ferenc-csaky approved these changes May 22, 2026

ferenc-csaky added the bug (Something isn't working) label May 22, 2026

ferenc-csaky added this to the 0.10.3 milestone May 22, 2026

ferenc-csaky merged commit edf2370 into main May 22, 2026
19 checks passed

ferenc-csaky deleted the fix/log-main-method-exceptions branch May 22, 2026 17:23

ferenc-csaky merged 3 commits into main from fix/log-main-method-exceptions
