fix: log uncaught exceptions from CliRunner main so JobManager pod logs capture them #342

Merged: ferenc-csaky merged 3 commits into main from fix/log-main-method-exceptions on May 22, 2026
velo (Collaborator) commented May 22, 2026

Problem

When a SQL job submitted via CliRunner fails at planning or submission time (e.g. an invalid EXECUTE STATEMENT SET BEGIN ... END), the JobManager pod restarts in a loop and kubectl logs shows no errors or warnings. The actual exception is visible only via kubectl describe pod, where the Flink Kubernetes Operator (which polls the Flink REST API) surfaces it as a Kubernetes Event.

This leaves log scrapers (Fluent Bit, Loki, Datadog, etc.) blind to the failure: they see only the looping JobManager (JM), with no stack trace.

Fix

Wrap CliRunner.main() in a try { ... } catch (Throwable t) that calls log.error("flink-sql-runner failed", t) and rethrows. The rethrow preserves Flink's existing exception reporting through PackagedProgram (so the operator's REST-based status path still works), while the explicit log.error puts the full stack trace on stdout where log scrapers can pick it up.
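
For illustration, a minimal sketch of the log-and-rethrow pattern described above. This is not the actual diff: `LoggingMainSketch`, `ThrowingBody`, and `runLogged` are hypothetical names, and `java.util.logging` stands in for the project's logger only so the snippet compiles without dependencies. Note that the JDK's default console handler writes to stderr, whereas the PR relies on the log4j2 Console appender.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingMainSketch {

    // Stand-in logger (assumption): the real code calls log.error(...) on its own logger.
    private static final Logger LOG = Logger.getLogger(LoggingMainSketch.class.getName());

    /** Hypothetical placeholder for whatever the real main() body does. */
    @FunctionalInterface
    public interface ThrowingBody {
        void run() throws Exception;
    }

    /** Runs the body; on any failure, logs the full stack trace and rethrows unchanged. */
    public static void runLogged(ThrowingBody body) throws Exception {
        try {
            body.run();
        } catch (Throwable t) {
            // Log first, so the stack trace reaches the log output...
            LOG.log(Level.SEVERE, "flink-sql-runner failed", t);
            // ...then rethrow the same instance, so the caller still sees the failure.
            throw t;
        }
    }

    public static void main(String[] args) {
        try {
            runLogged(() -> {
                throw new IllegalStateException("simulated planning failure");
            });
        } catch (Exception e) {
            // In Flink the caller would be PackagedProgram; here we only show propagation.
            System.out.println("propagated to caller: " + e.getMessage());
        }
    }
}
```

Because the catch parameter is rethrown as-is, the caller receives the original exception type and stack trace; the wrapper only adds the log line.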

Why not System.exit(1) + LogManager.shutdown()?

In mode: standalone the JobManager JVM keeps running after main() returns/throws — Flink reports the failure via REST and the operator tears the pod down later. Forcing System.exit would kill the JM hard and bypass the REST exception path. log4j2's default Console appender uses immediateFlush=true, so the log.error line is on stdout before the throw executes — no explicit flush needed.
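
The ordering argument above (the log line is flushed before the throw propagates) can be sketched with a stand-in. This example does not use log4j2: it assumes a hypothetical `ImmediateFlushHandler` built on `java.util.logging` that flushes after every record, roughly mirroring what immediateFlush=true does for the Console appender, and captures output in memory instead of stdout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

public class FlushBeforeThrowSketch {

    /** Hypothetical handler: flushes after each record, like an immediate-flush appender. */
    public static final class ImmediateFlushHandler extends StreamHandler {
        public ImmediateFlushHandler(ByteArrayOutputStream sink) {
            super(sink, new SimpleFormatter());
        }

        @Override
        public void publish(LogRecord record) {
            super.publish(record);
            flush(); // push the record to the sink before returning to the caller
        }
    }

    /** Returns what the sink contains at the moment the rethrown exception reaches the caller. */
    public static String captureAtThrow() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Logger log = Logger.getLogger("flush-before-throw-sketch");
        log.setUseParentHandlers(false); // keep the demo output in the in-memory sink only
        ImmediateFlushHandler handler = new ImmediateFlushHandler(sink);
        log.addHandler(handler);
        try {
            try {
                throw new IllegalStateException("simulated planning failure");
            } catch (Throwable t) {
                log.log(Level.SEVERE, "flink-sql-runner failed", t);
                throw t; // no explicit flush here: the handler already flushed
            }
        } catch (IllegalStateException propagated) {
            // Snapshot taken after the rethrow, without any further flushing.
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            log.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        System.out.println(captureAtThrow());
    }
}
```

With a buffering handler and no flush, the same snapshot could be empty, which is the situation an explicit flush or shutdown call would otherwise have to cover.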

Test plan

  • CI builds and existing tests pass
  • Deploy to a test cluster and trigger a deliberately broken SQL plan; verify the stack trace now appears in kubectl logs of the JobManager pod, not just kubectl describe pod

velo added 2 commits May 22, 2026 12:27
…gs capture them

Signed-off-by: Marvin Froeder <marvin@datasqrl.com>
…cit SqlRunner type

Signed-off-by: Marvin Froeder <marvin@datasqrl.com>
velo requested a review from ferenc-csaky May 22, 2026 15:38
velo enabled auto-merge (squash) May 22, 2026 15:38
ferenc-csaky disabled auto-merge May 22, 2026 17:12
ferenc-csaky enabled auto-merge (squash) May 22, 2026 17:13
ferenc-csaky added the bug label May 22, 2026
ferenc-csaky added this to the 0.10.3 milestone May 22, 2026
ferenc-csaky merged commit edf2370 into main May 22, 2026
19 checks passed
ferenc-csaky deleted the fix/log-main-method-exceptions branch May 22, 2026 17:23
Labels

bug Something isn't working

2 participants