fix: log uncaught exceptions from CliRunner main so JobManager pod logs capture them #342
Merged
…gs capture them Signed-off-by: Marvin Froeder <marvin@datasqrl.com>
…cit SqlRunner type Signed-off-by: Marvin Froeder <marvin@datasqrl.com>
ferenc-csaky
approved these changes
May 22, 2026
Problem
When the SQL job submitted via `CliRunner` fails at planning/submission time (e.g. an invalid `EXECUTE STATEMENT SET BEGIN ... END`), the JobManager pod restarts in a loop with no errors or warnings in `kubectl logs`. The only place the actual exception is visible is `kubectl describe pod`, where it is surfaced as a Kubernetes Event by the Flink Kubernetes Operator (which polls the Flink REST API).
This makes log scrapers (Fluent Bit, Loki, Datadog, etc.) completely blind to the failure: they only see the looping JM with no stack trace.
Fix
Wrap `CliRunner.main()` in a `try { ... } catch (Throwable t)` that calls `log.error("flink-sql-runner failed", t)` and rethrows. The rethrow preserves Flink's existing exception reporting through `PackagedProgram` (so the operator's REST-based status path still works), while the explicit `log.error` puts the full stack trace on stdout where log scrapers can pick it up.
Why not `System.exit(1)` + `LogManager.shutdown()`?
In `mode: standalone` the JobManager JVM keeps running after `main()` returns/throws: Flink reports the failure via REST and the operator tears the pod down later. Forcing `System.exit` would kill the JM hard and bypass the REST exception path. log4j2's default Console appender uses `immediateFlush=true`, so the `log.error` line is on stdout before the `throw` executes and no explicit flush is needed.
Test plan
`kubectl logs` of the JobManager pod, not just `kubectl describe pod`
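The log-and-rethrow wrap described under Fix can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual patch: `CliRunnerSketch` and `run` are hypothetical names, and `java.util.logging` stands in for the project's `log.error` call. Unlike the log4j2 Console appender the PR relies on, the default `java.util.logging` console handler writes to stderr.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for CliRunner; only the try/catch shape mirrors the PR.
class CliRunnerSketch {
    private static final Logger LOG =
            Logger.getLogger(CliRunnerSketch.class.getName());

    // Placeholder for the real planning/submission logic (assumption):
    // fails when no script argument is given.
    static void run(String[] args) throws Exception {
        if (args.length == 0) {
            throw new IllegalArgumentException("no SQL script given");
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            run(args);
        } catch (Throwable t) {
            // Log the full stack trace first so it reaches the process log...
            LOG.log(Level.SEVERE, "flink-sql-runner failed", t);
            // ...then rethrow the same exception so the caller's own
            // error reporting still sees it. No System.exit here.
            throw t;
        }
    }
}
```

Because `t` is effectively final, Java's precise rethrow lets `throw t` compile under `throws Exception` even though the catch parameter is declared as `Throwable`.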