[FLINK-39264][docs] Add docs for application management#27818
RocMarshal merged 3 commits into apache:master
| Even after all applications are finished, the cluster (and the JobManager) will | ||
| keep running until the session is manually stopped. The lifetime of a Flink | ||
| Session Cluster is therefore not bound to the lifetime of any Flink Job. | ||
| Session Cluster is therefore not bound to the lifetime of any Flink Application. |
of any Flink Application. -> of any Flink Application or job.
I've updated the documentation accordingly.
| * **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a | ||
| pre-existing, long-running cluster that can accept multiple job submissions. | ||
| Even after all jobs are finished, the cluster (and the JobManager) will | ||
| pre-existing, long-running cluster that can accept multiple application submissions. |
I suggest that a hyperlink to the definition of application, or a quick summary, would be useful.
I wonder whether the text still mentions jobs as well as applications. Or is every job now in an application?
Regarding the definition of "application," it is actually mentioned at the very beginning of this section (Flink Application Execution) to set the context.
Your understanding is correct: a job now is always submitted by an application and is considered part of it. My take is that the key distinction to emphasize here is that a Session Cluster accepts multiple application submissions, which is a key difference from Application Mode.
| #### ApplicationResultStore | ||
| The ApplicationResultStore is a Flink component that persists the results of terminated |
It would be worth going into more detail about what we mean by "results". Is this the last checkpoint / savepoint?
The Application Result primarily contains high-level, final information about the application's execution. This includes its application ID, its final status (e.g., FINISHED, FAILED, CANCELED), and its name. It doesn't refer to the last checkpoint or savepoint, but rather the overall outcome.
Essentially, it's the application-level equivalent of a JobResult. I've added a brief note to clarify this. Thanks!
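To make the shape of such a result concrete, here is a minimal Java sketch. The class name `ApplicationResultSketch`, the field names, and the `Status` enum are hypothetical illustrations and are not Flink's actual classes; only the three pieces of information (ID, name, final status) and the example status values are taken from the reply above.

```java
// Hypothetical sketch; NOT Flink's actual application result class.
final class ApplicationResultSketch {

    // Final statuses given as examples in the reply above.
    enum Status { FINISHED, FAILED, CANCELED }

    // High-level outcome only: no checkpoint or savepoint data is held here.
    static final class Result {
        final String applicationId;
        final String name;
        final Status finalStatus;

        Result(String applicationId, String name, Status finalStatus) {
            this.applicationId = applicationId;
            this.name = name;
            this.finalStatus = finalStatus;
        }
    }

    // Render a one-line summary of the outcome.
    static String describe(Result r) {
        return r.name + " (" + r.applicationId + ") ended as " + r.finalStatus;
    }

    public static void main(String[] args) {
        // "app-0001" and "example-app" are placeholder values.
        Result r = new Result("app-0001", "example-app", Status.FINISHED);
        System.out.println(describe(r));
    }
}
```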
| **JobManager** | ||
| The archiving of completed jobs happens on the JobManager, which uploads the archived job information to a file system directory. You can configure the directory to archive completed jobs in [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}) by setting a directory via `jobmanager.archive.fs.dir`. | ||
| The archiving of completed jobs and applications happens on the JobManager, which uploads the archived job and application information to a file system directory. You can configure the directory to archive completed jobs and applications in [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}) by setting a directory via `jobmanager.archive.fs.dir`. |
Is this different from the Application Result Store? Since these are archives, it would be worth contrasting the two if they are different, or referring to them in the same way if they are the same.
The History Server and the Application Result Store are indeed completely different. Here’s the distinction:
- The Application Result Store is an internal mechanism. It stores only minimal information (e.g., application ID and final status) to mark an application as "terminated," preventing it from being incorrectly re-submitted or restarted during a failover.
- The History Server is a user-facing archival tool. It saves detailed information from completed applications/jobs by caching their REST API responses (like /applications/:appid, etc.). This allows users to query and inspect application/job details long after the cluster has shut down.
I have added a brief comparison of the two to the glossary section. Thanks for the feedback!
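The role of the Application Result Store during failover can be illustrated with a small Java sketch. This is an in-memory stand-in under stated assumptions: the class `ResultStoreSketch` and its methods `markTerminated` and `shouldRecover` are hypothetical names, and the real component persists its entries rather than keeping them in a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in; NOT Flink's ApplicationResultStore.
final class ResultStoreSketch {

    // applicationId -> final status. Minimal information only.
    private final Map<String, String> terminated = new ConcurrentHashMap<>();

    // Record that an application has reached a terminal state.
    void markTerminated(String applicationId, String finalStatus) {
        terminated.put(applicationId, finalStatus);
    }

    // During failover, an application with a stored result is not restarted.
    boolean shouldRecover(String applicationId) {
        return !terminated.containsKey(applicationId);
    }
}
```

The History Server plays no part in this decision; it only serves archived details to users after the fact.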
| - `/applications/overview` | ||
| - `/applications/<applicationid>` | ||
| - `/applications/<applicationid>/jobmanager/config` |
Can you see the jobs that were under an application? That seems like the most useful thing to see.
You can indeed see all the jobs that were part of a completed application. The History Server's REST API is designed to mirror the standard JobManager REST API. This means that when you request an application's overview or details, the response naturally includes information about the jobs within it.
To make this clear, I've added a note to the documentation explaining this behavior and have also included a link to the JobManager REST API page for reference on the response format. Thanks!
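As a rough illustration of how a client could address these endpoints, the Java sketch below builds the request URIs for the three paths listed in the diff. The base address `http://localhost:8082` and the application ID `a1b2c3` are placeholders assumed for the example, and the helper class `HistoryServerPathsSketch` is hypothetical; the response format is not modeled here.

```java
import java.net.URI;

// Hypothetical helper that only builds URIs for the paths listed in the diff.
final class HistoryServerPathsSketch {

    // Overview of all archived applications.
    static URI overview(String base) {
        return URI.create(base + "/applications/overview");
    }

    // Details of a single archived application.
    static URI details(String base, String applicationId) {
        return URI.create(base + "/applications/" + applicationId);
    }

    // JobManager configuration recorded for that application.
    static URI jobManagerConfig(String base, String applicationId) {
        return URI.create(base + "/applications/" + applicationId + "/jobmanager/config");
    }

    public static void main(String[] args) {
        String base = "http://localhost:8082"; // assumed address, adjust to your setup
        String appId = "a1b2c3";               // placeholder application ID
        System.out.println(overview(base));
        System.out.println(details(base, appId));
        System.out.println(jobManagerConfig(base, appId));
    }
}
```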
| JobManager High Availability (HA) hardens a Flink cluster against JobManager failures. | ||
| This feature ensures that a Flink cluster will always continue executing your submitted jobs. | ||
| This feature ensures that a Flink cluster will always re-execute your submitted applications that were running at the time of a failure. |
My understanding is that you're asking about the checkpoints of jobs within a re-executed application. When an application is re-executed in HA mode, what happens to its running jobs is determined by the application's own logic in the main method:
- Resumption: If the logic in the main method re-submits the job, it will automatically resume from its latest checkpoint.
- Abandonment: If the application's logic does not re-submit the job, it is considered abandoned. The job will be moved to a FAILED state, and its resources, including all checkpoints, are properly cleaned up.
I've added some explanation to the documentation to make the behavior clear.
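The two outcomes can be modeled with a short Java sketch. It is a simplified model, not Flink code: the class `ReexecutionSketch`, the `Outcome` enum, and the idea of matching jobs by a plain string identifier are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the two outcomes described above; NOT Flink code.
final class ReexecutionSketch {

    enum Outcome { RESUMED_FROM_LATEST_CHECKPOINT, FAILED_AND_CLEANED_UP }

    // jobsBeforeFailure: jobs that were running when the JobManager failed.
    // resubmittedByMain: jobs the re-executed main method submits again.
    static Map<String, Outcome> classify(List<String> jobsBeforeFailure,
                                         Set<String> resubmittedByMain) {
        Map<String, Outcome> outcomes = new LinkedHashMap<>();
        for (String job : jobsBeforeFailure) {
            outcomes.put(job, resubmittedByMain.contains(job)
                    ? Outcome.RESUMED_FROM_LATEST_CHECKPOINT
                    : Outcome.FAILED_AND_CLEANED_UP);
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // "job-a" is re-submitted by main; "job-b" is not and is abandoned.
        System.out.println(classify(List.of("job-a", "job-b"), Set.of("job-a")));
    }
}
```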
| The HA data will be kept until the respective job either succeeds, is cancelled or fails terminally. | ||
| Once this happens, all the HA data, including the metadata stored in the HA services, will be deleted. | ||
| In order to recover submitted applications, Flink persists metadata for the applications. | ||
| The HA data will be kept until the respective application either succeeds, is cancelled or fails terminally. |
I am curious what "fails terminally" means here; some examples would be useful.
What I intended to express is the concept of an application reaching any terminal state. To make this clear, I've updated the documentation to explicitly list the three terminal states. This should be more precise. Thanks!
RocMarshal left a comment:
Hi, @eemario Thank you for the contribution.
LGTM on the whole.
As described in [1], it would be great if links (anchor points) could be added to each heading, according to the documentation requirements.
[1] https://cwiki.apache.org/confluence/display/FLINK/Flink+Translation+Specifications
To keep the current changes simple, it seems preferable to add anchor links only to the headings introduced in this PR. That keeps the scope of changes minimal and focused: it follows the anchor-linking principle without expanding the modified content beyond what is described in the JIRA title.
What do you think?
Hi @RocMarshal,
Hi @eemario, could you help make a backport PR (BP-PR) for release-2.3? Thanks.
Hi @RocMarshal,
What is the purpose of the change
This pull request adds docs for application management.
Brief change log
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation