
[FLINK-39264][docs] Add docs for application management #27818

Merged
RocMarshal merged 3 commits into apache:master from eemario:FLIP560-9
Apr 20, 2026

Conversation

@eemario (Contributor) commented Mar 24, 2026

What is the purpose of the change

This pull request adds docs for application management.

Brief change log

  • Add a new page for application
  • Update outdated descriptions to reflect current functionality

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot (Collaborator) commented Mar 24, 2026

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • `@flinkbot run azure`: re-run the last Azure build

@eemario changed the title from "[FLINK-38972][docs] Add docs for application management" to "[FLINK-39264][docs] Add docs for application management" on Mar 25, 2026
@eemario marked this pull request as ready for review on March 25, 2026 03:51
  Even after all applications are finished, the cluster (and the JobManager) will
  keep running until the session is manually stopped. The lifetime of a Flink
- Session Cluster is therefore not bound to the lifetime of any Flink Job.
+ Session Cluster is therefore not bound to the lifetime of any Flink Application.
Contributor

of any Flink Application. -> of any Flink Application or job.

Contributor Author

I've updated the documentation accordingly.

  * **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
- pre-existing, long-running cluster that can accept multiple job submissions.
+ pre-existing, long-running cluster that can accept multiple application submissions.
  Even after all jobs are finished, the cluster (and the JobManager) will
Contributor

I suggest a hyperlink to the definition of "application" would be useful, or a quick summary.
I wonder if it is still mentioning jobs as well as applications. Or is every job now in an application?

Contributor Author

Regarding the definition of "application," it is actually mentioned at the very beginning of this section (Flink Application Execution) to set the context.
Your understanding is correct: a job is now always submitted by an application and is considered part of it. My take is that the distinction to emphasize here is that a Session Cluster accepts multiple application submissions, which is a key difference from Application Mode.


#### ApplicationResultStore

The ApplicationResultStore is a Flink component that persists the results of terminated
Contributor

It would be worth going into more detail as to what we mean by "results": is this the last checkpoint / savepoint?

Contributor Author

The Application Result primarily contains high-level, final information about the application's execution. This includes its application ID, its final status (e.g., FINISHED, FAILED, CANCELED), and its name. It doesn't refer to the last checkpoint or savepoint, but rather the overall outcome.
Essentially, it's the application-level equivalent of a JobResult. I've added a brief note to clarify this. Thanks!

**JobManager**

- The archiving of completed jobs happens on the JobManager, which uploads the archived job information to a file system directory. You can configure the directory to archive completed jobs in [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}) by setting a directory via `jobmanager.archive.fs.dir`.
+ The archiving of completed jobs and applications happens on the JobManager, which uploads the archived job and application information to a file system directory. You can configure the directory to archive completed jobs and applications in [Flink configuration file]({{< ref "docs/deployment/config#flink-configuration-file" >}}) by setting a directory via `jobmanager.archive.fs.dir`.
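As a sketch of how this option is typically used (the paths below are placeholders for your own deployment, not part of this PR), the JobManager's archive directory is usually pointed at the same location the History Server monitors:

```yaml
# Flink configuration file (sketch; replace the paths with your own)
# Directory where the JobManager uploads archives of completed jobs and applications.
jobmanager.archive.fs.dir: hdfs:///completed-jobs/
# Directory the History Server monitors for new archives.
historyserver.archive.fs.dir: hdfs:///completed-jobs/
```

Pointing both options at the same directory lets the History Server pick up archives as the JobManager writes them.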
Contributor

Is this different from the Application Result Store? As these are archives, it would be worth contrasting the two if they are different, or referring to them in the same way if they are the same.

Contributor Author

The History Server and the Application Result Store are indeed completely different. Here’s the distinction:

  • The Application Result Store is an internal mechanism. It stores only minimal information (e.g., application ID and final status) to mark an application as "terminated," preventing it from being incorrectly re-submitted or restarted during a failover.
  • The History Server is a user-facing archival tool. It saves detailed information from completed applications/jobs by caching their REST API responses (like /applications/:appid, etc.). This allows users to query and inspect application/job details long after the cluster has shut down.

I have added a brief comparison of the two to the glossary section. Thanks for the feedback!


- `/applications/overview`
- `/applications/<applicationid>`
- `/applications/<applicationid>/jobmanager/config`
Contributor

can you see the jobs that were under an application? This would seem to be the most useful thing you would want to see.

Contributor Author

You can indeed see all the jobs that were part of a completed application. The History Server's REST API is designed to mirror the standard JobManager REST API. This means that when you request an application's overview or details, the response naturally includes information about the jobs within it.

To make this clear, I've added a note to the documentation explaining this behavior and have also included a link to the JobManager REST API page for reference on the response format. Thanks!
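As a toy illustration of the paths involved (this is not part of the PR; the base URL is a placeholder for wherever your History Server is reachable), one could assemble the endpoints to query for a completed application like this:

```python
# Sketch: build History Server REST paths for a completed application.
# BASE_URL is an assumption about your deployment; the path segments
# mirror the endpoints quoted above.
BASE_URL = "http://localhost:8082"  # hypothetical History Server address

def application_endpoints(app_id: str) -> list[str]:
    """Return the REST URLs one would query for a completed application."""
    return [
        f"{BASE_URL}/applications/overview",
        f"{BASE_URL}/applications/{app_id}",
        f"{BASE_URL}/applications/{app_id}/jobmanager/config",
    ]
```

Since the responses mirror the JobManager REST API, fetching the application's details also surfaces the jobs that ran inside it.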


  JobManager High Availability (HA) hardens a Flink cluster against JobManager failures.
- This feature ensures that a Flink cluster will always continue executing your submitted jobs.
+ This feature ensures that a Flink cluster will always re-execute your submitted applications that were running at the time of a failure.
Contributor

what about checkpoints?

Contributor Author

My understanding is that you're asking about the checkpoints of jobs within a re-executed application. When an application is re-executed in HA mode, what happens to its running jobs is determined by the application's own logic in the main method:

  • Resumption: If the logic in the main method re-submits the job, it will automatically resume from its latest checkpoint.
  • Abandonment: If the application's logic does not re-submit the job, it is considered abandoned. The job will be moved to a FAILED state, and its resources, including all checkpoints, are properly cleaned up.

I've added some explanation to the documentation to make the behavior clear.

  In order to recover submitted applications, Flink persists metadata for the applications.
- The HA data will be kept until the respective job either succeeds, is cancelled or fails terminally.
+ The HA data will be kept until the respective application either succeeds, is cancelled or fails terminally.
  Once this happens, all the HA data, including the metadata stored in the HA services, will be deleted.
Contributor

I am curious what "fails terminally" might mean; some examples of this would be useful.

Contributor Author

What I intended to express is the concept of an application reaching any terminal state. To make this clear, I've updated the documentation to explicitly list the three terminal states. This should be more precise. Thanks!

Contributor

@RocMarshal left a comment

Hi, @eemario Thank you for the contribution.

LGTM on the whole.

As described in [1], it would be great if links (anchor points) could be added to each heading according to the documentation requirements.


[1] https://cwiki.apache.org/confluence/display/FLINK/Flink+Translation+Specifications

To keep the current changes simple, it seems preferable to add anchor links only to the headings introduced in this PR. This helps maintain a minimal and focused scope of changes—adhering to the anchor-linking principle without expanding the modified content beyond what is described in the JIRA title.

WDYT?

@eemario (Contributor Author) commented Apr 20, 2026

Hi @RocMarshal ,
Thanks for the suggestion! Agreed — I've added anchor links to the headings introduced in this PR, following the translation specifications.

Contributor

@RocMarshal left a comment

LGTM +1.
Merging...

@RocMarshal merged commit d49eb62 into apache:master on Apr 20, 2026
@RocMarshal (Contributor)

Hi @eemario, could you help make a BP-PR for release-2.3? Thanks!

@eemario (Contributor Author) commented Apr 20, 2026

Hi @RocMarshal ,
Thanks for the review and merge! The BP-PR is ready #27977.
