[LIVY-11] Enable HA support #212
Codecov Report
@@ Coverage Diff @@
## master #212 +/- ##
============================================
- Coverage 68.12% 67.64% -0.48%
- Complexity 960 980 +20
============================================
Files 104 106 +2
Lines 5946 6068 +122
Branches 899 911 +12
============================================
+ Hits 4051 4105 +54
- Misses 1314 1378 +64
- Partials 581 585 +4
|
|
How does this PR relate to PR #189? |
|
This PR tackles a different feature from the one in PR #189. We're mainly concerned with enabling active/standby HA in systems that can run Livy Server instances on multiple nodes. |
|
Do you need a proxy-like thing to find the latest active node? |
|
No, it's handled by Apache Curator. The latest active node (the leader) is recorded in the ZooKeeper state store and is therefore known to all other servers, which are coordinated via that state store. |
val haKeyPrefix = livyConf.get(HA_KEY_PREFIX_CONF)
val retryValue = livyConf.get(HA_RETRY_CONF)
// a regex to match patterns like "m, n" where m and n are both integer values
val retryPattern = """\s*(\d+)\s*,\s*(\d+)\s*""".r
Would it be easier to have two config values, e.g. retry_count and sleep_between_retries_ms?
|
I believe the two configs should be paired together in the config for clarity. Also, we're following the example in ZooKeeperStateStore.scala.
|
Hi,
just FYI:
I have moved this configuration to ZooKeeperManager in #189 and replaced it with two separate configs.
Perhaps we should merge this PR after #189, or, if this PR is merged first, I will refactor my code to use your approach, so that we don't do the same work (like the configuration refactoring) twice. That said, I think two separate properties are clearer than parsing one property.
|
Sounds good, we can adjust based on whichever PR is merged first.
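For readers comparing the two options discussed in this thread, here is a minimal sketch in plain Scala. It reuses the regex from the diff for the combined "m, n" form and reads two separate keys for the alternative. `RetrySettings`, `RetryConfigParser`, and the use of a plain `Map` instead of `livyConf` are hypothetical, and the key names are only the ones floated in the review comment, not actual Livy configuration keys.

```scala
import scala.util.Try

// Hypothetical holder for the two retry settings discussed in this thread.
final case class RetrySettings(retryCount: Int, sleepBetweenRetriesMs: Int)

object RetryConfigParser {
  // Same regex as in the diff: matches "m, n" where m and n are integers.
  private val retryPattern = """\s*(\d+)\s*,\s*(\d+)\s*""".r

  // Option A: one combined value such as "5, 100".
  // A Regex used as a pattern must match the whole string, so "5" alone is rejected.
  def fromCombined(value: String): Option[RetrySettings] = value match {
    case retryPattern(count, sleepMs) =>
      // toInt can still fail if the digits overflow an Int.
      Try(RetrySettings(count.toInt, sleepMs.toInt)).toOption
    case _ => None
  }

  // Option B: two separate keys (key names are assumptions taken from the review comment).
  def fromSeparate(conf: Map[String, String]): Option[RetrySettings] =
    for {
      count   <- intValue(conf, "retry_count")
      sleepMs <- intValue(conf, "sleep_between_retries_ms")
    } yield RetrySettings(count, sleepMs)

  private def intValue(conf: Map[String, String], key: String): Option[Int] =
    conf.get(key).flatMap(v => Try(v.trim.toInt).toOption)
}
```

Both forms end up with the same two integers. The combined form keeps the pair together in one place, while the separate keys avoid the regex and let each value be validated and documented on its own.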
|
Nice! |
|
How do the clients know the latest active node? Do they need to ask Apache Curator? If so, do we need to update the client code? When people use a RESTful client (e.g. curl) to communicate with Livy, should they query ZooKeeper every time? |
|
Another question: if the current leader crashes and a new leader is elected, should the session state from the old leader be restored in the new leader? |
The code does not synchronize session state. Is it a good choice to put it in ZooKeeper?
#189 implements Livy Server discovery, which lets a client obtain the current Livy Server address for use in REST requests. There is a Java/Scala API for it. Clients written in other languages should ask ZooKeeper for the address. |
|
When HA mode is enabled, the following happens. When a new Livy Server starts, it uses Curator to connect to the ZooKeeper state store and tries to acquire leadership, waiting until it succeeds. The leader latch library ensures that only one Livy Server among those connected to that state store is elected leader at any point. Once a server acquires leadership, it starts the Livy web server the same way it is started today, recovering session state if recovery is configured and applying all other associated configuration. Ideally the client doesn't need to contact the ZooKeeper server at all, since you would have a load balancer or some sort of failover controller that eventually links the client to the actively running Livy instance. This solution is primarily concerned with ensuring that, given a group of Livy Servers deployed in HA mode, exactly one of them is active and running at any one time. |
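The startup flow described above can be sketched roughly as follows, using Curator's `LeaderLatch` recipe. This is an illustration under stated assumptions, not the code in this PR: `LeaderGate`, `build`, `runWhenLeader`, and the `startServer` callback are hypothetical names, the sketch assumes `curator-recipes` is on the classpath, and it assumes `startServer` blocks for as long as the server runs.

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.RetryNTimes

// Hypothetical helper, not the PR's CuratorElectorService.
object LeaderGate {
  // Builds, but does not start, a Curator client and a leader latch on latchPath.
  def build(zkUrl: String, latchPath: String, retries: Int, sleepMs: Int): (CuratorFramework, LeaderLatch) = {
    val client = CuratorFrameworkFactory.newClient(zkUrl, new RetryNTimes(retries, sleepMs))
    (client, new LeaderLatch(client, latchPath))
  }

  // Blocks until this process holds leadership, then runs startServer.
  // Assumption: startServer blocks while the server is running.
  def runWhenLeader(client: CuratorFramework, latch: LeaderLatch)(startServer: () => Unit): Unit = {
    client.start()
    try {
      latch.start()        // join the election
      try {
        latch.await()      // returns only once leadership is acquired
        startServer()
      } finally {
        latch.close()      // relinquish leadership so a standby can take over
      }
    } finally {
      client.close()
    }
  }
}
```

Every standby instance would sit in `latch.await()` until the current leader's latch is closed or its ZooKeeper session expires, which is what gives the "exactly one active server" property described in the comment.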
Session state recovery is already implemented: you can set livy.server.recovery.state-store to "zookeeper" and livy.server.recovery.state-store.url to the ZooKeeper quorum. This PR mainly concerns the leader election part for the active-passive Livy Server. |
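As a sketch of what that looks like in livy.conf: the two state-store keys come from the comment above, while the host names are placeholders. The `livy.server.recovery.mode` line is an assumption based on the stock Livy configuration template, where recovery is switched off unless this key is set, so check it against your Livy version.

```properties
# Assumption: recovery has to be enabled in addition to choosing a state store.
livy.server.recovery.mode = recovery
# Keep session state in ZooKeeper (key named in the comment above).
livy.server.recovery.state-store = zookeeper
# ZooKeeper quorum; these host names are placeholders.
livy.server.recovery.state-store.url = zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```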
|
@jerryshao small ping for a review |
|
@jerryshao Another ping |
…to redirect to the corresponding address
|
@jerryshao pinging again |
|
@jerryshao another ping |
|
Taking over the pull request with @RogPodge's permission. A bug involving query omission on failover was fixed, along with a bit of minor refactoring and renaming. I'm not too familiar with the code review structure of this repo, but can someone take a look at the PR and its changes? (Pinging @jerryshao based on prior messages as well.) |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #212 +/- ##
============================================
- Coverage 68.12% 67.87% -0.26%
+ Complexity 960 856 -104
============================================
Files 104 105 +1
Lines 5946 6057 +111
Branches 899 909 +10
============================================
+ Hits 4051 4111 +60
- Misses 1314 1381 +67
+ Partials 581 565 -16
|
|
This pull request has been automatically marked as stale because it has had no activity for at least 3 months. If you are still working on this change or plan to move it forward, please leave a comment or push a new commit so we know to keep it open. Otherwise, this PR will be closed automatically in about one month. Thank you for your contribution to Apache Livy! |
What changes were proposed in this pull request?
This pull request enables Active/Passive HA in Livy through the use of ZooKeeper. The CuratorElectorService class coordinates leadership election between multiple Livy instances.
https://issues.apache.org/jira/browse/LIVY-11
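To illustrate how a class like this can react to leadership changes, here is a small sketch built on Curator's `LeaderLatchListener` interface. It is not the PR's CuratorElectorService: `LeadershipTracker` and its two callbacks are hypothetical, and the sketch assumes `curator-recipes` is on the classpath.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.curator.framework.recipes.leader.LeaderLatchListener

// Hypothetical listener: runs a callback on each leadership transition.
class LeadershipTracker(onElected: () => Unit, onRevoked: () => Unit) extends LeaderLatchListener {
  private val leader = new AtomicBoolean(false)

  def isCurrentlyLeader: Boolean = leader.get()

  // Called by Curator when this instance gains leadership.
  // compareAndSet guards against firing the callback twice for the same state.
  override def isLeader(): Unit =
    if (leader.compareAndSet(false, true)) onElected()

  // Called by Curator when leadership is lost, e.g. after a connection loss.
  override def notLeader(): Unit =
    if (leader.compareAndSet(true, false)) onRevoked()
}
```

A tracker like this would be registered with `latch.addListener(tracker)` before `latch.start()`, with `onElected` starting the web server and `onRevoked` stopping it or exiting the process.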
How was this patch tested?
unit tests
manual verification using 2 Livy instances