NetworkDB
=========

There are two databases used in libnetwork:

- A persistent database that stores the network configuration requested by the user. This is typically the SwarmKit managers' raft store.
- A non-persistent peer-to-peer gossip-based database that keeps track of the current runtime state. This is NetworkDB.

NetworkDB is based on the [SWIM][] protocol, which is implemented by the [memberlist][] library.
`memberlist` manages cluster membership (nodes can join and leave), as well as message encryption.
Members of the cluster send each other ping messages from time to time, allowing the cluster to detect when a node has become unavailable.

The information held by each node in NetworkDB is:

- The set of nodes currently in the cluster (plus nodes that have recently left or failed).
- For each peer node, the set of networks to which that node is connected.
- For each of the node's currently-in-use networks, a set of named tables of key/value pairs.
  Note that nodes only keep track of tables for networks to which they belong.

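The three kinds of state above can be pictured as nested maps. The following is a minimal sketch; the type and field names are invented for illustration and are not libnetwork's actual types:

```go
package main

import "fmt"

// nodeState sketches what each node holds: cluster members, each peer's
// network memberships, and per-network named tables of key/value pairs.
type nodeState struct {
	nodes    map[string]struct{}                     // nodes currently in the cluster
	networks map[string]map[string]struct{}          // node name -> networks it has joined
	tables   map[string]map[string]map[string][]byte // network -> table -> key -> value
}

func newNodeState() *nodeState {
	return &nodeState{
		nodes:    map[string]struct{}{},
		networks: map[string]map[string]struct{}{},
		tables:   map[string]map[string]map[string][]byte{},
	}
}

func main() {
	s := newNodeState()
	s.nodes["node-1"] = struct{}{}
	s.networks["node-1"] = map[string]struct{}{"net-A": {}}
	// Tables exist only for networks this node belongs to.
	s.tables["net-A"] = map[string]map[string][]byte{
		"endpoint_table": {"svc1": []byte("10.0.0.2")},
	}
	fmt.Println(string(s.tables["net-A"]["endpoint_table"]["svc1"])) // 10.0.0.2
}
```
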
Updates spread through the cluster from node to node, and nodes may have inconsistent views at any given time.
They will eventually converge (quickly, if the network is operating well).
Nodes look up information using their local NetworkDB instance. Queries are not sent to remote nodes.

NetworkDB does not impose any structure on the tables; they are just maps from `string` keys to `[]byte` values.
Other components in libnetwork use the tables for their own purposes.
For example, there are tables for service discovery and load balancing,
and the [overlay](overlay.md) driver uses NetworkDB to store routing information.
Updates to a network's tables are only shared between nodes that are on that network.

All libnetwork nodes join the gossip cluster.
To do this, they need the IP address and port of at least one other member of the cluster.
In the case of a SwarmKit cluster, for example, each Docker engine will use the IP addresses of the swarm managers as the initial join addresses.
The `Join` method can be used to update these bootstrap IPs if they change while the system is running.

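A highly simplified sketch of that bootstrap behaviour: `Join` replaces the stored bootstrap list, and the node then tries each address until one answers. The `cluster` type and `connect` helper here are invented; they only illustrate the replace-then-retry shape, not NetworkDB's real implementation:

```go
package main

import "fmt"

// cluster holds the bootstrap addresses used to reach the gossip cluster.
type cluster struct {
	bootstrap []string
}

// Join replaces the bootstrap list, so changed manager IPs take effect.
func (c *cluster) Join(addrs []string) {
	c.bootstrap = addrs
}

// connect tries each bootstrap address in order; reachable stands in for
// an actual network dial.
func (c *cluster) connect(reachable func(string) bool) (string, bool) {
	for _, a := range c.bootstrap {
		if reachable(a) {
			return a, true
		}
	}
	return "", false
}

func main() {
	c := &cluster{}
	c.Join([]string{"10.0.0.1:7946", "10.0.0.2:7946"})
	// Only the second manager is reachable in this run.
	addr, ok := c.connect(func(a string) bool { return a == "10.0.0.2:7946" })
	fmt.Println(addr, ok) // 10.0.0.2:7946 true
}
```
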
When joining the cluster, the new node will initially synchronise its cluster-wide state (known nodes and networks, but not tables) with at least one other node.
The state will be mostly kept up-to-date by small UDP gossip messages, but each node will also periodically perform a push-pull TCP sync with another random node.
In a push-pull sync, the initiator sends all of its cluster-wide state to the target, and the target then sends all of its own state back in response.

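The push-pull exchange can be sketched as two merges: the initiator's state reaches the target, and the target's (now merged) state comes back. Real memberlist state carries incarnation numbers and more; a plain set union stands in for that merge here:

```go
package main

import "fmt"

// state is a stand-in for cluster-wide state: the set of known node names.
type state map[string]struct{}

// merge unions src into dst.
func merge(dst, src state) {
	for k := range src {
		dst[k] = struct{}{}
	}
}

// pushPull models one round: push the initiator's state to the target,
// then pull the target's state back.
func pushPull(initiator, target state) {
	merge(target, initiator) // push
	merge(initiator, target) // pull
}

func main() {
	a := state{"node-a": {}}
	b := state{"node-b": {}, "node-c": {}}
	pushPull(a, b)
	fmt.Println(len(a), len(b)) // 3 3: both sides now know all three nodes
}
```
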
Once part of the gossip cluster, a node will also send a `NodeEventTypeJoin` message, which is a custom message defined by NetworkDB.
This message is no longer strictly needed, but it is kept for backwards compatibility with nodes running older versions.

While a node is active in the cluster, it can join and leave networks.
When a node wants to join a network, it will send a `NetworkEventTypeJoin` message via gossip to the whole cluster.
It will also perform a bulk sync of the network-specific state (the tables) with every other node on the network being joined,
allowing it to get all the network-specific information quickly.
The tables will mostly be kept up-to-date by UDP gossip messages between the nodes on that network, but
each node in the network will also periodically do a full TCP bulk sync of the tables with another random node on the same network.

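A bulk sync has to decide, per entry, which side's version wins. A sketch of that merge using logical (Lamport-style) times, with the entry shape and field names invented for illustration:

```go
package main

import "fmt"

// entry is a table entry stamped with a logical time; higher means newer.
type entry struct {
	ltime uint64
	value []byte
}

// mergeTables applies remote entries that are missing locally or newer
// than the local version; older remote versions are ignored.
func mergeTables(local, remote map[string]entry) {
	for k, re := range remote {
		if le, ok := local[k]; !ok || re.ltime > le.ltime {
			local[k] = re
		}
	}
}

func main() {
	local := map[string]entry{"vtep/node1": {ltime: 3, value: []byte("old")}}
	remote := map[string]entry{
		"vtep/node1": {ltime: 5, value: []byte("new")},   // newer: wins
		"vtep/node2": {ltime: 1, value: []byte("added")}, // missing locally: added
	}
	mergeTables(local, remote)
	fmt.Println(string(local["vtep/node1"].value), len(local)) // new 2
}
```
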
Note that there are two similar, but separate, gossip-and-periodic-sync mechanisms here:

1. memberlist-provided gossip and push-pull sync of cluster-wide state, involving all nodes in the cluster.
2. networkdb-provided gossip and bulk sync of network tables, for each network, involving just those nodes in that network.

When a node wishes to leave a network, it will send a `NetworkEventTypeLeave` via gossip. It will then delete the network's table data.
When a node hears that another node is leaving a network, it deletes all table entries belonging to the leaving node.
Deleting an entry here means marking it as deleted (a tombstone) for a while, so that older events about it that arrive later can be detected and ignored.

When a node wishes to leave the cluster, it will send a `NodeEventTypeLeave` message via gossip.
Nodes receiving this will mark the node as "left".
The leaving node will then send a memberlist leave message too.
If we receive the memberlist leave message without first getting the `NodeEventTypeLeave` one, we mark the node as failed (for a while).
Every node periodically attempts to reconnect to failed nodes; on success, it does a push-pull sync of cluster-wide state,
sends the node a `NodeEventTypeJoin`, and then does a bulk sync of network-specific state for all networks the two have in common.

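The left-versus-failed distinction above can be sketched as a small state machine: a memberlist leave preceded by a `NodeEventTypeLeave` means a clean departure; without it, the node is treated as failed and becomes a reconnect candidate. The `tracker` type is invented for illustration:

```go
package main

import "fmt"

type nodeStatus int

const (
	active nodeStatus = iota
	left   // clean departure: NodeEventTypeLeave seen before memberlist leave
	failed // unclean: memberlist leave (or failure) without NodeEventTypeLeave
)

type tracker struct {
	status      map[string]nodeStatus
	gotLeaveMsg map[string]bool // saw NodeEventTypeLeave via gossip
}

func (t *tracker) onNodeEventLeave(n string) { t.gotLeaveMsg[n] = true }

func (t *tracker) onMemberlistLeave(n string) {
	if t.gotLeaveMsg[n] {
		t.status[n] = left
	} else {
		t.status[n] = failed // keep as a candidate for periodic reconnects
	}
}

func main() {
	tr := &tracker{
		status:      map[string]nodeStatus{"a": active, "b": active},
		gotLeaveMsg: map[string]bool{},
	}
	tr.onNodeEventLeave("a") // "a" announces its departure first
	tr.onMemberlistLeave("a")
	tr.onMemberlistLeave("b") // "b" vanished without announcing
	fmt.Println(tr.status["a"] == left, tr.status["b"] == failed) // true true
}
```
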
[SWIM]: http://ieeexplore.ieee.org/document/1028914/
[memberlist]: https://github.com/hashicorp/memberlist