The Discovery Algorithm ======================= A few notes about the discovery algorithm used in cluster breakup to determine the best fully-connected subset of surviving machines for forming the new cluster. We have a value function, and we want to achieve three things: * Decide on which nodes has the best value function; * Propagate all connectivity maps to that node; and * Propagate that node's cluster reconfiguration decision to the rest of the nodes, even those which are only indirectly connected to it. Consider the network of nodes/value functions: Example D1: Node: A-------B-------C-------D-------E Value: 4 1 3 2 5 The nodes B and D both see themselves as local minima for the value function and both see a local maximum nearby. How do we propagate the appropriate value functions across the network, without using an algorithm which introduces too much traffic in a still-fully-connected network? Whatever the algorithm, we need to have a termination condition. We decide to terminate once a node is able to determine that it is the Elector, ie. that it has the best value function anywhere in the graph. We can only make this decision at any node once that node: * Posesses the value functions for every point in the graph, and * Is sure that it has seen the entire graph. To know that, it must have seen the connectivity maps for every node in the network, and must have checked that every neighbour mentioned in each of those maps has also provided a map. Every node knows, at all times, the state of each of its Links to neighbours and the names of those neighbours, so no special communications step is needed at a node to determine the node's own connectivity map. Assume that we start the discovery algorithm by, on each node, telling each of our neighbours what our value function is. Each node then sends its local connectivity map to the neighbour with the best value function. In a fully connected cluster, the value exchange is O(N**2) and the map exchange is O(N). That's OK for now --- the RESET traffic when we do a cluster abort is O(N**2) anyway, and the value packets are small, so this is only a small incremental cost over the RESET traffic. Now, we want to prevent any further unnecessary traffic in the common cases where the network is largely connected. We decide that only nodes which are best-so-far --- nodes whose connectivity maps indicate that it has the best value of any nodes seen so far --- may progress the discovery algorithm further. They do so by identifying "edge" nodes in the transitive closure of the locally-held connectivity maps --- nodes which are mentioned as neighbours in the known connectivity maps but whose own maps are not yet known. A best-so-far node proceeds by sending out routed "ping" messages to those edge nodes, using the already-known connectivity maps to provide the route. The ping messages include the sender's value function and the list of nodes which the message must pass through on the way to the edge node. On receiving such a packet, an edge node (identified because we have reached the last hop in the route) sends back its local connectivity graph (plus any other connectivity maps it may happen to have). Every other node on the route just increments the hop count in the packet. We maintain the full route in the packet when we pass it, to allow the reply also to be routed. For every such ping packet, each node updates its local copy of the "highest value seen" value function by maximising it with the sender of the ping. Any ping with a lower value function is dropped. As a result, if two locally-maximal nodes (A and E in example D1 above) both start the flood-fill edge expansion algorithm, intermediate nodes (like C above) will eventually detect the "better" of the two nodes (E), and will stop A's ping packets from propogating further. Each cycle of edge expansion proceeds by the best-so-far node pinging all edge nodes and waiting for all responses. The ping dropping described in the previous paragraph ensures that this process gets terminated whenever any node detects that some locally-maximal best-so-far node is actually not the best node in the cluster. The entire procedure terminates once a best-so-far node has no more edge nodes, and by this time we know for sure that it has found the entire connectivity graph. Other locally-maximal nodes (like A above) may still be waiting for ping responses, but when the newly-determined Elector starts the next phase of the cluster transition, those nodes will break out of that state in response to a reconfigure command from the Elector.