gms 3.3 spec

Table of Contents GMS Overview Eucalyptus & GMS Bootstrap & Host Discovery Host state information Host Joins & Leaves Network Partitions & Merges Implementation Gossip Problems Network Problems Protocol Problems Tradeoffs Network Capability Validation

GMS Overview

Eucalyptus uses ^[1] as a virtually synchronous^[2] Group Membership System (GMS) ^[3] to bootstrap and handle host membership in the system. Initially, isolated hosts are running a eucalyptus-cloud process and need to become a part of the distributed system. This act of joining the group then triggers a reconfiguration of the system, as a whole, based on what services are registered on the newly joining host (e.g., enable the Walrus service; sync the DB; etc.). Conversely, when the process is stopped/killed or the host shuts down a complementary leave of the group occurs for the host.

Lastly, various faults can impact a host's ability to be useful for processing requests. Host-level failure detection is driven by jgroups. Faults can cause a running eucalyptus-cloud process to become unreachable from other hosts in the system (a partition), unable to receive messages (e.g., null route, NIC failure, MTU problem), or cause hosting node to fail. For each of these the system must react to remove that host from the system, stop depending on the service which had been running there, and potentially promote services running elsewhere as replacements.

More background on group membership systems is available in "A History of the Virtual Synchrony Replication Model"^[4].

Eucalyptus & GMS

In Eucalyptus we depend upon the jgroups GMS to do a number of important things:

Bootstrap & Host Discovery

When starting the eucalyptus-cloud process we don't tell it anything about the rest of the system. Localized configuration of distributed state is intentionally avoided. Instead, the process uses IGMPv4 multicast to join a multicast group specific to that cloud installation. The effect is to store the state of the hosts in the system "in the network" (as opposed to on any particular host).

Host state information

As part of the GMS information we carry three bits along with each host in the system:

Has the host started up all the way yet? That is, is the web services stack up, etc.; can it respond to web-services requests.
Does the host run a database?
If it does run a database, is it synchronized?

Host Joins & Leaves

When a host joins or leaves a Eucalyptus system some kind of action results. Based on the above host state information the system will reconfigure itself appropriately:

Host w/o database joins the group: it will setup database connections to all the hosts which do run a database.
Host w database joins the group: there are two cases.
1. There is no other host with a database: the newly joined host will start up normally.
2. There is another host with a database: all hosts will block database operations, the newly joined host will synchronize from the other host with the database, it will mark its state as synchronized when that completes, then all hosts will setup connections to the new database.
Host w/o a database leaves the group: if the host was running an ENABLED service the system will try to perform a failover if a spare is available.
Host w/ a database leaves the group: everyone tears down their database connections to that host.

Network Partitions & Merges

It is the nature of networks that message delivery is not reliable. Devices fail, misconfiguration occurs, cables get pulled. When the network stops delivering messages which are going between two hosts w/in the Eucalyptus system a partition has occurred. The GMS will detect the faulty host and remove the host from group. This brings the system into a state with two independently operating regimes. Subsequent repairs to the network will then restore communication between the affected hosts resulting in a merge. At this point, the GMS will provide each of the hosts with a view of the system before the merge having at least two subgroups. Each host determines whether or not the merge would result in a potential inconsistency (e.g., when two CLCs are merged back together one of them must stop accepting writes).

Implementation

Eucalyptus uses jgroups ^[1] which is very configurable and in particular affects two characteristics of the system.

Host bootstrap is Discovery vs. Configured: When
Failure detection possible with Stateless vs. Stateful protocols:

The possible configurations are then:

Multicast/UDP: Discovered & Stateless
Unicast/UDP: Configured & Stateless
Unicast/TCP: Configured & Stateful
Gossip/TCP: Partially Configured & Stateful

Gossip

A configuration option is to use a gossip server to add new group members. The gossip server maintains a list of the current members. New members contact the gossip server, which then distributes updated membership information to the new member as well as the rest of the group.

Gossip servers are somewhat more complex to manage but work in cases where cluster membership changes regularly and multicast is not available. The additional complexity comes from needing to configure the system, but has the attribute of only needing to have the possible coordinator hosts (CLCs) configured.

That said, gossip server failures can cause problems with membership management.

Additionally, having multiple GossipRouters would be implied. This has the characteristic of allowing for "split-brain" scenarios where hosts are able to reach different CLCs (e.g., during a network partition).

Problems

Network Problems

Misconfigurations which make a host unreachable.
Misconfigurations which make a host reachable but it cannot respond.
Misconfigurations which make a host unreachable but it can send.
Multicast throttling or blocking.
MTU misconfigurations.
Null routing/black hole.
Stateful connection time outs.

Protocol Problems

Configured: Inconsistent host addresses on different nodes will lead to unpredictable system behavior. Identifying this as the root cause is difficult. Verifying this isn't possible when it is misconfigured. Changing the system requires they be changed consistently as new hosts are added or old hosts are removed across all hosts in the system.
Stateful: If a fault causes the process to be unreachable but the TCP sockets are not closed reacting involves waiting for the socket's keepalive timeout and this might take 2 hrs.

Tradeoffs

The impact of the above problems on system behaviour varies across configuration options.

Unicast and Gossip configurations: Asymmetric misconfigurations impact membership.
Only Multicast configurations: Throttling/blocking impacts membership.
Only TCP configurations: Stateful connection time outs, Null routing/black holes, and MTU misconfigurations impact failure detection and membership.

Notably, The the first 5 network problems can be tested for directly and remedied before a deployment. The last two (#6 and #7) can be tested for, but are also inherent in the underlying protocol and may occur w/o a system or network administration mistake to trigger them.