After thinking about some of the problem scenarios people had mentioned in my first RFC (thank you), I've devised a new version which uses a much more robust conflict resolution and repair component. The bad news is that the proposal is a lot more complex than before. The good news is that most of the complexity is stuff that was already part of the plan for doing asynchronous wide-area replication in a future release, so doing this stuff now actually gives us a head start on that. As always, please comment freely so we can get this right and make things better sooner rather than later.
# HekaFS Improved Replication #
## Background and Requirements ##
One of the most serious internal complaints about GlusterFS is performance for small synchronous requests when using its filesystem-level replication (AFR). This problem particularly afflicts virtual-machine-image and database workloads, reducing performance to about a third of what it "should" be (compared on a per-server basis to NFS on the same hardware). The fundamental problem is that the AFR approach to making writes crash-proof involves the following operations:
1. Lock on the primary (first) server
2. Record operation-pending state (using extended attributes) on all servers
3. Issue write to all servers
4. As writes complete, update operation-pending state on other servers
5. Unlock on primary server
Even with some operations in parallel, this requires a minimum of five network round trips to/from the primary server - possibly more as step 4 might be repeated if there are more than two replicas. Even with pending changes to AFR, such as coalescing step 4 updates, AFR's per-request latency is likely to remain terrible.
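Just to make the cost concrete, here's a rough Python sketch of that pessimistic sequence - purely illustrative, with placeholder method names rather than real AFR/GlusterFS calls:

```python
# Hypothetical sketch of the AFR-style pessimistic write path described above.
# The server methods are placeholders, not real GlusterFS calls; the point is
# that every numbered step is at least one more network round trip.
def afr_style_write(servers, request):
    primary = servers[0]
    primary.lock(request.target)                      # 1. lock on the primary server
    for s in servers:
        s.set_pending(request.target)                 # 2. record operation-pending xattrs everywhere
    results = [s.write(request) for s in servers]     # 3. issue the write to all replicas
    for s, ok in zip(servers, results):
        if ok:
            s.clear_pending(request.target)           # 4. update pending state as writes complete
    primary.unlock(request.target)                    # 5. unlock on the primary server
    return all(results)
```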
Externally, users seem to focus on a different problem: the timeliness and observability of replica repair after a server has failed and been restored[1][2]. AFR was built on the assumption that on-demand repair of individual files or directories as they're accessed would be sufficient. The message from users ever since has been unequivocal: leaving unknown numbers of unrepaired files vulnerable to a second failure for an indefinite period is unacceptable. These users require immediate repair with explicit notification of return to a fully protected state, but here they run into a second snag: the time required to do a full xattr scan of a multi-terabyte filesystem through a single node is also unacceptable. Patches were submitted almost a year ago[3] to implement precise recovery by maintaining a list of files that are partially written and might therefore require repair, but those have never been adopted.

The recently introduced "proactive self heal" functionality is only slightly better. It is triggered automatically and runs inside one of the server daemons - avoiding many machine-to-machine and user-to-kernel round trips - but it's still single-threaded and drags all data through one server that might be neither source nor destination. Worse, if a second failure occurs while the lengthy repair process for a previous failure is still ongoing, a new repair cycle will be scheduled but might not even start for days while the previous repair scans millions of perfectly healthy files.
The primary requirements, therefore, are:
* Improve performance for synchronous small requests
* Provide efficient "minimal" replica repair with a positive indication of replica status
In addition to these requirements, compatibility with planned enhancements to distribution and wide-area replication would also be highly desirable.
## Proposed Solution ##
The origin of AFR's performance problems is that it requires extra operations (beyond the necessary N writes) in the non-failure case to ensure correct operation in the failure case. The basis of the proposed solution is therefore to be optimistic instead of pessimistic, expending minimal resources in the normal case and taking extra steps only after a failure. The basic write algorithm (which will be extended later in this document so don't get too attached to it) becomes:
1. Forward the write to all N replicas
2. If all N replicas indicate success, we're done
3. If any replica fails, add information about the failed request (e.g. file, offset, length) to journals on the replicas where it succeeded
4. When a failed server restarts, defer completion of its startup until it has been brought up to date by replaying its peers' journals
Because the process relies on a journal, there's no need to maintain a separate list of files in need of repair; journal contents can be examined at any time, and if they're empty (the normal case) that serves as a positive indication that the volume is in a fully protected state.
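As a rough sketch (again purely illustrative - journal_add() and replay_journal() are placeholder names, not existing calls), the normal path costs nothing beyond the N writes:

```python
# Illustrative sketch of the optimistic write path. Only a failure adds work:
# a journal entry describing the request is left on the replicas where it succeeded.
def optimistic_write(servers, request):
    results = [s.write(request) for s in servers]             # 1. forward to all N replicas
    if all(results):
        return True                                           # 2. all succeeded: nothing else to do
    failed = [s for s, ok in zip(servers, results) if not ok]
    for s, ok in zip(servers, results):
        if ok:                                                # 3. journal (file, offset, length) where it worked
            s.journal_add(request.describe(), needed_by=failed)
    return any(results)

def startup(server, peers):
    # 4. a restarted server completes startup only after peers replay their journals into it
    for peer in peers:
        peer.replay_journal(into=server)
```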
Doing repair as part of the startup process means that, if the failure is a network partition rather than a server failure[4], then neither side will go through the startup process. Each server must therefore initiate repair upon being notified of another server coming up as well as during startup. Journal entries are pushed rather than pulled, from the servers that have them to the newly booted or reconnected server. Each server must also be a client, both to receive peer-status notifications (which currently go only to clients) and to issue journal-related requests.
In the case of a network partition, a second problem also arises: split brain. Writes might continue to be received and entered into the journal on both sides of the partition. When journal entries are being propagated in both directions between two servers, establishing the correct combined order for writes that overlap would require additional information (e.g. version vectors) not currently present in the GlusterFS network protocol. Part of the solution to this is to enforce quorum as has already been suggested[5] and implemented for AFR. Only a client which can contact a majority of servers, or exactly half with the first server as part of the set, will even attempt a write. The other part of the solution to network-partition problems is the same as for wide-area replication: vector clocks and a robust conflict-resolution method. These will be described in a subsequent section.
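For concreteness, the quorum rule by itself looks roughly like this (illustrative names only):

```python
# Quorum check as described above: a client attempts a write only if it can reach
# a strict majority of the replica servers, or exactly half of them provided the
# first server in the replica set is among those it can reach.
def have_write_quorum(reachable, replica_set):
    n = len(replica_set)
    if 2 * len(reachable) > n:
        return True                                   # strict majority
    return 2 * len(reachable) == n and replica_set[0] in reachable
```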
Although the description so far has mostly concentrated on writes, other modifications - e.g. create, symlink, setxattr - mostly work the same way. In the case of namespace operations followed by data operations - e.g. rename followed by write - care must be taken to ensure that the second operation is applied to the correct object. In the worst case, we might need to store UUIDs in the journal and use a UUID-to-path mapping maintained on each server (which would be useful for other reasons).
Full scans must still be available as a fallback, so that after a total brick failure the brick or its replacement can be repopulated from scratch.
## Vector Clocks ##
The conflict resolution protocol exists not only to handle server and network failures, but also transient conflicts that can occur when two (or more) clients write to two (or more) servers in different orders. For example, consider what happens if events regarding the same byte range occur in this order:
1. Client A sends a write ("lmnop") to servers X and Y
2. Client B sends a write ("rstuv") to the same two servers
3. Client A's write is performed at server X
4. Client B's write is performed at server Y
5. The other two writes complete
Now server X has "lmnop" and server Y has "rstuv" - a permanently inconsistent state since both writes seem complete (to their issuing clients). To avoid this sort of insanity, we use simple vector clocks[6]. Each server maintains its own "clock" (an integer sequence number), and the "vector clock" (really a clock vector) is an array of all such clocks for all replica servers. Every write or conflict-resolution message is accompanied by the vector clock from its sender, which might cause some clock values to be updated on the receiver. (Yes, this means the low-level RPC protocol must be extended.) Lastly, we define vector clock M as being later than vector clock N if and only if:
* For every component clock, M's value is *at least* N's
* At least one of M's clock values is *strictly greater* than N's
If these rules are followed, the vector clocks provide enough information to resolve most conflicts automatically. For truly simultaneous operations at different places, it's possible that the later-than relation described above does not apply in either direction. In these cases, some tie-breaker must be applied to make the comparison result deterministic again. To fulfill this role, a unique ID for the client attempting a write is also passed along with all messages related to that write.
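In code, the later-than relation and the tie-breaker could look roughly like this (clock vectors as plain integer lists, one slot per replica server; the names are illustrative):

```python
# "Later than" relation for clock vectors: every component at least as large,
# and at least one strictly larger.
def later_than(m, n):
    return all(a >= b for a, b in zip(m, n)) and any(a > b for a, b in zip(m, n))

# Deterministic ordering of two writes: use the clock vectors if they are
# comparable, otherwise fall back to the unique client IDs as a tie-breaker.
def write_order(m_clock, m_client, n_clock, n_client):
    if later_than(m_clock, n_clock):
        return "m_after_n"
    if later_than(n_clock, m_clock):
        return "n_after_m"
    return "m_after_n" if m_client > n_client else "n_after_m"
```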
## Advanced Write Protocol ##
The full write protocol uses vector clocks to deal with the various conflict and fault-recovery cases in three ways.
* Clients send their current vector clock, indicating the expected clock values at each server, along with each write. Each server checks the clock value contained in the write against its own current clock value. If there's a mismatch, indicating a write at the server that the client didn't know about, then the new write is rejected as conflicting.
* Servers maintain a journal of writes whose full disposition is not yet known. These writes can be forwarded to other servers immediately (to handle conflicts) or when a server comes back up (to handle faults).
* If a client's write fails everywhere (for conflict or other reasons) this means it had no effect anywhere and can simply be retried. If it succeeded at some servers and failed at others, then the client can tell one or more of the succeeding servers to forward the write immediately to the failing servers. As part of this write-forwarding process, vector clocks are used again to determine the order in which conflicting writes will be applied.
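The first of these, the per-server clock check, might look roughly like this (my_index, clocks and journal are assumed per-server state in this sketch, not existing structures):

```python
# Server-side handling of an incoming client write: reject on a clock mismatch,
# otherwise bump this server's own clock component, journal the write, and apply it.
def handle_write(server, write):
    if write.clocks != server.clocks:
        return "CONFLICT"                      # a write this client hadn't seen already landed here
    server.clocks[server.my_index] += 1        # advance this server's component
    server.journal.append((write, list(server.clocks)))   # disposition elsewhere still unknown
    server.apply(write)
    return "OK"
```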
The full write sequence is therefore as follows.
1. Client sends a write to all servers, with vector clocks.
2. Each server compares the write's vector clock to the server's own, possibly rejecting the write as conflicting.
3. Each server also adds the write to its local journal.
4. If the client's write succeeds everywhere, the client can complete it. In addition, a "cleanup" message is sent to each replica to retire the write's journal entry. This can be done asynchronously so that it doesn't increase write latency.
5. If the client's write fails everywhere - meaning that it had no permanent effect at all - it can simply be retried.
6. In mixed success/failure cases, the client sends "forward" messages to servers where the write succeeded, directing them to forward the write to other servers where it failed.
7. A server receiving a forwarded write uses the contained vector clock to determine its order relative to items in its own journal. The incoming write is either partially or fully applied according to the outcome of this ordering process.
8. The server receiving the forwarded write replies to the sender, which replies to the client. The forwarding server's journal is cleaned up along the way, as the write's disposition elsewhere becomes known.
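From the client's side, steps 1 and 4 through 6 amount to something like the following sketch (send_write(), send_cleanup() and send_forward() stand in for RPCs that don't exist in this form today; the write is assumed to carry the client's clock vector and unique ID):

```python
# Client-side sketch of the full write sequence: send everywhere, then clean up,
# retry, or request forwarding depending on the pattern of results.
def client_write(servers, write):
    results = {s: send_write(s, write) for s in servers}       # step 1: send everywhere
    succeeded = [s for s, r in results.items() if r == "OK"]
    failed    = [s for s, r in results.items() if r != "OK"]
    if not failed:
        for s in servers:
            send_cleanup(s, write)       # step 4: retire journal entries, asynchronously
        return "DONE"
    if not succeeded:
        return "RETRY"                   # step 5: no permanent effect anywhere, just retry
    for s in succeeded:                  # step 6: push the write to the servers where it failed
        send_forward(s, write, targets=failed)
    return "RESOLVING"                   # completion comes back via the forwarding servers
```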
Step 7 is really the key. By comparing vector clocks (with client IDs as tie-breakers) and applying some writes partially and others completely, even two servers forwarding writes to one another converge on a single set of file contents. To see this in action, let's expand a little on the example in the previous section.
1. Initial data on both servers is "AAAA", with vector clock {0,0} on each.
2. Client A sends a write for "BBB_" (where "_" means that character is not being written) to both servers with vector clock {0,0}.
3. Client B sends a write for "_CCC" to both servers with vector clock {0,0}.
4. Server X receives and accepts A's write. It now has data "BBBA" and vector clock {1,0}. It also has a journal entry for a write from A at {1,0}.
5. Server Y receives and accepts B's write. It now has data "ACCC" and vector clock {0,1}. It also has a journal entry for a write from B at {0,1}.
6. The other two writes (A->Y and B->X) fail due to version conflict.
7. Client A, seeing the mixed result, sends a "forward" message to X.
8. Server X forwards the write (client=A, clock={1,0}, data="BBB_") to Y.
9. Server Y receives the forwarded write and compares its clock to that of its own journaled write. Since neither {1,0} nor {0,1} is later than the other, the client ID tie-breaker is applied. A's write orders before B's, so it is applied only partially - only to the bytes B's later write didn't touch. This yields "BCCC" with a merged version vector of {1,1}.
10. Replies propagate back from Y to X to A. X's journal entry is retired, and A indicates completion of the write.
11. Client B, also seeing a mixed result, sends a "forward" message to Y, which forwards the write (client=B, clock={0,1}, data="_CCC") to X.
12. Server X receives the second forwarded write. This time the tie-break rule has the opposite effect, causing the forwarded write to apply in its entirety. X now has data "BCCC" with merged version {1,1} - the same as Y.
13. Replies propagate from X to Y to B, etc.
The same protocol also works and yields consistent results even if the two resolutions overlap, as has been verified by model checking[7].
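To make the partial-application rule in step 7 concrete, here's a toy sketch of the byte-range merge applied to the example above ("_" again marks bytes a request doesn't write; the structures are illustrative only):

```python
# Merge of a forwarded write against a journaled one. When the tie-breaker orders
# the incoming write *before* the journaled write, only the bytes the journaled
# write did not touch are taken from it; otherwise it is applied in full.
class W:
    def __init__(self, pattern):               # "_" means "byte not written"
        self.bytes = [None if c == "_" else c for c in pattern]

def apply_forwarded(data, incoming, journaled, incoming_first):
    out = list(data)
    for i, ch in enumerate(incoming.bytes):
        if ch is None:
            continue                            # incoming write doesn't cover this byte
        if incoming_first and journaled.bytes[i] is not None:
            continue                            # journaled write is later; keep its byte
        out[i] = ch
    return "".join(out)

# Server Y: data "ACCC", journaled _CCC from B; A's forwarded BBB_ orders first.
print(apply_forwarded("ACCC", W("BBB_"), W("_CCC"), incoming_first=True))    # -> BCCC
# Server X: data "BBBA", journaled BBB_ from A; B's forwarded _CCC orders last.
print(apply_forwarded("BBBA", W("_CCC"), W("BBB_"), incoming_first=False))   # -> BCCC
```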
## Consistency Issues ##
In the absence of a "cleanup" message, the same write forwarding with the same conflict-resolution rules can also be applied in response to a server restarting or becoming reachable after a network partition. In all cases, a write remains in a server's journal until that server is *sure* it has been fully forwarded and resolved at all other servers within its replica set.
In the server-unavailable case, if multiple overlapping writes occur at one of the surviving servers, the metadata-journaling approach can introduce a transient sort of inconsistency. Consider the following sequence.
1. Original state: file X=AAA, file Y=BBB
2. Write CC_ to file X (contents = CCA)
3. Write DDD to file Y (contents = DDD)
4. Write _EE to file X (contents = CEE)
If the journal contains only metadata, then the first will be forwarded to other servers (using data from the file) as CE_. This could result in a state of CEA on the destination server - i.e. part of the third write has effectively been applied before the second. Since these writes might have been fully acknowledged on the first server, this state - which the application might have carefully avoided on the first server by sequencing its writes - is invalid. There are four ways to deal with this.
A. Accept it, and write applications so that they don't rely on this level of consistency.
B. Do full data journaling for every write. This approach is simple, but has the worst performance characteristics of the consistency-preserving options.
C. Store the contents of the first write directly in the file, journal full data for later writes. This requires two disk writes for each user write during the failure, then a read plus a write to apply the journal during recovery. It also has terrible read performance, as reads must consult an arbitrary number of journal entries to find the current contents of a byte range.
D. Copy on write, to move overlapped data from the live file into the journal before it's overwritten. This requires a read plus two writes during the failure (worse than the previous option) but no additional I/O to replay the journal. Read performance is good because only the live file needs to be consulted for current data.
Based on this analysis, D is clearly preferable to B or C. Some users might still prefer A for its performance-during-failure characteristics, so that should also be available as an option.
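Here's a rough sketch of option D, assuming journal entries record (offset, length) plus a sparse map of copy-on-write bytes; file.read_byte() and file.write() are stand-ins for whatever the brick actually provides, with read_byte() returning a one-character string in this sketch:

```python
# Copy-on-write journaling: before a new write overwrites bytes that fall inside an
# earlier journal entry's range, those bytes are saved into that entry, so the entry
# can later be forwarded with the data it originally wrote rather than newer contents.
def journaled_write(file, journal, offset, data):
    end = offset + len(data)
    for entry in journal:
        lo = max(offset, entry["offset"])
        hi = min(end, entry["offset"] + entry["length"])
        for pos in range(lo, hi):                          # overlap with an older entry
            entry["saved"].setdefault(pos, file.read_byte(pos))
    journal.append({"offset": offset, "length": len(data), "saved": {}})
    file.write(offset, data)                               # the live file stays fully current

def entry_payload(file, entry):
    # Data to forward for this entry: current file contents, except where a newer
    # write overwrote them, in which case the saved copy-on-write bytes are used.
    return "".join(entry["saved"].get(pos, file.read_byte(pos))
                   for pos in range(entry["offset"], entry["offset"] + entry["length"]))
```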
## Notes ##
[0] No, this document isn't really proper Markdown. Close enough.
[1] "Experience with GlusterFS" http://www.devco.net/archives/2010/09/22/experience_with_glusterfs.php
[2] "Why GlusterFS is Glusterfsck'd Too" http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.htm...
[3] http://bugs.gluster.com/show_bug.cgi?id=2088
[4] Yes, partitions do occur even in a local network environment.
[5] http://bugs.gluster.com/show_bug.cgi?id=3533
[6] http://wiki.basho.com/Vector-Clocks.html (includes further references)
[7] See e.g. http://www.cs.utah.edu/formal_verification/Murphi/ for background, though this effort did not use Murphi. 391,368 states (after symmetry breaking) were checked for the two-client two-server scenario described above.