On 05/10/2010 03:54 PM, Steven Dake wrote:
Yeah, that's correct. The one consistency issue I can see occurs during
migration + failure of a node, where membership may change at the same
time a new node takes over writing activities. If this occurs, a read
which would have been targeted at position 3 may now be targeted at
position 2, while the write has been replicated to positions 4, 5, and 6
but not to 2 yet. In that case, the new node may end up reading an
out-of-sync replica. I had thought of using LTS for this since it's very
simple and I am familiar with the algorithm; maybe there are other
choices. The D1HT membership algorithm is not immediate and completes in
O(log n) time, so there is some window for inconsistency in the
membership views on the various nodes, resulting in a migrated node
reading from a different replica than the main writing replica.
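To make the failure mode concrete, here is a minimal Python sketch of
how two nodes with divergent membership views can compute different
replica sets for the same object. The ring_position and replica_set
functions and the node names are hypothetical, standing in for whatever
placement scheme the real system uses, not anything from the actual
code:

    import hashlib

    def ring_position(name):
        # Map a node or object name to a position on a hash ring
        # (a stand-in for the position-based placement above).
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    def replica_set(obj, membership, n_replicas=3):
        # Walk the ring clockwise from the object's position and
        # take the first n_replicas nodes.
        ring = sorted(membership, key=ring_position)
        start = ring_position(obj)
        idx = next((i for i, node in enumerate(ring)
                    if ring_position(node) >= start), 0)
        return [ring[(idx + k) % len(ring)] for k in range(n_replicas)]

    # The writer's view of membership vs. a newly joined node's view,
    # before the membership protocol has converged:
    writer_view = ["node1", "node2", "node4", "node5", "node6"]
    new_node_view = writer_view + ["node3"]

    # With divergent views, the same object can map to different
    # replica sets, so a read may land on a node the write never
    # reached.
    print(replica_set("objA", writer_view))
    print(replica_set("objA", new_node_view))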
I had to deal with this exact issue for CloudFS. Eventually I decided
to make the list of data servers for an object explicit, determined at
creation time and stored as object metadata instead of trying to
recalculate based on current membership. It's just too easy for the
recalculation scheme to yield an incorrect answer, not only when
membership changes but also when there are partitions, when tokens are
reassigned to distribute load, when different objects have different
replication/striping policies, etc. The explicit approach does mean an
extra lookup at open time, but it avoids a whole bunch of nasty cases
thereafter and also makes debugging/auditing easier.
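For illustration, a minimal sketch of that explicit approach; the
pick_servers, create_object, and open_object helpers are hypothetical,
not CloudFS's actual code:

    def pick_servers(membership, n_replicas):
        # Stand-in placement policy; the real one could be hash- or
        # load-based. The point is that it runs exactly once.
        return sorted(membership)[:n_replicas]

    def create_object(store, name, membership, n_replicas=3):
        # Pin the data-server list as object metadata at creation
        # time, instead of recomputing it from current membership.
        store[name] = {"servers": pick_servers(membership, n_replicas)}
        return store[name]["servers"]

    def open_object(store, name):
        # The extra lookup at open time: read the pinned list rather
        # than rederiving it, so readers and writers always agree on
        # where the replicas live.
        return store[name]["servers"]

    store = {}
    create_object(store, "objA", ["node1", "node2", "node4", "node5"])
    print(open_object(store, "objA"))  # stable even if membership changes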