On 05/10/2010 11:38 AM, Steven Dake wrote:
> As there is only one writer in every case, a simple Lamport
> time stamp would do the trick.
How is there only one writer? Either there's only one possible writer
globally, there's only one possible writer per object, or you have to
coordinate/reconcile between possible writers - potentially across
sites, despite partitions, etc. Nothing in the very brief description
I've seen implies either the first or second. The first is so
non-scalable it's beneath notice, so that's just as well. The second is
possible if you adopt a write-once model where the object ID has the
writer ID as a prefix and objects are immutable once written, but I
haven't seen any mention of that and in any case it leaves open the
question of how to know an object is fully written. In either the
initial-write or the migration case (which is specifically mentioned in
the requirements), it's necessary to define what will happen when
someone at one end of a slow/flaky network connection writes a block and
someone at the far end asks for it while it's still in transit, or what
happens when two nodes on opposite sides of a network partition - i.e.
lacking any means of coordination - attempt to write the same block.
Who blocks? Who fails? Who reconciles conflicts? Lamport clocks and
their relatives do not solve the problem by themselves. They merely
provide information that some other part of the system must use (often
using multiple clock vectors for multiple related pieces of data) to
arrive at a correct result.
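To make that concrete, here is a minimal Lamport-clock sketch (hypothetical illustration in Python, not from any real codebase): the clock only produces a counter that orders causally related events. When two partitioned writers touch the same block, their timestamps can tie, and even unequal timestamps between partitioned nodes imply no causal order - some other layer still has to pick a winner.

```python
# Minimal Lamport clock sketch (hypothetical names): each node keeps a
# counter, ticks it on local events, and merges on message receipt.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event, e.g. a write.
        self.time += 1
        return self.time

    def merge(self, received):
        # On receiving a message stamped with the sender's clock.
        self.time = max(self.time, received) + 1
        return self.time

# Two nodes on opposite sides of a partition write the same block.
a, b = LamportClock(), LamportClock()
ta = a.tick()   # node A writes block 42
tb = b.tick()   # node B writes the same block 42
# The clocks alone cannot say which write wins: ta == tb here, and
# the ordering they give for concurrent events is arbitrary. Conflict
# resolution (last-writer-wins, vectors, quorum, ...) is extra machinery.
assert ta == tb
```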
> The reason those full database systems you mention
> have all that complicated consistency management is because they
> potentially have multiple writers. With multiple writers, after a
> failure during an update, a consistent view must be created out of the
> mess. With one writer, the worst case that could happen is that *some*
> of the blocks were not written.
This is not something that only happens under failure conditions. In a
*distributed system*, data might be absent locally even when it's present
somewhere else in the system, and an EOF before the actual end of the
file - when the writer was already notified of completion at a clearly
earlier time - is incorrect data. The situation's even worse when the
data are being written to multiple destinations with no assurance of
ordering between them, and I've seen nothing so far that assures order.
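A toy illustration of that premature-EOF case (hypothetical Python, with made-up replica structures): a file's blocks are sent to two destinations with no ordering guarantee between them, and a reader at the lagging replica sees a short file even though the writer has been told the write completed elsewhere.

```python
# Hypothetical two-replica sketch: writes are "delivered" to each replica
# independently, with no ordering or timing guarantee across a slow link.
replica_a = {}   # blocks that have arrived at site A
replica_b = {}   # blocks that have arrived at site B

def write_block(offset, data, delivered_to):
    for replica in delivered_to:
        replica[offset] = data

def read_length(replica):
    # EOF as seen locally: length of the contiguous prefix of blocks.
    n = 0
    while n in replica:
        n += 1
    return n

# Both blocks have reached A, but only the *second* block has reached B
# so far - nothing ordered the deliveries between destinations.
write_block(0, b"x", delivered_to=[replica_a])
write_block(1, b"y", delivered_to=[replica_a, replica_b])

assert read_length(replica_a) == 2   # A sees the whole file
assert read_length(replica_b) == 0   # B reports EOF before any data
```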
> This happens with normal storage in a
> block system all the time during failure and is well tolerated by
> journal filesystems.
Yes, by having a journal that's fully/sequentially/consistently written
before the live filesystem, so that it can be replayed accurately even
if the writes to the live filesystem are interrupted. What do you
propose other nodes should do if they ask for data that's sitting in the
journal of a failed/unreachable node? Would those blocks be unavailable
for either reading or writing until that node comes back up, or do you
propose to implement a global distributed journal with all of the
necessary consistency and ordering to enable recovery? At some point
the hard distributed-system problems have to be solved, not just pushed
into another component.
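The guarantee a local journal provides can be sketched in a few lines (a toy write-ahead log in Python; the names and structure are illustrative, not any real filesystem): every update is committed to the journal, fully and in order, before the live store is touched, so a crash that loses live updates can be replayed. Note what the sketch does *not* give you: if the whole node is unreachable, no other node can get at the blocks sitting in its journal.

```python
# Toy write-ahead journal (illustrative only): journal first, then the
# live store, so local replay after a crash restores consistency.
class Journal:
    def __init__(self):
        self.records = []   # committed (block, data) records, in order
        self.live = {}      # the "live filesystem"

    def write(self, block, data):
        self.records.append((block, data))   # journal commit first...
        self.live[block] = data              # ...then the live update

    def replay(self):
        # The journal alone rebuilds the live state, because it was
        # written sequentially and completely before the live store.
        self.live = {}
        for block, data in self.records:
            self.live[block] = data

j = Journal()
j.write(7, b"hello")
j.live.clear()        # simulate losing the live copy in a crash
j.replay()
assert j.live[7] == b"hello"
# Local recovery works - but a remote node asking for block 7 while this
# node is down still gets nothing. The journal is not a distributed one.
```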