On 05/07/2010 07:34 PM, Steven Dake wrote:
> 3) Consider the client->storage protocol carefully. Will the
client
> access the vinzvault data store via TCP/IP network, or something faster
> (RDMA or proprietary virtualization bus)? If traversing a TCP/IP
> network, I recommend using a standardized storage protocol like iSCSI.
>
> In such a scenario, vinzvault could be implemented as a plug-in for the
> SCSI target project[1], thereby eliminating a large amount of the client
> work (kernel already has an iSCSI client). I also have my own iSCSI
> target daemon[2], but it is less mature than STGT.
>
Initial prototype to use something like tcp/ip - we can always worry
about binary compatibility later once we have something functional. The
advantage of tcp/ip is a well known programming model.. I expect over
time we will settle on something other then tcp/ip because of latency
properties of tcp/ip...
It might be worth looking at how existing parallel filesystems such as
PVFS2 and GlusterFS have already implemented such a dual TCP/IP and RDMA
abstraction.
> This is a key design decision. Pros: If a vinzvault client is
also a
> server, one reduces latency and number of total data copies. Cons: If
> a vinzvault client is also a server, the client must dedicate CPU and
> storage resources to unrelated processes, and potentially be subject to
> unanticipated, large resource loads.
This is a key issue for cloud deployments. Users already complain about
the high variability in I/O performance in e.g. AWS or Rackspace, due to
contention with other guests on the same host. If we extend the
possibility for such interference to guests on other hosts as well, that
would make an existing problem worse in the very environment that is
supposedly the project's raison d'etre.
Initially planning to make all d1ht members targets of the storage
area.
Longer term there is alot of flexibility here, for example introducing
'fake' indexes into the hash ring to allow one node to carry more
storage, or making a node 'invisible' to the membership of the d1ht.
These "fake" indices are very much an integral part of most ring-based
distributed stores since Dynamo (e.g. Voldemort or Riak but not Cassandra).
Because the membership protocol of d1ht is so scalable, we have alot
of
flexibility here to provide fake and invisible nodes in the membership
to help distribute load to nodes we want. For example, we could "lock"
a membership id which would make it "invisible" in the d1ht membership
that resolves hash table lookups. We could provide some special
modifiers to membership ids to allow them to proceed and follow the ring
ids to allow one node to carry more data.
Other possibilities are modifiers based upon location to allow
replication in multiple data centers.
I strongly recommend researching how this is already being done.
Cassandra has modular partitioners, replication policies, and "endpoint
snitches" to address this, though I believe they only work well as a
disaster-recovery solution and not for truly active-active situations.
I somewhat prefer the async/ordered queue approach used by EnterpriseDS
(Riak's for-pay big brother) for that. Multi-DC replication is not
something one can just bolt on later, IMO. The concerns that it
introduces wrt distribution and ordering/synchronization, even within an
eventually-consistent system, need to be considered as those
fundamentals are being worked out, even if the actual multi-DC
implementation is deferred.
> 4) Locality, locality, locality. A distributed hash table is
fantastic
> for widely distributing data across nodes -- but such a wide
> distribution can become a problem. Inefficiency and latency from
> overhead increases dramatically if a client must contact 1,000 nodes to
> read 1,000 512-byte sectors, for example.
>
Ya locality is an interesting problem here. One issue is that if one
node has 1 terabyte of data it wants to store, and only 250gb of storage
area. Have to balance redundancy, load distribution, and locality.
Unfortunately I don't have a real clear picture of how to solve the
locality problem. Likely require a pretty non-obvious solution...
This is where the various Dynamo-derived projects have done a lot of
work. With ring-based consistent hashing, it's fairly easy to shift
load by adding/removing/changing the tokens on the ring assigned to each
node; Cassandra, Riak and Voldemort all use variants of this technique,
as does my cloud fileystem. Experience with both those systems and with
parallel filesystems on their "home ground" using both TCP/IP and RDMA
also leads me to believe that striping a single object across too many
servers - let's say eight or so - very rarely turns out to be beneficial
performance. More often it just increases the difficulty of handling
failures or implementing the admission/flow control that always become
necessary in such systems.
Before we get too far into inventing supposedly new solutions to old
problems, I really think it behooves us to make sure that we've
adequately learned from others' experience in this complex area. Here
are some links to information that any serious practitioner should
already be familiar with before starting a new project that's likely to
repeat others' past mistakes.
General background:
Brewer's CAP Theorem:
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
Julian Browne on CAP:
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
me on CAP:
http://pl.atyp.us/wordpress/?p=2521
me on multi-DC replication:
http://pl.atyp.us/wordpress/?p=2824
Filesystems:
PVFS2:
http://www.pvfs.org/
GlusterFS:
http://www.gluster.org/
OceanStore:
http://oceanstore.cs.berkeley.edu/ (also check out Bayou
links from there)
Tahoe-LAFS (
http://tahoe-lafs.org/trac/tahoe-lafs)
my CassFS (not the same as CloudFS):
http://github.com/jdarcy/CassFS
The Big Three of CAP-aware storage:
Amazon's Dynamo:
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Google's BigTable:
http://labs.google.com/papers/bigtable.html
Yahoo's PNUTS:
http://research.yahoo.com/project/212\
The Big Three of open-source Dynamo descendants:
Cassandra:
http://cassandra.apache.org/
Voldemort:
http://project-voldemort.com/
Riak/EnterpriseDS:
http://basho.com/
I've had varying degrees of involvement with the authors for all of
these, from casual conversation to sponsorship or co-development, so I
can help point out the particular pieces of history or documentation or
field experience that are most relevant to this project if you'd like.
This is also an area of active work by many companies (e.g. Mezeo,
Zetta, Permabit, Cleversafe, MaxiScale, Nasuni) and intense patent
trolling, so I suggest extreme caution in that regard as well. Whatever
one might think of software patents, they are a current legal reality
which folks like Mitchell Prust and Network Backup Corporation have made
particularly relevant in the area of distributed storage. Having to
defend against an infringement suit would be bad, as would failure to
ensure that our own new ideas are properly disclosed and protected (e.g.
via
http://www.patentcommons.org/) to prevent subsequent
enclosure/capture by folks who are just waiting to capitalize on such
lapses. If we want our work to remain open, we must proactively ensure
that result and not just assume it will happen by itself.