On Fri, 2010-05-07 at 15:41 -0400, Jeff Garzik wrote:
> On 04/27/2010 07:36 PM, Steven Dake wrote:
> > Some of our requirements:
> > * Easy to use, deploy, and manage.
> > * 100,000 host count scalability.
> > * Only depend on commodity hardware systems.
> > * Migration works seamlessly within a datacenter without SAN hardware.
> > * VM block images can be replicated to N where N is configurable per VM
> >   image.
> > * VM block images can be replicated to various data centers.
> > * Low latency block storage access for all VMs.
> > * Tuneable block sizes per VM.
> > * Use standard network mechanisms to transmit blocks to the various
> >   replicas.
> > * Avoid multicast.
> > * Ensure only authorized host machines may connect to the vinzvault
> >   storage areas.
> > * No central metadata server - everything is 100% distributed.

> Neat project. There is a definite niche for vinzvault in the open
> source-o-sphere.
>
> Comments, in no particular order:
>
> 1) On the client side, block storage access should be via kernel block
> device driver. This ensures a minimum number of data copies, as well as
> maximum level of OS integration. Of course, that likely limits
> vinzvault to Linux-only, unless people want to step up and write
> non-Linux drivers.

Yes, I agree a kernel module is a big win for avoiding the extra memory
copies. I have started out with a FUSE filesystem because it is very
simple to prototype, and I expect our first prototypes of this work will
use FUSE simply because it is easy to work with. The downside of FUSE is
all the extra memory copies, which could be very expensive over the long
term for a VM. The progression I see here is FUSE -> kernel driver, and
possibly a qemu-specific block driver.

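To make the FUSE prototyping approach concrete, here is a minimal sketch
of a filesystem that exposes a single VM image as one file. This is
illustrative only, not vinzvault code; a real client would hash
(image, block) and fetch the data from the D1HT in the read path instead
of zero-filling.

/* Build: gcc -Wall vv_fuse.c $(pkg-config fuse --cflags --libs) -o vv_fuse
 * Mount: ./vv_fuse /mnt/vv   then point qemu at /mnt/vv/image       */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

#define IMAGE_PATH "/image"
#define IMAGE_SIZE (1024LL * 1024 * 1024)   /* pretend 1GB image */

static int vv_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, IMAGE_PATH) == 0) {
        st->st_mode = S_IFREG | 0644;
        st->st_nlink = 1;
        st->st_size = IMAGE_SIZE;
    } else {
        return -ENOENT;
    }
    return 0;
}

static int vv_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                      off_t off, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    fill(buf, ".", NULL, 0);
    fill(buf, "..", NULL, 0);
    fill(buf, IMAGE_PATH + 1, NULL, 0);
    return 0;
}

static int vv_open(const char *path, struct fuse_file_info *fi)
{
    return strcmp(path, IMAGE_PATH) == 0 ? 0 : -ENOENT;
}

static int vv_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi)
{
    if (off >= IMAGE_SIZE)
        return 0;
    if (off + (off_t)size > IMAGE_SIZE)
        size = IMAGE_SIZE - off;
    /* a real client would look up each block in the D1HT here */
    memset(buf, 0, size);
    return (int)size;
}

static struct fuse_operations vv_ops = {
    .getattr = vv_getattr,
    .readdir = vv_readdir,
    .open    = vv_open,
    .read    = vv_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &vv_ops, NULL);
}
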
2) How "close" is the client to the D1HT? By that, I mean,
will a
vinzvault client also participate fully in the D1HT, and provide some
amount of local storage to its peers? Or will there be a clear
distinction between vinzvault clients and vinzvault servers?
This is a key design decision. Pros: If a vinzvault client is also a
server, one reduces latency and number of total data copies. Cons: If
a vinzvault client is also a server, the client must dedicate CPU and
storage resources to unrelated processes, and potentially be subject to
unanticipated, large resource loads.
Initially I am planning to make all D1HT members targets of the storage
area. Longer term there is a lot of flexibility here, for example
introducing "fake" indexes into the hash ring to allow one node to carry
more storage, or making a node "invisible" to the membership of the
D1HT. Because the membership protocol of D1HT is so scalable, we have a
lot of room to use fake and invisible nodes in the membership to steer
load to the nodes we want. For example, we could "lock" a membership id,
making it "invisible" to the part of the D1HT membership that resolves
hash table lookups, or attach special modifiers to membership ids that
give a node extra positions on the ring so it carries more data.

Other possibilities are modifiers based upon location to allow
replication in multiple data centers.

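To illustrate the weighted and invisible membership ideas, here is a toy
consistent-hash ring; note this uses a plain successor-rule ring, not
the actual D1HT lookup algorithm, and all names in it are hypothetical.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_POINTS 1024

struct ring_point {
    uint32_t hash;    /* position on the ring */
    int node_id;
    int visible;      /* 0 = in the membership, but never resolves lookups */
};

static struct ring_point ring[MAX_POINTS];
static int npoints;

/* FNV-1a, a stand-in for whatever hash the real system would use */
static uint32_t hash32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;
    while (len--)
        h = (h ^ *p++) * 16777619u;
    return h;
}

/* weight > 1 inserts "fake" indexes so this node carries more data */
static void ring_add(int node_id, int weight, int visible)
{
    for (int i = 0; i < weight && npoints < MAX_POINTS; i++) {
        struct { int id; int idx; } k = { node_id, i };
        ring[npoints].hash = hash32(&k, sizeof(k));
        ring[npoints].node_id = node_id;
        ring[npoints].visible = visible;
        npoints++;
    }
}

/* successor of the key's hash on the ring, skipping invisible points */
static int ring_lookup(const char *key)
{
    uint32_t h = hash32(key, strlen(key));
    uint32_t best_dist = UINT32_MAX;
    int best = -1;

    for (int i = 0; i < npoints; i++) {
        uint32_t dist = ring[i].hash - h;   /* wraps modulo 2^32 */
        if (ring[i].visible && dist < best_dist) {
            best_dist = dist;
            best = ring[i].node_id;
        }
    }
    return best;
}

int main(void)
{
    ring_add(1, 1, 1);
    ring_add(2, 4, 1);   /* four ring points -> roughly 4x the keys */
    ring_add(3, 1, 0);   /* invisible: a member, but owns no lookups */
    printf("image0:block42 -> node %d\n", ring_lookup("image0:block42"));
    return 0;
}
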
> 3) Consider the client->storage protocol carefully. Will the client
> access the vinzvault data store via TCP/IP network, or something faster
> (RDMA or proprietary virtualization bus)? If traversing a TCP/IP
> network, I recommend using a standardized storage protocol like iSCSI.
> In such a scenario, vinzvault could be implemented as a plug-in for the
> SCSI target project[1], thereby eliminating a large amount of the client
> work (kernel already has an iSCSI client). I also have my own iSCSI
> target daemon[2], but it is less mature than STGT.

The initial prototype will use something like TCP/IP - we can always
worry about binary compatibility later, once we have something
functional. The advantage of TCP/IP is a well-known programming model. I
expect over time we will settle on something other than TCP/IP because
of its latency properties.

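For discussion, here is the rough shape a first-cut TCP/IP message
header might take; the layout and field names are hypothetical, not a
vinzvault wire format. Fixed-width, big-endian fields keep the "worry
about binary compatibility later" option open.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <endian.h>   /* htobe32/htobe64 (glibc) */

enum vv_op { VV_READ = 1, VV_WRITE = 2, VV_REPLY = 3 };

struct vv_hdr {
    uint32_t op;         /* enum vv_op */
    uint32_t length;     /* payload bytes following this header */
    uint64_t image_id;   /* which VM block image */
    uint64_t block;      /* block number within that image */
} __attribute__((packed));

/* build a read request; the caller write(2)s this over the socket and
 * then reads back a VV_REPLY header plus the block payload */
static void vv_make_read(struct vv_hdr *h, uint64_t image, uint64_t blk)
{
    memset(h, 0, sizeof(*h));
    h->op       = htobe32(VV_READ);
    h->image_id = htobe64(image);
    h->block    = htobe64(blk);
}

int main(void)
{
    struct vv_hdr h;
    vv_make_read(&h, 1, 42);
    printf("header is %zu bytes on the wire\n", sizeof(h));
    return 0;
}
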
> 4) Locality, locality, locality. A distributed hash table is fantastic
> for widely distributing data across nodes -- but such a wide
> distribution can become a problem. Inefficiency and latency from
> overhead increases dramatically if a client must contact 1,000 nodes to
> read 1,000 512-byte sectors, for example.

Ya, locality is an interesting problem here. One issue is what happens
when a node has 1 terabyte of data it wants to store but only 250 GB of
local storage area; we have to balance redundancy, load distribution,
and locality. Unfortunately I don't have a real clear picture of how to
solve the locality problem. It will likely require a pretty non-obvious
solution.

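One common mitigation in other distributed stores (an illustration, not
a settled vinzvault design) is to place data at extent granularity
rather than sector granularity, so sequential I/O lands on a small
number of nodes. The toy calculation below shows Jeff's 1,000-sector
example collapsing to a single DHT lookup with 4MB extents; all names
and sizes here are illustrative.

#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512ULL
#define EXTENT_SIZE (4ULL * 1024 * 1024)   /* tuneable, like block size */
#define SECTORS_PER_EXTENT (EXTENT_SIZE / SECTOR_SIZE)

/* the DHT key is derived from (image, extent), not (image, sector) */
static uint64_t extent_of(uint64_t sector)
{
    return sector / SECTORS_PER_EXTENT;
}

int main(void)
{
    uint64_t first = 0, count = 1000;   /* Jeff's example workload */
    uint64_t lo = extent_of(first);
    uint64_t hi = extent_of(first + count - 1);
    printf("%llu sectors span %llu extent lookup(s)\n",
           (unsigned long long)count,
           (unsigned long long)(hi - lo + 1));   /* prints 1 */
    return 0;
}
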
RE code, I am working on a simple FUSE FS and tools to convert qemu
images to the FUSE db (which uses a DB backing store). I mostly have
this working ;) and hope to automake-ize it soon to seed the source
tree.

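For reference, the conversion boils down to a loop like the following;
db_put here is a hypothetical stand-in for the DB backing store insert,
not the actual tool's API.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 4096   /* tuneable per VM in the real design */

/* hypothetical stand-in for the DB backing store insert */
static int db_put(uint64_t block, const void *data, size_t len)
{
    (void)data;
    printf("stored block %llu (%zu bytes)\n",
           (unsigned long long)block, len);
    return 0;
}

/* chunk a raw image into fixed-size blocks keyed by block number */
static int convert_image(const char *raw_path)
{
    FILE *f = fopen(raw_path, "rb");
    char buf[BLOCK_SIZE];
    uint64_t block = 0;
    size_t n;

    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        if (db_put(block++, buf, n) != 0) {
            fclose(f);
            return -1;
        }
    }
    fclose(f);
    return 0;
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <raw-image>\n", argv[0]);
        return 1;
    }
    return convert_image(argv[1]) ? 1 : 0;
}
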
Regards
-steve