On Jul 14, 2011, at 5:01 PM, Edward Shishkin wrote:
> In this case we store a file in parts which are not adjacent
> from the standpoint of CloudFS. This means we have to split
> reads, which makes the operation non-atomic: read(2) may
> return data composed of parts from different "versions".
>
> Example:
> Suppose we have a file F stored in two parts, F1 and F2.
> 1. Process A writes file F (producing version 1);
> 2. Process B reads file F (part F1);
> 3. Process C writes file F (producing version 2);
> 4. Process B reads file F (part F2).
> As a result, process B returns data composed of parts of
> two different versions, 1 and 2.
>
> This non-atomicity is different from the non-atomicity found
> in the kernel (local file systems): the kernel guarantees that
> all PAGE_SIZE reads at PAGE_SIZE-aligned offsets are atomic
> (because reads and writes in the kernel take page locks),
> whereas in our case F2 does not necessarily start at a
> PAGE_SIZE-aligned offset. So we may well get complaints from
> users who don't expect such non-atomicity.
The users should never see this non-atomicity, so those user
complaints are mythical. All accesses to a physical file on one brick
go through a single translator which can (and should) enforce any
necessary serialization.
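
A minimal sketch of what that serialization could look like, assuming a
hypothetical per-inode context guarded by a pthread rwlock. The type and
field names are illustrative, not actual GlusterFS/CloudFS structures,
and a real translator is callback-driven (it would queue pending frames
rather than block a thread), but the invariant is the same:

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical per-inode context; not an actual CloudFS structure. */
    typedef struct {
        pthread_rwlock_t lock;  /* serializes split reads vs. writes */
        off_t            eof;   /* in-core "actual" size, used further below */
    } cfs_inode_ctx_t;

    /* A split read holds the read lock across BOTH parts, so no write
     * (and hence no version change) can slip in between F1 and F2. */
    static int cfs_read_split(cfs_inode_ctx_t *ctx,
                              int fd1, void *b1, size_t n1, off_t o1,
                              int fd2, void *b2, size_t n2, off_t o2)
    {
        int ret = 0;

        pthread_rwlock_rdlock(&ctx->lock);
        if (pread(fd1, b1, n1, o1) < 0 || pread(fd2, b2, n2, o2) < 0)
            ret = -1;
        pthread_rwlock_unlock(&ctx->lock);
        return ret;
    }

    /* Truncates and appending writes take the lock exclusively, so a
     * reader always sees a single consistent version. */
    static ssize_t cfs_write_locked(cfs_inode_ctx_t *ctx, int fd,
                                    const void *buf, size_t n, off_t off)
    {
        ssize_t ret;

        pthread_rwlock_wrlock(&ctx->lock);
        ret = pwrite(fd, buf, n, off);
        pthread_rwlock_unlock(&ctx->lock);
        return ret;
    }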
> Moreover, when the EOBs are HMACs used for integrity checking
> or authentication, we'll get false positives, since nothing
> guarantees that the versions of an HMAC and of its
> corresponding data block will coincide.
>
> Solution:
> In this approach we need to serialize truncates, appending
> writes, and RbRe sequences (read block, read EOB).
> Approach 2: Storing in the file's body.
>
> In this case EOBs are stored in the file's body (by appending
> to the file in the case of EOF, or interleaving the file with
> HMACs, etc.). The file together with its EOBs is a single
> whole from the standpoint of CloudFS, so there are no
> atomicity problems specific to Approach 1. However, all files
> maintained by the low-level local fs will then have increased
> sizes (grown by the total size of all EOBs), so the actual
> file size must be stored as an additional attribute (e.g. as
> an xattr value).
So we're back to being non-atomic, since the actual write and the
xattr operation to set the EOF marker are separate. Note also that
POSIX does allow partial writes, so we can't really issue the xattr op
until we have the *result* from the write in hand. It does no good to
say that the EOF can be stored in memory, because that would allow the
inconsistency to become persistent in case of a crash. Every time EOF
changes, the change must be recorded persistently. This will happen
for every write that extends the file, in contrast to approach 1 in
which an xattr update is only necessary if the old and/or new EOF is
unaligned.
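
To make that ordering concrete, here is a sketch of an extending write
under approach 2. The xattr name is a hypothetical choice for
illustration; the gap between the two steps is exactly the crash window
described above:

    #include <sys/types.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    /* Hypothetical xattr key for the actual size; the real name is an
     * open design choice. */
    #define CFS_SIZE_XATTR "trusted.cloudfs.size"

    static ssize_t cfs_extending_write(int fd, const void *buf,
                                       size_t count, off_t offset)
    {
        /* Step 1: the data write.  POSIX allows this to be partial, so
         * the new EOF is unknown until the result is in hand. */
        ssize_t written = pwrite(fd, buf, count, offset);
        if (written <= 0)
            return written;

        /* Step 2: persist the new EOF.  A crash between steps 1 and 2
         * leaves data and recorded size inconsistent, and this pair of
         * steps runs on every write that extends the file. */
        off_t new_eof = offset + written;
        if (fsetxattr(fd, CFS_SIZE_XATTR, &new_eof,
                      sizeof(new_eof), 0) < 0)
            return -1;
        return written;
    }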
With HMACs (out of scope for this version of CloudFS anyway) there's
another problem. We can read or write intermingled data and HMACs as
one contiguous extent easily enough, but converting between this form
and the one the user expects will require allocation of a second
buffer and piecemeal copying from one to the other. That's
not exactly free, even on fast processors, especially when GlusterFS
tends to become CPU-bound on 10GbE or better already. Avoiding seeks
is a noble goal, and this approach might do that for small requests on
a single spinning disk with no caching, but there are other scenarios
- large requests, SSDs, multiple disks, warm/non-volatile caches -
where it would be notably worse than separate data and HMAC areas. We
need to evaluate the effect for different configurations and access
patterns carefully, not just make guesses without evidence. I
strongly suspect that the separate-HMAC approach would actually serve
us better, but even as the project leader I wouldn't presume to treat
my guesses as fact.
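
For a sense of what that conversion involves, here is a sketch of
de-interleaving a contiguous on-disk extent (each data block followed by
its HMAC) into the plain buffer the user expects. The block and digest
sizes are assumptions for illustration only:

    #include <stddef.h>
    #include <string.h>

    /* Assumed on-disk layout for illustration: every BLOCK_SIZE bytes
     * of data is followed by an HMAC_SIZE digest. */
    #define BLOCK_SIZE 4096
    #define HMAC_SIZE  32   /* e.g. HMAC-SHA256 */

    /* Copy only the data bytes out of an interleaved extent.  Every
     * read pays for this second buffer plus one memcpy() per block,
     * which is the CPU cost in question. */
    static size_t cfs_deinterleave(const char *raw, size_t raw_len,
                                   char *out, size_t out_len)
    {
        size_t in = 0, done = 0;

        while (in < raw_len && done < out_len) {
            size_t chunk = raw_len - in;
            if (chunk > BLOCK_SIZE)
                chunk = BLOCK_SIZE;
            if (chunk > out_len - done)
                chunk = out_len - done;
            memcpy(out + done, raw + in, chunk);
            done += chunk;
            in += chunk + HMAC_SIZE;  /* skip the trailing HMAC */
        }
        return done;
    }

Writes pay the mirror-image cost, interleaving the user's buffer with
freshly computed HMACs before anything hits the disk.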
> The ->open() method of the high-level translator loads the
> actual file size into the CloudFS-specific part of the inode
> via ->getxattr(), so that it is resident in memory on the
> server. Any ->truncate() or appending ->write() in the
> high-level xlator updates the in-core and on-disk actual
> sizes simultaneously.
"Simultaneously" has no meaning here. The on-disk update is issued
"downward" (toward the disk) and is completed asynchronously some time
later. The only way for the in-memory update to be *effectively*
simultaneous is if we block all other accesses while the on-disk
update is in progress. This is exactly the kind of serialization and
careful sequencing of sub-operations that you identify as a necessity
for approach 1, but it turns out that it's necessary for approach 2 as
well. The need for serialization does not help us distinguish between
the two approaches, because they have that in common.
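
Concretely, "effectively simultaneous" comes down to something like the
following, reusing the hypothetical cfs_inode_ctx_t and CFS_SIZE_XATTR
from the sketches above (again a blocking simplification of what a
callback-driven translator would do by queueing):

    /* Reuses the hypothetical cfs_inode_ctx_t and CFS_SIZE_XATTR
     * defined in the sketches above. */
    static int cfs_update_eof(cfs_inode_ctx_t *ctx, int fd, off_t new_eof)
    {
        int ret = 0;

        pthread_rwlock_wrlock(&ctx->lock);
        ctx->eof = new_eof;                     /* in-core update */
        if (fsetxattr(fd, CFS_SIZE_XATTR, &new_eof,
                      sizeof(new_eof), 0) < 0)  /* on-disk update */
            ret = -1;
        /* No other access proceeds until this unlock; that blocking is
         * precisely the serialization approach 2 was supposed to avoid. */
        pthread_rwlock_unlock(&ctx->lock);
        return ret;
    }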
P.S. This doesn't really belong on gluster-devel, since it's purely a
CloudFS issue.