There are three basic issues that need to be addressed in the encryption
module: type of cipher used, initialization-vector handling, and
conflict management. Each is non-trivial, so I'll address them in turn.
= Cipher
The main factor affecting our choice of ciphers (or APIs to them) is
that we need to be able to deal efficiently with updates both in the
middle of the file and at the end. At EOF, the problem is that we need
a whole cipher-block in order to decrypt, but the file might actually
end at any byte boundary within that cipher-block. Therefore, we have
to deal with the "residue" somehow. The obvious options are:
* Store the residue in an xattr.
* Store a whole cipher-block at the end, record the amount of padding in
an xattr.
* Use a stream cipher (or block cipher converted to a stream cipher).
This problem is further compounded by the striping case, where EOF for a
stripe component (local file stored on one brick) might not be EOF for
the entire file (union of all stripe components).
Since the two xattr-based approaches both require extra calls, the
stream-cipher approach has been used, with the cipher resetting at block
(e.g. 4KB) boundaries to allow efficient middle-of-file updates. As it
turns out, pure stream ciphers are relatively uncommon. More often,
CFB/OFB/CTR methods are used to convert a block cipher into a stream
cipher. The OpenSSL documentation is *amazingly* bad, but it looks like
it should be pretty easy to use any of these techniques with AES as well
as with DES.
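To make the block-reset idea concrete, here's a toy Python sketch of the CTR-style construction: a keystream derived from (key, IV, block number, counter), restarted at every 4KB boundary. SHA-256 stands in for the real block cipher here purely so the example is self-contained; the actual module would use AES or DES through OpenSSL, and all names below are invented for illustration.

```python
import hashlib

BLOCK = 4096  # keystream restarts at this boundary, enabling mid-file updates

def _keystream(key: bytes, iv: bytes, blk_no: int, length: int) -> bytes:
    """CTR-style keystream for one filesystem block. SHA-256 stands in
    for the real block cipher; the structure is the same either way:
    F(key, iv, blk_no, counter) for counter = 0, 1, 2, ..."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + blk_no.to_bytes(8, "big") +
                              counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xcrypt(key: bytes, iv: bytes, offset: int, data: bytes) -> bytes:
    """Encrypt or decrypt (XOR is its own inverse) at an arbitrary file
    offset. Only the keystream for the touched blocks is generated, so a
    middle-of-file update costs O(bytes written), not O(file size)."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        blk_no, off = divmod(offset + pos, BLOCK)
        n = min(BLOCK - off, len(data) - pos)
        ks = _keystream(key, iv, blk_no, off + n)
        out += bytes(d ^ k for d, k in zip(data[pos:pos + n], ks[off:]))
        pos += n
    return bytes(out)
```

Note the stream-cipher property that motivates this whole design: re-encrypting any byte range in isolation yields exactly the bytes that a full-file encryption would have produced at that range, and there is no residue problem at EOF because no padding is ever needed.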
= Initialization vector
Right now, the code uses a constant IV, which is totally unacceptable
from a security standpoint and was always meant to be changed before
release. The question is: what should we use for an IV? GlusterFS does
attach a supposedly unique "gfid" as an xattr on each file, so that
might be usable as a basis for the IV. First, though, we'd have to
verify that gfids are universal and stable enough; if one were ever
missing or changed, the data encrypted under it would become
unrecoverable.
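Assuming the gfid does pan out, the derivation could be as simple as hashing it down to the cipher's IV size. This is a sketch only; the function name and the domain-separation prefix are my inventions, and it inherits the caveat above that the gfid must be unique and stable for the file's lifetime.

```python
import hashlib

def iv_from_gfid(gfid: bytes) -> bytes:
    """Derive a 16-byte IV from the file's gfid xattr. Hashing (rather
    than using the gfid bytes directly) gives a uniformly distributed IV
    even when gfids share internal structure. The prefix just separates
    this use of the hash from any other use of the same gfid."""
    return hashlib.sha256(b"cloudfs-iv:" + gfid).digest()[:16]
```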
= Conflict management
For partial-block writes, the encryption module needs to do the
following atomically.
* Read the current block contents.
* Decrypt.
* Overlay the new partial block on the old whole block.
* Encrypt.
* Write the entire block.
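The five steps above can be sketched as a single read-modify-write unit. The I/O and cipher callables here are placeholders for the real brick calls, not actual GlusterFS APIs; atomicity has to come from the transaction machinery discussed below, not from this function itself.

```python
BLOCK = 4096  # cipher-reset block size, per the Cipher section

def partial_block_write(read_block, write_block, decrypt, encrypt,
                        blk_no, offset, new_data):
    """One partial-block update. read_block/write_block/decrypt/encrypt
    stand in for the real brick I/O and cipher operations; the whole
    sequence must execute atomically w.r.t. other writers."""
    assert offset + len(new_data) <= BLOCK
    old_ct = read_block(blk_no)                      # 1. read current block
    old_pt = decrypt(blk_no, old_ct)                 # 2. decrypt
    new_pt = (old_pt[:offset] + new_data             # 3. overlay new data
              + old_pt[offset + len(new_data):])
    new_ct = encrypt(blk_no, new_pt)                 # 4. encrypt
    write_block(blk_no, new_ct)                      # 5. write whole block
```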
There's some additional complexity to do with EOF, but that's the basic
idea. The current code eschews locks in favor of "optimistic"
concurrency control in which a server-side "oplock" translator maintains
a generation number for each inode. Clients can start a "transaction"
before they read, associating the current inode generation with their
connection. The next write on that connection compares the stored
generation number against the current one. If they're not the same, that
means there was another write since the transaction started, and the
write is rejected so the client can start over. Unfortunately, this
does not account for "self conflicts", in which one client sends
multiple writes to the same file in parallel. The standard
performance/write-behind translator does this constantly, which is why
it has to be disabled when using cloudfs encryption, and there are many
other ways for it to happen.
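A toy model of the generation-number scheme shows the self-conflict directly; the class and method names are mine, not the oplock translator's:

```python
class OpLock:
    """Toy model of the server-side oplock translator: one generation
    number per inode, bumped on every successful write."""
    def __init__(self):
        self.generation = 0

    def begin_txn(self):
        # Transaction start: return the generation the client read at.
        return self.generation

    def write(self, txn_generation):
        # Reject if anyone wrote since the transaction began -- even if
        # that "anyone" was another in-flight write from the same client.
        if txn_generation != self.generation:
            return False
        self.generation += 1
        return True
```

Two transactions opened back-to-back by the same client (exactly what write-behind produces) both capture the same generation, so whichever write lands second is rejected and must retry.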
My first inclination would be to add client code which detects and
avoids such self-conflict, but I have a sneaking suspicion that will be
pretty complex and have to be tweaked a lot to avoid compromising
performance. I kind of suspect that server-side queuing might be the
right answer here. If a transaction is begun which conflicts with
another already in progress, the new one is simply queued behind the
old one, and the transaction-begin call (actually a special setxattr)
is resumed when the earlier transactions complete. This also addresses
fairness/forward-progress issues inherent in both the locking and retry
models, though we'll need to put some thought into recovery from faults.