Neal, thanks for the feedback. After taking your comments into consideration, here's version 2.
+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+
|   ID    | Compression type |  Index size   |
+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+
+==================+=================+
| Compressed Index | Compressed Dict |
+==================+=================+
+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
ID: '\0ZCK1', identifies the file as a zchunk version 1 file
Compression type: The type of compression used to compress the dict and chunks
Current values: 0 - Uncompressed, 2 - zstd
Index size: A 64-bit unsigned integer containing the size of the compressed index.
Compressed Index: The index, which is described in the next section. The index is compressed without a custom dictionary.
Compressed Dict (optional): A custom dictionary used when compressing each chunk. Because each chunk is compressed completely separately from the others, the custom dictionary gives us much better overall compression. The custom dictionary is itself compressed without a custom dictionary (for obvious reasons).
Chunk: A chunk of data, compressed with the custom dictionary provided above.
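To illustrate how the dict and chunks fit together (this is not part of the spec, and the names are made up), a single chunk could be decompressed with zstd's one-shot dictionary API roughly like this, assuming the compressed dict has already been extracted and decompressed and the chunk's uncompressed size is known or over-estimated:

/* Sketch only: decompress one chunk using the shared custom dict. */
#include <stdlib.h>
#include <zstd.h>

void *decompress_chunk(const void *chunk, size_t chunk_size,
                       const void *dict, size_t dict_size,
                       size_t uncompressed_size)
{
    void *out = malloc(uncompressed_size);
    ZSTD_DCtx *dctx = ZSTD_createDCtx();
    if (out == NULL || dctx == NULL)
        goto error;

    /* Each chunk is an independent zstd frame, so it can be decompressed
     * on its own as long as the same dict is supplied. */
    size_t ret = ZSTD_decompress_usingDict(dctx, out, uncompressed_size,
                                           chunk, chunk_size,
                                           dict, dict_size);
    if (ZSTD_isError(ret))
        goto error;

    ZSTD_freeDCtx(dctx);
    return out;

error:
    ZSTD_freeDCtx(dctx);
    free(out);
    return NULL;
}

Compression works the same way with ZSTD_compress_usingDict(), which is what makes the shared dict worthwhile even though every chunk is compressed independently.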
The index:
+---------------+======================+
| Checksum type | Checksum of all data |
+---------------+======================+
+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+
+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  | ==> More
+================+-+-+-+-+-+-+-+-+
Checksum type: The type of checksum used to generate the checksums in the index.
Current values: 0 = SHA-256
Checksum of all data: The checksum of the compressed dict and all the compressed chunks, used to verify that the file is actually the same, even in the unlikely event of a hash collision for one of the chunks.
Dict checksum: The checksum of the compressed dict, used to detect whether two dicts are identical. If there is no dict, the checksum must be all zeros.
End of dict: The location of the end of the dict, measured from the end of the index. This gives us the information we need to find and decompress the dict. If there is no dict, this must be zero.
Chunk checksum: The checksum of the compressed chunk, used to detect whether any two chunks are identical.
End of chunk: The location of the end of the chunk, measured from the end of the index. This gives us the information we need to find and decompress each chunk.
The index is designed so that it can be extracted from the file on the server and downloaded separately, making it possible to download only the parts of the file that are needed, but it must then be re-embedded when assembling the file so the user only needs to keep one file.
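As an illustration of how a client would use those end offsets (this is not part of the spec; the names are invented), the byte range of chunk i can be computed from the previous entry's end offset:

/* Sketch only: turn the cumulative "end of chunk" offsets from the index
 * into an absolute byte range for chunk i.  data_start is the file offset
 * of the first byte after the embedded index (where the compressed dict,
 * if present, begins); end_of_dict and end_of[] come from the index. */
#include <stddef.h>
#include <stdint.h>

struct byte_range {
    uint64_t start;   /* first byte of the compressed chunk in the file */
    uint64_t end;     /* last byte of the compressed chunk (inclusive)  */
};

struct byte_range chunk_range(uint64_t data_start, uint64_t end_of_dict,
                              const uint64_t *end_of, size_t i)
{
    struct byte_range r;
    uint64_t prev_end = (i == 0) ? end_of_dict : end_of[i - 1];

    r.start = data_start + prev_end;
    r.end   = data_start + end_of[i] - 1;   /* inclusive, HTTP-style */
    return r;
}

Each chunk the client doesn't already have then becomes one entry in a Range header against the full file on the mirror.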
I've been working on a C implementation of this spec, and came up with a few other changes. I think it's important to have a checksum of the index as well as the data as we want to be able to verify that the index is as expected before trying to parse it.
I've also added in the ability to use a different hash type for the chunk checksums versus the index checksum and overall checksum. The idea is that a weaker checksum may be chosen for the chunks to reduce the size of the index without weakening overall verification.
+-+-+-+-+-+---------------+================+------------------+
|   ID    | Checksum type | Index checksum | Compression type |
+-+-+-+-+-+---------------+================+------------------+
+-+-+-+-+-+-+-+-+==================+=================+
|  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+==================+=================+
+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
ID: '\0ZCK1', identifies the file as a zchunk version 1 file
Checksum type: An 8-bit unsigned integer containing the type of checksum used to generate the index checksum and the total data checksum, but *not* the chunk checksums.
Current values: 0 = SHA-1, 1 = SHA-256
Index checksum: The checksum of everything from this point until the end of the index. It covers the compression type, the index size, and the compressed index.
Compression type: An 8-bit unsigned integer containing the type of compression used to compress the dict and chunks.
Current values: 0 - Uncompressed, 2 - zstd
Index size: A 64-bit LE unsigned integer containing the size of the compressed index.
Compressed Index: The index, which is described in the next section. The index is compressed without a custom dictionary.
Compressed Dict (optional): A custom dictionary used when compressing each chunk. Because each chunk is compressed completely separately from the others, the custom dictionary gives us much better overall compression. The custom dictionary is itself compressed without a custom dictionary (for obvious reasons).
Chunk: A chunk of data, compressed with the custom dictionary provided above.
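To show what "verify before parse" might look like in practice (just a sketch, not the reference implementation; sha256() stands in for whatever hash library is actually used, and checksum type 1 is assumed so the digest is 32 bytes):

/* Sketch only: verify the version 3 index checksum before parsing the index. */
#include <stdint.h>
#include <string.h>

void sha256(const uint8_t *buf, size_t len, uint8_t out[32]);  /* assumed */

int index_is_valid(const uint8_t *file, size_t file_len)
{
    const size_t cksum_off = 5 + 1;            /* after ID and checksum type */
    const size_t covered_off = cksum_off + 32; /* compression type onward    */

    if (file_len < covered_off + 1 + 8)
        return 0;

    /* Index size: 64-bit LE integer right after the compression type byte */
    uint64_t index_size = 0;
    for (int i = 0; i < 8; i++)
        index_size |= (uint64_t)file[covered_off + 1 + i] << (8 * i);

    /* The index checksum covers compression type + index size + index */
    uint64_t covered_len = 1 + 8 + index_size;
    if (file_len - covered_off < covered_len)
        return 0;

    uint8_t digest[32];
    sha256(file + covered_off, (size_t)covered_len, digest);
    return memcmp(digest, file + cksum_off, 32) == 0;
}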
The index:
+---------------------+-+-+-+-+-+-+-+-+======================+
| Chunk checksum type |  Chunk count  | Checksum of all data |
+---------------------+-+-+-+-+-+-+-+-+======================+
+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+
+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  | ==> More
+================+-+-+-+-+-+-+-+-+
Chunk checksum type: An 8-bit unsigned integer containing the type of checksum used to generate the chunk checksums.
Current values: 0 = SHA-1, 1 = SHA-256
Chunk count: The number of chunks in the zchunk file.
Checksum of all data: The checksum of everything after the index, including the compressed dict and all the compressed chunks. This checksum is generated using the overall checksum type, *not* the chunk checksum type.
Dict checksum: The checksum of the compressed dict, used to detect whether two dicts are identical. If there is no dict, the checksum must be all zeros.
End of dict: A 64-bit LE unsigned integer containing the location of the end of the dict, measured from the end of the index. This gives us the information we need to find and decompress the dict. If there is no dict, this must be zero.
Chunk checksum: The checksum of the compressed chunk, used to detect whether any two chunks are identical.
End of chunk: The location of the end of the chunk, measured from the end of the index. This gives us the information we need to find and decompress each chunk.
The index is designed so that it can be extracted from the file on the server and downloaded separately, making it possible to download only the parts of the file that are needed, but it must then be re-embedded when assembling the file so the user only needs to keep one file.
On Wed, Feb 28, 2018 at 6:52 PM, Jonathan Dieter <jdieter@gmail.com> wrote:
I've been working on a C implementation of this spec, and came up with a few other changes. I think it's important to have a checksum of the index as well as the data as we want to be able to verify that the index is as expected before trying to parse it.
Hey,
I've just replied to this thread but forgot to include the In-Reply-To header so it's not showing here. Here's the message:
https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedorapro...
Sorry for the inconvenience.
Michal
On Mon, 2018-03-12 at 15:42 +0100, Michal Domonkos wrote:
Hi Jonathan,
To me, the zchunk idea looks good.
Incidentally, for the last couple of months, I have been trying to rethink the way we cache metadata on the clients, as part of the libdnf (re)design efforts. My goal was to de-duplicate the data between similar repos in the cache as well as decrease the size that needs to be downloaded every time (inevitably leading to this topic).
I came up with two different strategies:
- Chunking
<snip>
That made me think that either using git (libgit2) directly or doing a small, lightweight implementation of the core concepts might be the way to go. I even played with the latter a bit (I didn't get to breaking down primary.xml, though): https://github.com/dmnks/rhs-proto
In the context of this thread, this is basically what you do with zchunk (just much better) :)
Yes, I guess. ;) The git concept sounds interesting, but it will be a lot of work and will require some huge changes in how we deal with metadata. For the moment, I think I'll focus on more evolutionary changes.
- Deltas
Later, during this year's devconf, I had a few "brainstorming" sessions with Florian Festi who pointed out that the differences in metadata updates might often be on the sub-package level (e.g. NEVRA in the version tag) so chunking on the package boundaries might not give us the best results possible. Instead perhaps, we could generate deltas on the binary level.
My current tests were chunked on srpm boundaries, not package boundaries (not sure if we're using the same terminology here). The problem with chunking on the package boundary in the xml is that it creates many more chunks than grouping by srpm, and the smaller chunks hurt compression (even with a dictionary) while the larger number of chunks increases the size of the index. In Fedora, I think it's impossible to only change the metadata for one sub-package belonging to an srpm, and, even if it *is* possible, it's very rare.
An alternative would be to pre-generate (compressed) binary deltas for the last N versions and let clients download an index file that will tell them what deltas they're missing and should download. This is basically what Debian's pdiff format does. One downside to this approach is that it doesn't give us the de-duplication on clients consuming multiple repos with similar content (probably quite common with RHEL subscriptions at least).
I'm not a huge fan of pre-generated binary deltas. They might give smaller deltas than a chunking solution, but at the cost of generating and maintaining the deltas. It's been enough of a pain for rpms that (at least on an individual basis) change infrequently. For metadata that changes every day, I think it would be too much.
The beauty of the zchunk format (or zsync, or any other chunked format) is that we don't have to download different files based on what we have, but rather, we download either fewer or more parts of the same file based on what we have. From the server side, we don't have to worry about the deltas, and the clients just get what they need.
Having said that, the current zchunk proposal doesn't really address deduplication from multiple repos with similar content. I suppose some way of extending zchunk to allow you to specify multiple local sources would be a way of fixing that, but I'm still working on getting it to build from *one* local source (getting very close, but not quite there).
Then I stumbled upon casync which combines the benefits of both strategies; it chunks based on the shape of the data (arguably giving better results than chunking on the package boundaries), and it doesn't require a smart protocol. However, it involves a lot of HTTP requests as you already mentioned.
Just to be clear, zchunk is casync with smart boundaries (or, better put, application-specified boundaries) and a single file backend so requests can be sent as http range requests rather than hundreds of individual http requests.
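Just as an illustration (the path and offsets are made up), the missing chunks can be fetched in a single round trip with a multipart range request:

GET /repodata/primary.xml.zck HTTP/1.1
Host: mirror.example.org
Range: bytes=4096-8191,20480-24575,65536-73727

The server replies with a multipart/byteranges body containing only those pieces, which the client splices into its locally reassembled file.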
Despite that, I'm still leaning towards chunking as being the better solution of the two. The question is how much granularity we want. You made a good point: the repodata format is fixed (be it xml or solv), so we might as well take advantage of it to detect boundaries for chunking, rather than using a rolling hash (but I have no data to back it up). I'm not sure how to approach the many-GET-requests (or the lack of range support) problem, though.
If you've been following the conversation between myself and Michael Schroeder on rpm-ecosystem (starting with http://lists.rpm.org/pipermail/rpm-ecosystem/2018-March/000553.html), I did some comparisons between zsync (which uses a rolling hash) and zchunk. There's also the fact that zsync uses gzip compression, while zchunk uses zstd by default, but quite a bit of the difference is due to the larger window size a rolling hash provides.
With zchunk, if you get a server with no range support, you just download the full zchunk file. I did run a check on the Fedora mirrors, and I think there were two or three that didn't support range requests at all.
As part of my efforts, I created this "small" git repo that contains metadata snapshots since ~February, which can be useful for seeing what typical metadata updates look like. Feel free to use it (e.g. for testing out zchunk): https://pagure.io/mdhist
This looks great! I've got some random days in January and February, so this looks far more consistent.
Thanks much for your insight, Jonathan
Here's version four, with a swap from fixed-length integers to variable-length compressed integers, which allows us to skip compression of the index (since the non-integer data is all incompressible checksums). I've also added the uncompressed size of each chunk to the index to make it easier to figure out how much space to allocate for the uncompressed chunk.
+-+-+-+-+-+====================+=================+========================+
|   ID    | Checksum type (ci) | Header checksum |  Compression type (ci) |
+-+-+-+-+-+====================+=================+========================+
+=================+=======+=================+
| Index size (ci) | Index | Compressed Dict |
+=================+=======+=================+
+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
(ci) Compressed (unsigned) integer - A variable-length little-endian integer where the first seven bits of the number are stored in the first byte, followed by the next seven bits in the next byte, and so on. The top bit of every byte except the final byte must be zero, and the top bit of the final byte must be one, indicating the end of the number. (A short sketch of this encoding follows the field definitions below.)
ID: '\0ZCK1', identifies the file as a zchunk version 1 file
Checksum type: An integer containing the type of checksum used to generate the header checksum and the total data checksum, but *not* the chunk checksums.
Current values: 0 = SHA-1, 1 = SHA-256
Header checksum: The checksum of everything from the beginning of the file until the end of the index, calculated with this header checksum field set to all \0's.
Compression type: An integer containing the type of compression used to compress the dict and chunks.
Current values: 0 - Uncompressed, 2 - zstd
Index size: An integer containing the size of the index.
Index: The index, which is described in the next section.
Compressed Dict (optional): A custom dictionary used when compressing each chunk. Because each chunk is compressed completely separately from the others, the custom dictionary gives us much better overall compression. The custom dictionary is itself compressed without a custom dictionary (for obvious reasons).
Chunk: A chunk of data, compressed with the custom dictionary provided above.
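As promised above, here's a rough sketch of the compressed integer encoding (not from the spec text itself; note that the final-byte marker is the *set* top bit, the reverse of the usual LEB128 convention):

/* Sketch only: encode/decode the compressed integer described above.
 * out must have room for up to 10 bytes (enough for any 64-bit value). */
#include <stddef.h>
#include <stdint.h>

size_t ci_encode(uint64_t val, uint8_t *out)
{
    size_t n = 0;
    while (val >= 0x80) {
        out[n++] = val & 0x7f;          /* top bit clear: more bytes follow */
        val >>= 7;
    }
    out[n++] = (uint8_t)val | 0x80;     /* top bit set: final byte */
    return n;
}

int ci_decode(const uint8_t *in, size_t avail, uint64_t *val, size_t *used)
{
    uint64_t result = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        result |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if (in[i] & 0x80) {             /* final byte reached */
            *val = result;
            *used = i + 1;
            return 0;
        }
    }
    return -1;                          /* truncated or overlong */
}

So 0 encodes as the single byte 0x80, 127 as 0xff, and 128 as 0x00 0x81.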
The index:
+==========================+==================+===============+
| Chunk checksum type (ci) | Chunk count (ci) | Data checksum |
+==========================+==================+===============+
+===============+==================+===============================+
| Dict checksum | Dict length (ci) | Uncompressed dict length (ci) |
+===============+==================+===============================+
+================+===================+==========================+
| Chunk checksum | Chunk length (ci) | Uncompressed length (ci) | ...
+================+===================+==========================+
Chunk checksum type: An integer containing the type of checksum used to generate the chunk checksums.
Current values: 0 = SHA-1, 1 = SHA-256
Chunk count: The number of chunks in the zchunk file.
Data checksum: The checksum of everything after the index, including the compressed dict and all the compressed chunks. This checksum is generated using the overall checksum type, *not* the chunk checksum type.
Dict checksum: The checksum of the compressed dict, used to detect whether two dicts are identical. If there is no dict, the checksum must be all zeros.
Dict length: An integer containing the length of the compressed dict. If there is no dict, this must be zero.
Uncompressed dict length: An integer containing the length of the dict after it has been decompressed. If there is no dict, this must be zero.
Chunk checksum: The checksum of the compressed chunk, used to detect whether any two chunks are identical.
Chunk length: An integer containing the length of the compressed chunk.
Uncompressed chunk length: An integer containing the length of the chunk after it has been decompressed.
The index is designed so that it can be extracted from the file on the server and downloaded separately, making it possible to download only the parts of the file that are needed, but it must then be re-embedded when assembling the file so the user only needs to keep one file.
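Finally, as a sketch of how a reader might walk the version 4 index (again not the reference implementation; ci_decode() is the compressed-integer reader sketched earlier, and all other names are invented), each chunk's absolute offset is now recovered by accumulating the chunk lengths, since the index stores lengths rather than end offsets:

/* Sketch only: walk the version 4 index and recover each chunk's absolute
 * file offset by accumulating the compressed lengths.  checksum_len is the
 * digest size implied by the chunk checksum type (20 for SHA-1, 32 for
 * SHA-256); data_start is the file offset just past the index, where the
 * compressed dict (if any) begins; p points at the first chunk entry. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

int ci_decode(const uint8_t *in, size_t avail, uint64_t *val, size_t *used);

struct chunk {
    uint64_t offset;        /* absolute offset of the compressed chunk  */
    uint64_t comp_len;      /* compressed length                        */
    uint64_t uncomp_len;    /* uncompressed length                      */
    uint8_t  checksum[32];  /* only the first checksum_len bytes matter */
};

/* Returns the number of chunks parsed, or -1 on malformed input. */
int read_chunk_entries(const uint8_t *p, size_t len, size_t checksum_len,
                       uint64_t dict_len, uint64_t data_start,
                       struct chunk *chunks, size_t max_chunks)
{
    uint64_t offset = data_start + dict_len;   /* chunks follow the dict */
    size_t pos = 0, n = 0, used;

    while (pos < len && n < max_chunks) {
        if (len - pos < checksum_len)
            return -1;
        memcpy(chunks[n].checksum, p + pos, checksum_len);
        pos += checksum_len;

        if (ci_decode(p + pos, len - pos, &chunks[n].comp_len, &used) != 0)
            return -1;
        pos += used;
        if (ci_decode(p + pos, len - pos, &chunks[n].uncomp_len, &used) != 0)
            return -1;
        pos += used;

        chunks[n].offset = offset;
        offset += chunks[n].comp_len;
        n++;
    }
    return (int)n;
}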