Here's version four, which swaps fixed-length integers for variable-length compressed integers. That lets us skip compressing the index, since the remaining non-integer data is all incompressible checksums. I've also added the uncompressed size of each chunk to the index, which makes it easier to work out how much space to allocate for the uncompressed chunk.
+-+-+-+-+-+====================+=================+=======================+
|   ID    | Checksum type (ci) | Header checksum | Compression type (ci) |
+-+-+-+-+-+====================+=================+=======================+

+=================+=======+=================+
| Index size (ci) | Index | Compressed Dict |
+=================+=======+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
(ci) Compressed (unsigned) integer - A variable-length little-endian integer: the low seven bits of the number are stored in the first byte, the next seven bits in the next byte, and so on. The top bit of every byte except the final byte must be zero, and the top bit of the final byte must be one, indicating the end of the number.
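To make the (ci) encoding concrete, here is a minimal Python sketch of an encoder and decoder that follows the description above. The function names `encode_ci` and `decode_ci` are made up for illustration and are not part of any zchunk API.

```python
from typing import Tuple


def encode_ci(value: int) -> bytes:
    """Encode a non-negative integer as a (ci) compressed integer."""
    if value < 0:
        raise ValueError("compressed integers are unsigned")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value == 0:
            out.append(low7 | 0x80)  # final byte: top bit set
            return bytes(out)
        out.append(low7)  # more bytes follow: top bit clear


def decode_ci(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one (ci) integer starting at offset.

    Returns (value, offset of the first byte after the integer).
    """
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift  # least significant bits come first
        shift += 7
        if byte & 0x80:  # top bit set marks the final byte
            return value, offset
```

With this sketch, values below 128 take a single byte (0 encodes as 0x80), and 300 encodes as the two bytes 0x2c 0x82.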
ID
  '\0ZCK1', identifies the file as a zchunk version 1 file.

Checksum type
  This is an integer containing the type of checksum used to generate the header checksum and the total data checksum, but *not* the chunk checksums.

  Current values:
    0 = SHA-1
    1 = SHA-256

Header checksum
  This is the checksum of everything from the beginning of the file to the end of the index, calculated with the header checksum field itself set to all \0's.

Compression type
  This is an integer containing the type of compression used to compress the dict and the chunks.

  Current values:
    0 = Uncompressed
    2 = zstd

Index size
  This is an integer containing the size of the index.

Index
  This is the index, which is described in the next section.

Compressed Dict (optional)
  This is a custom dictionary used when compressing each chunk. Because each chunk is compressed completely separately from the others, the custom dictionary gives us much better overall compression. The custom dictionary is itself compressed without a custom dictionary (for obvious reasons).

Chunk
  This is a chunk of data, compressed with the custom dictionary provided above.
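As a rough illustration of how a reader might walk these fields, here is a Python sketch that parses the start of the file and checks the header checksum. It assumes that the index size is a byte count and that the checksum field is as wide as the digest for the given checksum type (20 bytes for SHA-1, 32 for SHA-256). The names `Lead`, `parse_lead` and `header_is_valid` are hypothetical.

```python
import hashlib
from typing import NamedTuple, Tuple

MAGIC = b"\0ZCK1"
DIGESTS = {0: hashlib.sha1, 1: hashlib.sha256}
COMPRESSION = {0: "uncompressed", 2: "zstd"}


class Lead(NamedTuple):
    checksum_type: int
    checksum_offset: int   # where the header checksum field starts
    header_checksum: bytes
    compression_type: int
    index_offset: int      # where the index starts
    index_size: int        # assumed to be a size in bytes


def read_ci(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one (ci) integer; return (value, next position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, pos


def parse_lead(buf: bytes) -> Lead:
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("not a zchunk file")
    checksum_type, pos = read_ci(buf, len(MAGIC))
    if checksum_type not in DIGESTS:
        raise ValueError("unknown checksum type")
    size = DIGESTS[checksum_type]().digest_size
    checksum_offset = pos
    header_checksum = buf[pos:pos + size]
    if len(header_checksum) != size:
        raise ValueError("truncated header checksum")
    compression_type, pos = read_ci(buf, pos + size)
    if compression_type not in COMPRESSION:
        raise ValueError("unknown compression type")
    index_size, pos = read_ci(buf, pos)
    return Lead(checksum_type, checksum_offset, header_checksum,
                compression_type, pos, index_size)


def header_is_valid(buf: bytes, lead: Lead) -> bool:
    """Recompute the header checksum with its own field zeroed out."""
    end = lead.index_offset + lead.index_size
    if end > len(buf):
        raise ValueError("truncated header")
    start = lead.checksum_offset
    size = len(lead.header_checksum)
    zeroed = buf[:start] + bytes(size) + buf[start + size:end]
    return DIGESTS[lead.checksum_type](zeroed).digest() == lead.header_checksum
```

Everything from `lead.index_offset + lead.index_size` onwards would then be the compressed dict followed by the chunks.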
The index:
+==========================+==================+===============+
| Chunk checksum type (ci) | Chunk count (ci) | Data checksum |
+==========================+==================+===============+

+===============+==================+===============================+
| Dict checksum | Dict length (ci) | Uncompressed dict length (ci) |
+===============+==================+===============================+

+================+===================+==========================+
| Chunk checksum | Chunk length (ci) | Uncompressed length (ci) | ...
+================+===================+==========================+
Chunk checksum type
  This is an integer containing the type of checksum used to generate the chunk checksums.

  Current values:
    0 = SHA-1
    1 = SHA-256

Chunk count
  This is a count of the number of chunks in the zchunk file.

Data checksum
  This is the checksum of everything after the index, including the compressed dict and all the compressed chunks. This checksum is generated using the overall checksum type, *not* the chunk checksum type.

Dict checksum
  This is the checksum of the compressed dict, used to detect whether two dicts are identical. If there is no dict, the checksum must be all zeros.

Dict length
  This is an integer containing the length of the compressed dict. If there is no dict, this must be zero.

Uncompressed dict length
  This is an integer containing the length of the dict after it has been decompressed. If there is no dict, this must be zero.

Chunk checksum
  This is the checksum of the compressed chunk, used to detect whether any two chunks are identical.

Chunk length
  This is an integer containing the length of the compressed chunk.

Uncompressed length
  This is an integer containing the length of the chunk after it has been decompressed.
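The index layout above could be read with something like the following Python sketch. Two points are assumptions on my part rather than something the description spells out: that the dict checksum uses the chunk checksum type, and that the chunk count does not include the dict entry. The names `Entry`, `Index` and `parse_index` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple

DIGEST_SIZES = {0: hashlib.sha1().digest_size, 1: hashlib.sha256().digest_size}


@dataclass
class Entry:
    checksum: bytes
    length: int               # compressed length
    uncompressed_length: int


@dataclass
class Index:
    chunk_checksum_type: int
    data_checksum: bytes
    dict_entry: Entry
    chunks: List[Entry]


def read_ci(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one (ci) integer; return (value, next position)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, pos


def read_entry(buf: bytes, pos: int, digest_size: int) -> Tuple[Entry, int]:
    end = pos + digest_size
    if end > len(buf):
        raise ValueError("truncated checksum")
    length, nxt = read_ci(buf, end)
    uncompressed_length, nxt = read_ci(buf, nxt)
    return Entry(buf[pos:end], length, uncompressed_length), nxt


def parse_index(buf: bytes, overall_checksum_type: int) -> Index:
    chunk_type, pos = read_ci(buf, 0)
    count, pos = read_ci(buf, pos)
    # The data checksum uses the overall checksum type from the file header.
    data_size = DIGEST_SIZES[overall_checksum_type]
    if pos + data_size > len(buf):
        raise ValueError("truncated data checksum")
    data_checksum = buf[pos:pos + data_size]
    pos += data_size
    # Assumption: dict and chunk checksums both use the chunk checksum type.
    chunk_size = DIGEST_SIZES[chunk_type]
    dict_entry, pos = read_entry(buf, pos, chunk_size)
    chunks = []
    for _ in range(count):
        entry, pos = read_entry(buf, pos, chunk_size)
        chunks.append(entry)
    return Index(chunk_type, data_checksum, dict_entry, chunks)
```

Having the uncompressed length in each entry means a reader can size its output buffer up front, for example with `bytearray(entry.uncompressed_length)`, before decompressing the chunk.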
The index is designed so that it can be extracted from the file on the server and downloaded separately, which makes it possible to download only the parts of the file that are needed. It must then be re-embedded when the file is assembled, so that the user only needs to keep one file.
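As a sketch of how a client might use a separately downloaded index, the following Python function works out which byte ranges of the new file still need fetching, given the chunk checksums it already has locally. It assumes that chunks are stored back to back in index order, and that the caller knows the offset of the first chunk (`data_start`, which would be just past the compressed dict). The function name and parameters are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def missing_ranges(
    new_chunks: Iterable[Tuple[bytes, int]],
    have: Set[bytes],
    data_start: int,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges for chunks not held locally.

    new_chunks yields (chunk checksum, compressed chunk length) in index order.
    have is the set of chunk checksums already present in a local file.
    """
    ranges: List[Tuple[int, int]] = []
    offset = data_start
    for checksum, length in new_chunks:
        if length and checksum not in have:
            if ranges and ranges[-1][1] == offset:
                # Adjacent to the previous missing chunk: extend that range.
                ranges[-1] = (ranges[-1][0], offset + length)
            else:
                ranges.append((offset, offset + length))
        offset += length
    return ranges
```

Each resulting range could be requested from the server on its own, with the chunks that are already held locally copied across when the file is assembled.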