The Brotli Compression format

Brotli, named after a swiss bakery product, is Google’s entry into the world of data compression. Like all things Google, it has obtained a great deal of attention in the computer science world, but is it all it is cracked up to be? To test the format, I added a decoder to my project andyzip so that I could learn all its secrets.

Andyzip

https://github.com/andy-thomason/andyzip

Andyzip is a header only C++ library for experiments in data compression. It enables single module builds of codecs such as PNG, RDA and others without the need to use legacy C libraries with associated building and linking pain and can be adapted for very high performance decoding of compressed files. If you want to build million plus line C++ projects on any platform in under a second, you need to do a single module build and for a single module build (only one .cpp file) you need to only use header only libraries.

Recent examples of use have been the Zegami project at Oxford university where andyzip (in its former guise as minizip) was used to decode gigabyte sized libraries of images in seconds. Prior to this you would have to use zlib and libjpeg, both of which are a little rusty, cryptic and unextendable.

Brotli

I used a combination of the brotli reference codec on github https://github.com/google/brotli and the new RFC document https://tools.ietf.org/html/rfc7932 to build an test my decoder. I hasten to add that it would probably take a few more weeks to verify my decoder, so don’t go off and use it on mission critical code as it is highly unlikely that I’ll get the time to do the required verification.

Brotli combines ideas from ZIP, BZIP2 and LZMA but mostly is an improved LZ77 coding scheme like ZIP. Many things are lifted straight from ZIP. The two stage huffman table encoding scheme where huffman code lengths are compressed by a huffman code is straight from ZIP as is the pesky little-endian bit coding sceme that requires a bit reverse or a table look up.

Other aspects of Brotli come from other coding schemes for example the ability to switch huffman tables (BZIP2 et al.) and distance history (LZMA).

One of the downsides of Brotli is that it uses Huffman codes which work well with entropic (random) data but poorly with small alphabets such as genetic data. The reason for this is that Huffman code lengths are rounded to integers unlike the arithmetic coding scheme of BZIP and LZMA which can have fractional length codes. So a code of five characters like ACTGN for example needs to round up from 2 bits to 3 bits wasting 50% of the potential size.

To mitigate some of the downsides of Huffman codes they combine several codes into a “command” code which incorporates a literal count, a copy distance and a copy length. This means that on average the codes are longer that 2 or 3 bits so the effect of rounding up the bitlengths becomes less: rounding up from 10 to 11 is only 10% loss for example.

The scheme is encoded as:

<command code><lit>...<lit><distance>

<command code><lit>....<lit><distance>

Where <lit> represents a literal such as ‘A’ or ‘+’ and <distance> represents the distance we must go back to copy some text. For example:

hello hello

Is represented as:

<6 lits, copy_len=5>'h' 'e' 'l' 'l' 'o' ' '<distance=6>

The first six chars are put in as literals and the second hello is copied.

To squeeze a tiny amount of more compression out of the format, the huffman tables for the command, literal and distance codes are changed from time to time using block type codes. Each block type, combined with a context defines which table is used.

For example

<6 lits, copy_len=5><block type=2, len=2>'h' 'e' <block type=3, len=4> 'l' 'l' 'o' ' '<distance=6>

Switches the literal block type to 2 and 3 respectively for 2 and 4 chars. In practice the block switch commands, which can work for commands and distances as well, only happen rarely.

The contexts are derived from the two previous characters in the case of literals and from the copy length in the case of distances. A context map enables the re-use of huffman tables.

Like LZMA, Brotli also has a longer window size that ZIP which enables longer backwards lookups and like some other compression formats has a pre-loaded dictionary for common literal sequences such as HTML tags and javascript snipp.ets. Like LZMA it can use very short (zero even) copy lengths which allows sequences like:

ENSG000123884, ENSG000124885, ENSG000125886,

to be encoded efficiently.

The dark side

So very good then, Brotli is a huge improvement on ZIP which is its spiritual successor but how does it match up to other schemes such as BZIP, BZIP2 and LZMA?

Its decode speed is pretty fast, which is better than LZMA, but the more complex algorithm will inevitably mean worse encoding performance. In compression terms it is quite far behind BZIP2 for text files. BZIP2/BZIP use the Burrows-Wheeler transform to re-permute the characters in the file and is hard to beat for text files.

Most importantly, like ZIP and BZIP2 the metablocks are not skippable as there is no uncompressed length encoded in the blocks and this means that it can be decoded only on one thread. This will seriously hamper its performance as we move on out of the single threaded code world. In the modern world we don’t use serial file reads, we map huge files into memory with mmap and crunch them in parallel so if you have a 32 core machine, which you will in a year or two, Brotli will be 32 times slower than parallel formats. However, there are no such formats in the public domain at present that I know of. It would be easy to make one but very hard to get one accepted, especially in the Linux world which is resistant to multi-threading.

So all in all, well done Brotli in improving on ZIP, but it would have been nice to lose the huffman codes, use the BWT and allow parallel decoding.