How do compression formats work?
Zip and LZMA (7-Zip) encode data in much the same way; the main difference is that LZMA uses arithmetic coding (or range coding) rather than Huffman coding. Both use LZ77, an encoding scheme that uses distance-length pairs to reference sequences earlier in the stream.
For example, if we wanted to encode:
to be or not to be.
we could instead send:
to be or not [13, 5].
Here the five characters (“to be”) that occurred thirteen bytes earlier are stored as a single symbol.
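A minimal sketch of how an LZ77 decoder expands such a stream (counting the distance back from the current output position, so the earlier “to be” sits 13 bytes back here):

```python
# Toy LZ77 decoder: a stream is a mix of literal characters and
# (distance, length) back-references into the output produced so far.
def lz77_decode(stream):
    out = []
    for token in stream:
        if isinstance(token, tuple):          # back-reference
            distance, length = token
            start = len(out) - distance
            for i in range(length):           # copy may overlap itself
                out.append(out[start + i])
        else:                                 # literal character
            out.append(token)
    return "".join(out)

# "to be or not " as literals, then a (13, 5) reference covering "to be".
tokens = list("to be or not ") + [(13, 5), "."]
print(lz77_decode(tokens))  # -> to be or not to be.
```

Note the copy loop proceeds byte by byte, which is what lets a reference overlap its own output (a distance of 1 with length 3 turns one ‘a’ into “aaaa”), a trick real LZ77 codecs rely on for run-length encoding.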
We assign different numbers of bits to each symbol we send, so the letter ‘e’, which occurs frequently, gets fewer bits, as in Morse code where it gets just a dot. Symbols which occur less frequently get more bits.
Zip always assigns a whole number of bits to each symbol; this is Huffman coding. It is also used in JPEG files and many other specialist coding formats.
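A sketch of the classic Huffman construction: repeatedly merge the two least frequent nodes, then read each symbol’s whole-number bit code off the tree.

```python
import heapq
from collections import Counter

# Build a Huffman code: merge the two least-frequent nodes repeatedly,
# then read each symbol's code off the tree as a bit string.
def huffman_code(text):
    freq = Counter(text)
    # heap entries: (frequency, tiebreak, tree); tree is a symbol or a pair
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_code("to be or not to be.")
# frequent symbols (space, 'o') never get longer codes than rare ones ('.')
assert len(codes[" "]) <= len(codes["."])
```

Because no code is a prefix of another, the decoder can walk the same tree bit by bit without any separators between symbols.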
LZMA is different: it can use a fractional number of bits to represent a code by using a binary fraction. This technique was known for many years, but because of patent problems no one was able to use it in free software applications. Igor Pavlov circumvented the patent by using a slightly less efficient variation called range coding, which was published before the arithmetic coding patent. In practice the two are pretty much identical, as is using larger Huffman codes to represent multiple symbols. LZMA also introduces some extra codes similar to “move to front” coding (more about MTF coding later) and uses an adaptive statistical model for the coding.
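To see where the fractional bits come from, here is a toy arithmetic coder using exact rationals: the whole message becomes one number inside a nested interval, so a symbol can cost less than a bit. The two-symbol model is an assumption for illustration; real range coders use fixed-precision integers and renormalisation rather than unbounded fractions.

```python
from fractions import Fraction as F

PROBS = {"a": F(3, 4), "b": F(1, 4)}   # assumed model: 'a' is 3x as likely

def cumulative(probs):
    cum, lo = {}, F(0)
    for sym, p in probs.items():
        cum[sym] = (lo, lo + p)
        lo += p
    return cum

def encode(msg):
    # narrow [lo, lo+width) by each symbol's slice of the current interval
    cum, lo, width = cumulative(PROBS), F(0), F(1)
    for sym in msg:
        a, b = cum[sym]
        lo, width = lo + width * a, width * (b - a)
    return lo, width   # any number in [lo, lo + width) identifies msg

def decode(lo, n):
    cum, out = cumulative(PROBS), []
    for _ in range(n):
        for sym, (a, b) in cum.items():
            if a <= lo < b:
                out.append(sym)
                lo = (lo - a) / (b - a)   # rescale and continue
                break
    return "".join(out)

lo, width = encode("aabab")
assert decode(lo, 5) == "aabab"
# width = (3/4)^3 * (1/4)^2, so about -log2(width) ~ 5.25 bits for 5 symbols
```

The interval width is the product of the symbol probabilities, so the message needs about -log2(width) bits, just over one bit per symbol here; Huffman coding would be forced to spend a whole bit on every ‘a’.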
The odd man out is Bzip2, which uses the Burrows-Wheeler transform to sort all the substrings in the file into alphabetical order. So ‘hello’ becomes ‘ello’, ‘hello’, ‘llo’, ‘lo’, ‘o’, for example. Through a cunning property of this sorted list, we only need to store the character that comes before each substring to reconstruct the original data and the locations of all the substrings. Smart! Because these preceding characters are highly correlated, they are easy to compress using the MTF coding we touched on earlier. MTF predicts that the next character will be one that has occurred recently and, when it is right, uses fewer bits to encode it.
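A toy version of both steps (sorting full rotations rather than Bzip2’s actual suffix machinery, with ‘\0’ as an assumed end-of-string sentinel):

```python
# Toy Burrows-Wheeler transform: sort all rotations of the text and keep
# the last column, i.e. the character preceding each sorted substring.
def bwt(s):
    s += "\0"   # sentinel assumed smaller than any real character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

# Move-to-front: emit each character's position in a recency list,
# then move it to the front, so recently seen characters become small numbers.
def mtf(data):
    alphabet = sorted(set(data))
    out = []
    for c in data:
        i = alphabet.index(c)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

transformed = bwt("banana")   # -> "annb\0aa": like letters cluster together
codes = mtf(transformed)      # runs of a repeated character come out as 0s
```

The BWT output groups the predecessors of similar substrings, so MTF sees long runs of the same character and turns them into streams of zeros, which a Huffman or range coder then squeezes down very effectively.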
No one of these is necessarily better than the others. They are all quite naive schemes: they do not understand anything about the data they encode and have very simple statistical models.
We can enhance all of these in simple ways by knowing (or guessing) a little about the data we are compressing.
An example is the PNG image format, which uses Zip to encode the RGB bytes that represent the colors of the image pixels. The problem is that if we just Zipped the raw RGB bytes we would not get very good compression, because Zip, LZMA and Bzip2 all rely on sequences being repeated in the data stream, and the exact same run of colors rarely occurs twice in an image.
To improve this, the PNG file format encodes the difference between the current color and the ones above and to the left. We know from images that the color only changes gradually, so we get very small numbers if we look at, say, the change in the red color component. Small numbers are good because they occur more frequently in the data, and when we Zip the result we see an improvement in the data compression rate.
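The effect is easy to demonstrate on a one-dimensional signal. The smooth data below is a synthetic random walk, an assumption standing in for one color channel of a photograph; the delta filter stores each sample as the difference from its predecessor, modulo 256, as PNG does.

```python
import random
import zlib

# A smooth, slowly varying signal (synthetic stand-in for an image channel).
random.seed(1)
samples, value = [], 128
for _ in range(4096):
    value = max(0, min(255, value + random.randint(-2, 2)))
    samples.append(value)
raw = bytes(samples)

# PNG-style filter: store the difference from the previous sample (mod 256).
deltas = bytes((samples[i] - samples[i - 1]) % 256 if i else samples[0]
               for i in range(len(samples)))

# The deltas use only a handful of byte values, so Deflate does much better.
print(len(zlib.compress(raw)), len(zlib.compress(deltas)))
```

The raw walk wanders over a wide range of byte values with few repeats, while the filtered stream contains only values near 0 and 255, exactly the skewed distribution that Huffman coding rewards.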
If this seems like magic, it is not: the number of bits needed represents the gap between our model of the data (in this case our understanding of images) and the reality of the data. If we can predict what the data is going to look like, we can encode it in fewer bits. For example, suppose we encode a picture of a cross as a ‘1’, a picture of a circle as a ‘2’ and anything else as raw RGB pixels; then if we see the cross we can send the whole picture as a single byte.
Bioinformatics data is screaming out for a better compression method than GZip, BZip2 or LZMA.
For example, if I take an IDAT file from an Illumina machine and GZip the data, I only get it down to 60% of its original size. This may seem good, but the data is highly regular, so with methods like the PNG technique we could do a lot better.
In the IDAT file, there are a number of arrays of four byte and two byte numbers. The four byte numbers represent the IDs of the samples and the two byte numbers represent the interesting quantities.
The IDs are easy to predict as they form a monotonic sequence: the numbers steadily increase. A simple method is to use differencing, as in PNG, and move-to-front, as in BZip2, to predict the next number in the sequence. This technique could net us a considerable benefit when compressing these kinds of files.
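A sketch of the differencing idea on a monotonic array of 4-byte IDs (the values here are hypothetical, shaped like the IDAT ID arrays):

```python
import struct
import zlib

# Hypothetical steadily increasing 4-byte sample IDs.
ids = [1_000_000 + 3 * i + (i % 2) for i in range(50_000)]

# Pack the raw IDs as little-endian 32-bit integers.
raw = struct.pack(f"<{len(ids)}I", *ids)

# Delta-code first: store the first ID, then successive differences,
# which are tiny and highly repetitive.
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
diffed = struct.pack(f"<{len(deltas)}I", *deltas)

print(len(zlib.compress(raw)), len(zlib.compress(diffed)))
```

The raw integers keep changing in their low bytes, so Deflate finds few long matches; the delta stream is almost entirely the same few small values repeated, which collapses to next to nothing.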
Many bioinformatics files, such as VCF files, contain a vast amount of repetition of this kind and monotonic sequences galore. A Zip codec that coded monotonic sequences in ASCII and binary would be a hit here.
A bigger problem is posed by data files like SAM, BAM, FASTA and FASTQ files. BAM files are especially gnarly as they have already been compressed with GZip, which we know is not ideal.
SAM and BAM files (aligned output from the sequencer) contain monotonic sequences, but the largest amount of entropy is in the “Phred” string, which represents the quality of the input signal, and this single datum is very hard to code. There is a format being developed called “CRAM”, but even on a cursory inspection it could be improved considerably.
FASTA files, which hold genome reference data, compress very well under schemes similar to BZip2 as they contain many repeating sequences. BZip2 itself is not ideal: it has a window of only 900 kB and uses Huffman coding, which is poor for two-bit data (ACGT), but it is easy to devise a better coding scheme.
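The obvious baseline that byte-oriented Huffman coding cannot reach is simply packing four bases into each byte; a minimal sketch:

```python
# Pack an ACGT sequence at two bits per base, four bases per byte.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODE[base] << (2 * j)   # two bits per base
        out.append(byte)
    return bytes(out)

def unpack(data, n):
    # n tells us the original length, since the last byte may be partial
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

seq = "GATTACAGATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
assert len(packed) == (len(seq) + 3) // 4   # 4x smaller before any coding
```

A real codec would layer LZ matching or context modelling on top of the packed stream; this just shows the floor that whole-byte Huffman codes cannot get under.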
FASTQ files represent unaligned data and also contain the pesky Phred strings. I have done some statistical analysis of the Phred data and have found many common features that could be exploited to compress it better. Also, if we were willing to treat the Phred data like JPEG data and allow a few small errors, we could compress it much better still. One important thing to note about FASTQ is that the order of the samples is not important. We can exploit this freedom to, for example, sort the DNA strings into order, which also saves a lot of work for the aligner. The strings then become highly correlated and compress much better.
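The sorting effect is easy to simulate. Duplicate reads scattered through a large file often sit outside Deflate’s 32 kB window; sorting brings identical and similar strings next to each other. The reads below are synthetic random DNA with artificial duplicates, an assumption standing in for real sequencing output.

```python
import random
import zlib

# Synthetic reads: 500 distinct 100-base strings, each appearing ~4 times
# in a shuffled order, mimicking duplicate reads in a FASTQ file.
random.seed(2)
distinct = ["".join(random.choice("ACGT") for _ in range(100))
            for _ in range(500)]
reads = [random.choice(distinct) for _ in range(2000)]

unsorted_blob = "\n".join(reads).encode()
sorted_blob = "\n".join(sorted(reads)).encode()   # duplicates now adjacent

print(len(zlib.compress(unsorted_blob)), len(zlib.compress(sorted_blob)))
```

In the shuffled file a read’s twin is usually further away than the LZ window can see, so it is coded as fresh literals; after sorting, each repeat is a short back-reference to the line above.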
Bioinformatics files compress badly under existing compression tools, and it is easy to write better tools that exploit the obvious statistical trends in the data. These tools could be generic rather than limited to particular formats: a single compression tool that recognised common themes in the data would be much better than GZip or other generic tools.
Any new compression format should be designed with multithreading in mind, so that we can seek into the data quickly and compress/decompress on many threads. Random access and indexing should be automatic features of any such tool. For example a VCF or GTF file, which is essentially a giant CSV spreadsheet, can be indexed and compressed at the same time. Python and R plugins could then be written to run database queries on all these formats.
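A minimal sketch of the seekable layout, assuming fixed-size input blocks compressed independently with an offset index (real formats such as BGZF work along these lines, with framing and checksums on top):

```python
import zlib

BLOCK = 64 * 1024   # assumed uncompressed block size

def compress_blocked(data):
    # Compress each block independently and record where it starts,
    # so any block can be decompressed (or handed to a thread) on its own.
    blocks, index, offset = [], [], 0
    for i in range(0, len(data), BLOCK):
        c = zlib.compress(data[i:i + BLOCK])
        blocks.append(c)
        index.append(offset)
        offset += len(c)
    return b"".join(blocks), index

def read_block(blob, index, k):
    # Random access: slice out block k using the index, decompress just it.
    end = index[k + 1] if k + 1 < len(index) else len(blob)
    return zlib.decompress(blob[index[k]:end])

data = b"".join(b"record %08d\n" % i for i in range(100_000))
blob, index = compress_blocked(data)
assert read_block(blob, index, 3) == data[3 * BLOCK:4 * BLOCK]
```

Because every block is self-contained, compression and decompression parallelise trivially across blocks, and a row-to-block index for a VCF or GTF file drops out of the same bookkeeping.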