Can we do better than gzip, bzip2 or lzma for bioinformatics data?

How do compression formats work?

Zip and LZMA (7-Zip) encode data in much the same way; the main difference is that LZMA uses arithmetic coding (or rather, range coding) instead of Huffman coding. Both use LZ77, an encoding scheme that replaces repeated sequences with distance-length pairs referring back to earlier parts of the stream.

For example, if we wanted to encode:

to be or not to be.

We could instead send:

to be or not [13, 5].

Here [13, 5] means "copy the five characters starting thirteen bytes back", so the second "to be" is stored as a single symbol.

We assign different numbers of bits to each symbol we send, so the letter ‘e’, which occurs frequently, gets fewer bits, just as in Morse code where it is a single dot. Symbols which occur less frequently get more bits.

Zip always uses a whole number of bits for each symbol; this is Huffman coding, which is also used in JPEG files and many other specialist coding formats.

LZMA is different: it can use a fractional number of bits to represent a symbol by encoding a binary fraction. This technique, arithmetic coding, was known for many years, but patent problems prevented its use in free software applications. Igor Pavlov circumvented the patents by using a slightly less efficient variation called range coding, which was published before the arithmetic coding patent. In practice the two are almost identical, as is using larger Huffman codes to represent multiple symbols. LZMA also introduces some extra codes similar to “move to front” (MTF) coding, of which more later, and uses an adaptive statistical model to drive the coder.

The odd man out is bzip2, which uses the Burrows-Wheeler transform to sort all the suffixes of a block into alphabetical order: ‘hello’ becomes ‘ello’, ‘hello’, ‘llo’, ‘lo’, ‘o’, for example. Through a cunning property of this sorted list, we need only store the character that comes before each suffix to reconstruct both the original sorted list and the locations of all the suffixes. Smart! Because these preceding characters are highly correlated, they are easy to compress using the MTF coding we touched on earlier, which predicts that the next character will be one that occurred recently and, when it is right, uses fewer bits to encode it.

Doing better

No one of these is necessarily better than the others. They are all quite naive schemes: they understand nothing about the data they encode and have very simple statistical models.

We can enhance all of these in simple ways by knowing (or guessing) a little about the data we are compressing.

An example lies in the PNG image format, which uses the same Deflate compression as Zip to encode the RGB bytes representing the color of each pixel. The problem is that if we just zipped the raw RGB bytes we would not get very good compression, because Zip, LZMA and bzip2 all rely on data repeating in the stream, and the same color almost never occurs twice in an image.

To improve this, the PNG file format stores the difference between the current color and the ones above and to the left. In most images the color changes only gradually, so if we look at, say, the change in the red component, we get very small numbers. Small numbers are good because they occur more frequently in the data, and when we zip the result we see a clear improvement in the compression rate.

If this seems like magic, it is not: the number of bits needed represents the difference between our model of the data, in this case our understanding of images, and the reality of the data. If we can predict what the data is going to look like, we can encode it in fewer bits. For example, suppose we encode a picture of a cross as ‘1’, a picture of a circle as ‘2’ and anything else as raw RGB pixels; then whenever we see the cross, we can send the whole picture as a single byte.

Bioinformatics data

Bioinformatics data is screaming out for a better compression method than GZip, BZip2 or LZMA.

For example, if I take an IDAT file from an Illumina machine and gzip the data, I only get it down to 60% of its original size. This may seem good, but the data is highly regular, so with methods like the PNG technique we could do a lot better.

In the IDAT file, there are a number of arrays of four byte and two byte numbers. The four byte numbers represent the IDs of the samples and the two byte numbers represent the interesting quantities.

The IDs are easy to predict as they form a monotonic sequence: the numbers steadily increase. A simple method is to use differencing, as in PNG, and move-to-front, as in bzip2, to predict the next number in the sequence. This technique could net us a considerable benefit when compressing these kinds of file.

Many bioinformatics files, such as VCF files, contain a vast amount of repetition of this kind, and monotonic sequences galore. A Zip-style codec that coded monotonic sequences in ASCII and binary would be a hit here.

A bigger problem is posed by data files like SAM, BAM, FASTA and FASTQ. BAM files are especially gnarly as they have already been compressed with gzip, which we know is not ideal.

SAM and BAM files (aligned output from the sequencer) contain monotonic sequences, but the largest amount of entropy is in the “Phred” string, which represents the quality of the input signal; this single datum is very hard to code. A format called “CRAM” is being developed, but even on cursory inspection it too could be improved considerably.

FASTA files, which represent genome reference data, compress very well under schemes similar to bzip2 as they contain many repeated sequences. bzip2 itself is not ideal, as it has a block size of only 900k and uses Huffman coding, which is poor for two-bit data (ACGT), but it is easy to devise a better coding scheme.

FASTQ files represent unaligned data and also contain the pesky Phred strings. I have done some statistical analysis of Phred data and found many common features that could be exploited to compress it better. If we were also willing to treat the Phred data like JPEG data and allow a few small errors, we could compress it much better still. One important thing to note about FASTQ is that the order of the reads is not significant. We can exploit this freedom to, for example, sort the DNA strings into order, saving the aligner a lot of work; the strings then become highly correlated and compress much better.

The future

Bioinformatics files compress badly under existing compression tools, and it is easy to write better tools that exploit the obvious statistical trends in the data. These tools could be generic rather than limited to particular datasets: a single compression tool that recognised recurring themes in the data would be much better than gzip or the other generic tools.

Any new compression format should be designed with multithreading in mind, so that we can seek into the data quickly and compress or decompress on many threads. Random access and indexing should be automatic features of any such tool. For example, a VCF or GTF file, which is essentially a giant CSV spreadsheet, can be indexed and compressed at the same time. Python and R plugins could then run database queries against all these formats.

The R data format: Rdata, rda and rds

The R programming language

R is a language for statisticians that performs operations using vectors of numbers. It has its origins in S, another statistical language, and has hints of LISP about it.

At Atomic Increment, we are often required to perform operations on data supplied in R’s native binary format RDA and occasionally in RDS, the serialisation of a single R object. It is the most common bioinformatics file format thanks to Bioconductor, the suite of R packages for biological data.

The only real documentation for these formats is from the source code:

https://github.com/wch/r-source/blob/trunk/src/main/serialize.c

Because R is an academic project, the documentation and code quality are not up to commercial standards, being an eclectic mix of 1980s-style C code and Fortran that will only build with one compiler on one platform. Writing from Hong Kong today, it is a bit like the Chungking Mansions of open source projects. No rudeness intended to R aficionados: it works and is popular, just don’t look inside.

The data format comes in four flavours:

  • Ascii
  • Ascii Hex
  • Binary (Platform endian dependent)
  • XDR binary (Big endian)

The last of these is the most common and the one we will focus on here.

XDR is just shorthand for big endian binary and is documented here: https://tools.ietf.org/html/rfc1014

The R XDR data format

RDA data is usually either gzip or bzip2 encoded, so you will need to unpack it to a buffer before decoding; I suggest using Minizip, my compression library. The RDA format starts with “RDX2\n” followed by “X\n”; the RDS format has only the “X\n”. The ‘X’ here indicates an XDR-coded R file.

Next come three int32 values: the serialisation format version, the version of R that wrote the file, and the minimal version of R needed to read it.

Next we have a single R object which may contain other R objects.

Each object starts with an int32 which encodes the flags for that object.

  • Type: low 8 bits
  • Is Object: 1 bit
  • Has Attribute: 1 bit
  • Has Tag: 1 bit
  • Levels: top 20 bits

The Type determines the type of the R object. See Rinternals.h for details.

If the Type is a LISP-style object (LISTSXP, LANGSXP, CLOSXP, PROMSXP, DOTSXP), then “Has Attribute” and “Has Tag” indicate that we must read additional objects from the stream.

It is not clear from any documentation what “object” and “levels” mean.

The other types we are interested in are the data types; environments and other constructs are outside the scope of this document.

The CHARSXP, LGLSXP, INTSXP, REALSXP and CPLXSXP data types contain vectors of up to 2^31-1 elements of char, int, int, double and pairs of doubles respectively.

The SYMSXP type is followed by another item, usually a string.

VECSXP and EXPRSXP are vectors of up to 2^31-1 general objects.

For details of all of these, read the source code of R_Unserialize() in serialize.c.

I will probably make a modern C++ library for reading R data in the near future, other commitments notwithstanding, so watch this space. It may be some time before I fully understand the format, but I can add to this post if popular demand requires it.


C++ multithreading – an effective parallel for loop.

C++ multithreading, a background.

C++11, 14 and 17 give us an effective multithreading library that enables us to use all the cores of our CPU, but at a price.

The library allows us to create threads, a heavyweight concurrency resource, and asyncs, lightweight tasks that run on a pool of threads to avoid the very high cost of thread creation, typically a million cycles or more.

In this example we shall be using asyncs which play nicely with competing threads.

In the past we would have used OpenMP or pthreads to implement parallel constructs but these are looking a little tired now and we want to stride boldly forward to a lightweight parallel future.

On the GNU compiler, the C++ threads library just wraps pthreads, so you have to be a little careful, but other libraries are implemented in modern C++ (ie. .hpp instead of .cpp files), which avoids the call overheads, long build times and link problems that plague 90s-style C++ and C code. With modern C++ and a single-module build, we can expect sub-second build times even on some quite large projects, and substantially better code generation.

One of the prices we pay for multithreading is the care we have to take over memory access. We cannot generally share objects between threads unless they have been specially written to handle multithreaded access. This is especially true if one thread writes an object while another reads it, or if two threads write to an object at the same time. We can, however, safely share objects between threads for reading.

Thread safe C++ code, an example.

class myclass {
public:
    // A const method is generally safe to call concurrently,
    // provided the class is thread safe and provided it does
    // not alter data members indirectly (e.g. via mutable).
    int method1() const {
        return var_;
    }

    // A non-const method is dangerous: two threads calling
    // it at the same time can write to the same member.
    void method2() {
        // We need an atomic variable (std::atomic) or a
        // mutex lock (std::mutex, std::unique_lock) here for
        // safety.
        var_++;
    }
private:
    // In modern C++, all data members should be private,
    // otherwise it is impossible to guarantee data safety.
    // We also use the trailing underscore convention to avoid
    // clutter.
    int var_ = 0;
};

The example above shows the problems with data safety. Failure to observe these rules carefully will result in race conditions, where two threads write at the same time and the result is effectively random, depending on which one gets there first.

Atomics and Mutexes

std::mutex provides a simple mechanism to allow only one thread at a time to access the data of a class. We add a std::mutex member to our class and then use std::unique_lock to take a lock on it. On some compilers this is very efficient, and I would recommend it as the first resort in all classes that need protection from concurrency.

std::atomic provides a mechanism to concurrently modify a single variable, for example the atomic increment used in reference counting; std::shared_ptr uses this mechanism to protect an object’s reference count from concurrent access. std::atomic also supports, through the C++ memory model, a raft of behaviour to make sure that loads and stores are properly ordered when passing data between threads. std::mutex is generally a lot safer to use, and should be upgraded to std::atomic only in carefully selected areas where performance is likely to suffer.

In the Java world, multithreaded methods are made safe using the synchronized keyword. This causes extra code to be generated on the entry and exit of a function to enable only one thread to access the object at one time. The code generated is quite complex but can avoid the use of mutexes in most circumstances.

The cost of atomic variables varies quite a lot between platforms. On the PS3, for example, an atomic increment cost around 35-70 cycles, and mobile platforms are similar. On the x86 architecture it tends to be a lot higher, at around 400 cycles depending on the microarchitecture. x86 also strictly orders all its loads and stores in the scoreboarding stages and in L2 cache, and this has an overhead. It does make the architecture safer to use, however, and allows less skilled coders to get away with many things that would not be possible on mobile devices.

You can also write code to test statistically whether race conditions are likely to occur. Some tools exist which do this automatically, but I prefer to add my own checks for potential race conditions, for example by setting a regular member variable to the current thread ID and checking that two threads never enter non-const methods simultaneously. In games, the overhead of many GNU-style tools such as gprof is likely to be too high to test the game effectively. I have not tested ThreadSanitizer in the LLVM suite, so answers on a postcard please.

The lambda

The great thing about modern C++ is the lambda function which enables us to embed mini-functions in our code that capture variables in the current scope automagically.

Lambdas are super-simple to generate. Note the semicolon after the closing } as the {} is part of an expression.

int a = 1;
auto mylambda = [a](int b) { return a + b; };

We can then call this lambda as a function. (actually it is a class with a call method).

int x = mylambda(5);

This takes the value of a at the time the lambda was created (1) and adds it to the parameter b (5) giving x = 6.

In this case we capture the variable a by value, but by using [&a] we can capture it by reference, so if we do:

int a = 1;
auto mylambda = [&a](int b) { return a + b; };
a = 2;
int x = mylambda(5);

We will get x = 2 + 5 = 7 because a has changed by the time the lambda is called.

The parallel for loop

Finally, the promised item, the parallel for loop.

This example is from the Gilgamesh molecules example.

template <class F>
void par_for(int begin, int end, F fn) {
  std::atomic<int> idx;
  idx = begin;

  int num_cpus = std::thread::hardware_concurrency();
  std::vector<std::future<void>> futures(num_cpus);
  for (int cpu = 0; cpu != num_cpus; ++cpu) {
    futures[cpu] = std::async(
      std::launch::async,
      [cpu, &idx, end, &fn]() {
        for (;;) {
          int i = idx++;
          if (i >= end) break;
          fn(i, cpu);
        }
      }
    );
  }
  for (int cpu = 0; cpu != num_cpus; ++cpu) {
    futures[cpu].get();
  }
}

This uses a number of constructs from the C++11 threads library. std::atomic, which we described earlier, means that the variable idx is treated atomically, so the expression idx++ uses special instructions to make sure that each thread gets a different, sequential value. Some threads will get values above end, but as long as we do not overflow idx, those threads simply terminate their async. std::async we have discussed before as the weapon of choice for running concurrent code; we generally avoid std::thread as it has much larger setup costs, especially if we run our for loop more than once. std::thread::hardware_concurrency() returns the number of concurrency resources your CPU has, which is not the same as the number of cores: on my AMD FX it returns 8, the number of genuine cores, while on an Intel i7 it also returns 8 even though there are only four cores, because the two “hyperthreads” on each core each count as a concurrency resource. Finally, std::future<void> is used to synchronise with the asyncs through its get() function.

The result is a replacement for for() that we can use all over our code, provided we observe the rules above.

 par_for(
    10, 20,
    [](int idx, int cpu) {
       printf("task %d running on cpu %d\n", idx, cpu);
    }
  );

Note that we use printf and not std::cout in parallel code. iostreams gets a bit confused because its mutexes are around each element separated by <<, whereas in printf the mutex is around the whole call, which keeps the lines intact. Game programmers often avoid out-of-the-box iostreams for this reason.

This will give some kind of result like:

task 11 running on cpu 0
task 10 running on cpu 1
...
task 19 running on cpu 7
task 18 running on cpu 6

Note that there is no guarantee of the order in which the tasks run, nor of which CPU each runs on.

Now if our tasks operate on separate parts of an array, for example, then no synchronisation primitives such as mutexes are needed in the lambda. We can also do some basic things like new and delete, because these are wrapped in mutexes internally, although I would recommend that no C++ programmer uses these outside of container libraries.

Also, any variables on our stack are perfectly safe to use provided they are created inside the lambda. A common mistake is to capture a std::vector into the lambda by accident and push to it! This causes considerable headaches, I can tell you, so avoid the catch-all [&] form of the lambda for sanity’s sake. If you do want to push to a common vector, put a mutex around the push and push, say, a thousand items at a time for performance, or store the values by index into an already-sized array. Remember to reference your mutex in the lambda; don’t create several independent mutexes by mistake!

An extension to the parallel for is to create a vector of “contexts” for each CPU and pass these to the lambda. The contexts will be accessible to only one async and therefore do not need mutex protection. An example is to create an object pool in the context so that allocation can be done without a mutex.

In the distant past we used to make local copies of objects on the stack to prevent accidental writes. This is still quite valid today, as creating copies, especially of aligned classes, can be very inexpensive. We can use move semantics to our benefit here, moving objects into the thread context, modifying them safely and moving them back to general storage.

We can also use thread local storage (TLS), although this is quite difficult if your program uses shared objects as each shared library needs its own TLS. I would strongly recommend using thread contexts instead.

If you have any questions, feel free to comment, or come on one of my training sessions in the City or elsewhere.


How to build a sphere – it’s more difficult than you think!

Geodesic sphere from wikipedia

Gilgamesh

In my new geometry library Gilgamesh I have started with some real-world examples from structural biology, namely building solvent-excluded and ball-and-stick representations of molecules. For the ball-and-stick models we need good low-polygon spheres, as big complexes may have hundreds of thousands of atoms.

The sphere class allows you to add transformed and coloured spheres to a mesh. For examples, see basic_mesh and molecules. You can also supply your own lambda function to generate distorted versions of the sphere, for example for building models of moons.

https://github.com/andy-thomason/gilgamesh/blob/master/include/gilgamesh/shapes/sphere.hpp

Spheres have always been an interesting challenge to get right in a geometry library, with common problems including unevenly sized triangles and erratic texture mapping.

In the Octet OpenGL framework that I developed for my games students I had several geometric primitives such as spheres and cones. I wanted to present these more formally, with a well-balanced set of primitives represented by their own classes, so that geometric operations such as ray tracing and constructive solid geometry could be done without reference to the mesh class.

Many 3D editing packages offer a choice of Mercator-style spheres and Geospheres.

Mercator spheres divide the sphere into lines of latitude and longitude using the sine and cosine of these two angles, like a Mercator projection on a globe. The problem with these is that they use more triangles towards the poles and cause problems if we want to map textures to the sphere.

Geospheres generally look nicer with smaller numbers of triangles. They use triangles that are approximately equilateral in a Buckminster Fuller style.

To construct a geosphere, it is possible to inflate a near-spherical primitive such as a dodecahedron and add extra triangles to increase curvature; this is what I do in the Octet library. Every triangle in the primitive is split into four or more new triangles, with each new vertex pushed out to sit on the sphere. The problem with this is that the triangles become uneven, and we need the coordinates of the dodecahedron to start with. There is an excellent dodecahedron Wikipedia article that can help with this, however.

After toying with several methods of generating equally spaced triangles on a sphere I looked at the simple method of adapting the Mercator sphere to have roughly equilateral triangles.

First we choose the number of latitude subdivisions and divide Pi by this to get our triangle edge length. Now, for each ring as we go down from the north pole, we generate n vertices, where n is the number of edge lengths that fit around that ring, i.e. floor(2*Pi*r/length).

This gives us our vertices; now how do we generate triangles? Each strip has a vertices on its top ring and b vertices on its bottom ring, and these numbers are usually different. We walk along the strip, adding a triangle to whichever edge has the lower angle value. This algorithm selects roughly equilateral triangles at every step and fulfils our requirements. One side effect is that we always get six triangles at the top and bottom of the sphere, forming a hexagon.

The next task is how to UV map the sphere. We need to be careful in three places: at the two poles and at the meridian, the zero angle joining the poles. At a pole, choosing a single UV coordinate, say (0, 0), will distort the map, so we may choose to duplicate the pole vertex seven times and assign coordinates (0, 0), (1/6, 0) .. (1, 0). Why seven and not six? This is related to the meridian mapping problem. We need the final triangle to go from ((n-1)/n, v) to (1, v) so that the texture wraps correctly at the seam, so we must duplicate one vertex at the meridian as well. If you don’t understand this, try it yourself!

On unmapped spheres we will want to skip duplicating these vertices, as they add overhead. I intend to add this feature later.

Getting started with Gilgamesh.

Start by cloning Gilgamesh and its dependencies (glm and minizip).

git clone https://github.com/andy-thomason/gilgamesh
cd gilgamesh
git submodule init
git submodule update

Make a build directory and use cmake to construct a project (makefiles or visual studio)

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..

Now you can build the examples by running Visual Studio or make.

Happy meshing!

Minizip – a small zip library

I’ve recently split my zip decoder off from Octet (and Vookoo) and it has a project of its own.

It lives here: https://github.com/andy-thomason/minizip

Minizip is a tiny modern C++ library for decoding zip files and other file formats containing the Deflate compression method such as PNG.

It is going to be used for the Zegami image library project in Oxford by my student David Babera and so should get some testing.

There are two classes at present:


minizip::decoder
minizip::zipfile_reader

The decoder class takes a block of deflate encoded data and produces raw bytes. The zipfile_reader class manages a mapped zip file and enables extraction of data in parallel from Zip files.

I may move my bzip2 decoder into this project also to avoid duplication of effort. Zip files can optionally include Bzip2 compressed data.

I will also write a microscopic version of an LZMA codec if I get the time.

This is a basic example of use:


#include <minizip/decoder.hpp>
#include <minizip/zipfile_reader.hpp>

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // if_zip is a byte array containing a zip file, for example
  // embedded in the binary.
  minizip::zipfile_reader reader(if_zip, if_zip + sizeof(if_zip));

  // Get a list of file names.
  auto names = reader.filenames();

  // Read the first file using the name as a lookup.
  std::vector<uint8_t> text = reader.read(names[0]);
  std::cout.write((const char*)text.data(), text.size());

  // Get a list of directory entries.
  auto entries = reader.dir_entries();

  // Read the first file using the directory entry.
  std::vector<uint8_t> text2 = reader.read_entry(entries[0]);
  std::cout.write((const char*)text2.data(), text2.size());
}

Getting OpenGL to run on a headless server

Today I have been looking for methods of running OpenGL – and ultimately Vulkan – on a headless server so that we can do thin client experiments on Vookoo, our vulkan wrapper library.

There is an excellent article here: http://renderingpipeline.com/2012/05/windowless-opengl/ on the subject which gave me some basic clues about running GLX, the interface to OpenGL on Linux machines.

I did a few little experiments and discovered that it was quite easy to run OpenGL examples such as glxgears http://linux.die.net/man/1/glxgears from an SSH session provided you had a working X server running and knew the display name.


$ ssh me@myserver.com
$ export DISPLAY=:1 # note: you should check this variable on your xterm
$ glxgears

And glxgears (from the Mesa utils) should appear, running on the server’s desktop.

Of course, we don’t actually want to do this, since you can’t see the result unless you are in the room with the server, but it does enable us to create framebuffers on our server, render to them and pump the result to an Android app or web page.

Thin clients are likely to become more popular as internet data rates improve. The old HTML-based web model is likely to migrate to web pages that are merely windows on dynamic server generated content.

Our goal this summer at Goldsmiths College is to create a sample thin client VR system that pumps 3D content to a smartphone or desktop app so that multiple users can play a game or enjoy a visual experience on a single server.

We hope to be applying thin client technology to molecular dynamics (Bioblox) and art (Mutator).