The R programming language
R is a language for statisticians that performs operations using vectors of numbers. It has its origins in S, another statistical language, and has hints of LISP about it.
At Atomic Increment, we are often required to perform operations on data supplied in R’s native binary format, RDA, and occasionally in RDS, the serialisation of a single R object. RDA is the most common bioinformatics file format thanks to Bioconductor, the suite of R routines for biological information.
The only real documentation for these formats is the source code itself.
Because R is an academic project, the documentation and code quality are not up to commercial standards: the code is an eclectic mix of 1980s-style C and Fortran that will only build with one compiler on one platform. Writing from Hong Kong today, I would say it is a bit like the Chungking Mansions of open source projects. That is not meant to be rude to R aficionados: it works and it is popular, just don’t look inside.
The data format comes in four flavours:
- ASCII
- ASCII hex
- Binary (platform-endian)
- XDR binary (big-endian)
The last of these is the most common format and the one we will focus on here.
XDR (External Data Representation) is, for our purposes, just shorthand for big-endian binary. It is documented here: https://tools.ietf.org/html/rfc1014
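To make "big-endian binary" concrete, here is a minimal Python sketch of the two primitive reads everything else is built on. The helper names read_int32 and read_double are my own for illustration, not anything from R.

```python
import struct

def read_int32(buf, offset):
    """Return (value, new_offset) for a big-endian signed 32-bit integer."""
    (value,) = struct.unpack_from(">i", buf, offset)
    return value, offset + 4

def read_double(buf, offset):
    """Return (value, new_offset) for a big-endian IEEE 754 double."""
    (value,) = struct.unpack_from(">d", buf, offset)
    return value, offset + 8

# Four bytes 00 00 00 2A are the int32 42; the double is packed the same way.
sample = bytes([0x00, 0x00, 0x00, 0x2A]) + struct.pack(">d", 1.5)
n, pos = read_int32(sample, 0)
x, pos = read_double(sample, pos)
print(n, x, pos)  # 42 1.5 12
```

Returning the new offset alongside the value keeps the reader stateless, which makes it easy to chain reads over one buffer.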
The R XDR data format
RDA is usually either gzip or bzip2 compressed, so you will need to unpack it to a buffer before decoding. I suggest using Minizip, my compression library. The RDA format starts with “RDX2\n” followed by “X\n”. The RDS format only has the “X\n”. The ‘X’ in this case indicates an XDR-coded R file.
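As a rough sketch of that first step, the Python below uses the standard gzip and bz2 modules rather than Minizip, and sniffs the compression from the leading magic bytes. The function name unpack_r_file is hypothetical, and only the headers described above are handled.

```python
import bz2
import gzip

def unpack_r_file(raw):
    """Decompress if needed, check the header, return (kind, buffer, payload_offset)."""
    if raw[:2] == b"\x1f\x8b":      # gzip magic bytes
        raw = gzip.decompress(raw)
    elif raw[:3] == b"BZh":         # bzip2 magic bytes
        raw = bz2.decompress(raw)

    offset = 0
    kind = "RDS"
    if raw.startswith(b"RDX2\n"):   # RDA files carry this extra prefix
        kind = "RDA"
        offset = 5

    if raw[offset:offset + 2] != b"X\n":
        raise ValueError("not an XDR-format R file")
    return kind, raw, offset + 2

kind, buf, pos = unpack_r_file(gzip.compress(b"RDX2\nX\n" + b"\x00\x00\x00\x02"))
print(kind, pos)  # RDA 7
```

Anything that is neither gzip nor bzip2 is treated as already unpacked, so the same function can be fed a raw buffer.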
Next we have three int32 values: the version of the serialisation format, the version of R that wrote the file, and the minimum version of R needed to read it.
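A small sketch of reading those three values in Python follows. It assumes the order is serialisation format version, writing R version, minimum reading R version, and that R versions are packed as major * 65536 + minor * 256 + patch, as R's R_Version macro does. The function names are mine.

```python
import struct

def read_versions(buf, offset):
    """Read the three big-endian int32 version fields that follow the header."""
    fmt_version, writer, min_reader = struct.unpack_from(">3i", buf, offset)
    return fmt_version, writer, min_reader, offset + 12

def unpack_r_version(packed):
    """Split a packed R version number into (major, minor, patch)."""
    return packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF

# Made-up header fields: format 2, written by R 3.4.2, readable from R 2.3.0.
sample = struct.pack(">3i", 2, 0x030402, 0x020300)
fmt_version, writer, min_reader, pos = read_versions(sample, 0)
print(fmt_version, unpack_r_version(writer), unpack_r_version(min_reader))
```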
Next we have a single R object which may contain other R objects.
Each object starts with an int32 which encodes the flags for that object.
- Type: low 8 bits (bits 0-7)
- Is Object: 1 bit (bit 8)
- Has Attribute: 1 bit (bit 9)
- Has Tag: 1 bit (bit 10)
- Levels: top 20 bits (bits 12-31)
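The flag layout can be unpicked with a few masks and shifts. This is a sketch in Python, with bit positions taken from the masks in serialize.c (bits 8, 9 and 10 for the three single-bit flags, levels shifted up by 12). The function name and dictionary keys are my own.

```python
def decode_flags(flags):
    """Split the int32 flags word that precedes each object into its fields."""
    flags &= 0xFFFFFFFF  # treat the word as unsigned before shifting
    return {
        "type": flags & 0xFF,                  # SEXP type code, see Rinternals.h
        "is_object": bool(flags & (1 << 8)),
        "has_attr": bool(flags & (1 << 9)),
        "has_tag": bool(flags & (1 << 10)),
        "levels": flags >> 12,
    }

# 0x30D: type 13 (INTSXP in Rinternals.h) with "Is Object" and "Has Attribute" set.
print(decode_flags(0x0000030D))
```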
The Type determines the type of the R object. See Rinternals.h for details.
If the Type is one of the LISP-style objects (LISTSXP, LANGSXP, CLOSXP, PROMSXP, DOTSXP), then “Has Attribute” and “Has Tag” indicate that we must read additional objects from the stream.
It is not clear from any documentation what “object” and “levels” mean.
The other types we are interested in are the data types; environments and other constructs are out of the scope of this document.
The CHARSXP, LGLSXP, INTSXP, REALSXP and CPLXSXP data types contain vectors of char, int, int, double and pairs of doubles respectively. Each vector is preceded by its length as a signed int32, which allows up to 2^31-1 elements.
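As a sketch of how these vectors come off the stream, the Python below assumes each one is an int32 length followed by that many big-endian elements. The type codes are the ones in Rinternals.h, and the function names are hypothetical. Complex vectors are omitted for brevity, and a CHARSXP length of -1 is taken to mean a missing string (NA), which is how serialize.c writes it.

```python
import struct

CHARSXP, LGLSXP, INTSXP, REALSXP = 9, 10, 13, 14  # codes from Rinternals.h

def read_numeric_vector(buf, offset, sexp_type):
    """Read an int32 length, then that many int32s or doubles."""
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    code = {LGLSXP: "i", INTSXP: "i", REALSXP: "d"}[sexp_type]
    layout = f">{length}{code}"
    values = struct.unpack_from(layout, buf, offset)
    return list(values), offset + struct.calcsize(layout)

def read_charsxp(buf, offset):
    """Read an int32 length, then that many raw bytes; -1 marks NA."""
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if length == -1:
        return None, offset
    return bytes(buf[offset:offset + length]), offset + length

ints = struct.pack(">i3i", 3, 1, 2, 3)
print(read_numeric_vector(ints, 0, INTSXP))  # ([1, 2, 3], 16)
print(read_charsxp(struct.pack(">i", 5) + b"hello", 0))  # (b'hello', 9)
```

The string is returned as bytes because its encoding depends on flag bits not covered here.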
The SYMSXP type is followed by another item, usually a string.
VECSXP and EXPRSXP are vectors of general objects, again preceded by a signed int32 length (so up to 2^31-1 elements).
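A generic vector can be read with the same length-then-items pattern, delegating each element to whatever routine reads a whole object. In this Python sketch that routine is passed in as read_item, and toy_read_item is only a stand-in so the example runs; a real reader would decode the flags word and dispatch on the type.

```python
import struct

def read_generic_vector(buf, offset, read_item):
    """Read an int32 length, then that many objects via read_item."""
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    items = []
    for _ in range(length):
        item, offset = read_item(buf, offset)
        items.append(item)
    return items, offset

def toy_read_item(buf, offset):
    """Stand-in object reader: pretends every object is a single int32."""
    (value,) = struct.unpack_from(">i", buf, offset)
    return value, offset + 4

data = struct.pack(">i2i", 2, 7, 9)
print(read_generic_vector(data, 0, toy_read_item))  # ([7, 9], 12)
```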
For details of all of these read the source code of R_Unserialize() in serialize.c
I will probably make a modern C++ library for reading R data in the near future, other commitments notwithstanding, so watch this space. It may be some time before I fully understand the format, but I can add to this post if there is popular demand.