C++ multithreading – an effective parallel for loop.


C++ multithreading, a background.

C++11, 14 and 17 give us an effective multithreading library that enables us to use all the cores of our CPU, but at a price.

The library allows us to create threads – a heavyweight concurrency resource – and asyncs – lightweight concurrency resources that can draw on a pool of threads to avoid the very high thread-creation overhead, typically a million cycles or more.

In this example we shall be using asyncs which play nicely with competing threads.

In the past we would have used OpenMP or pthreads to implement parallel constructs but these are looking a little tired now and we want to stride boldly forward to a lightweight parallel future.

On the GNU compiler, the C++ threads library just wraps pthreads, so you have to be a little careful, but other libraries are implemented in modern C++ (i.e. header-only, .hpp instead of .cpp files), which avoids the call overheads, long build times and link problems that plague 90s-style C++ and C code. With modern C++ and a single-module build we can expect sub-second build times, even on some quite large projects, and substantially better code generation.

One of the prices we pay for multithreading is the care we have to take over memory access. We cannot generally share objects between threads unless they have been specially written to handle multithreaded access. This is especially true if one thread writes an object while another reads it, or if two threads write to an object at the same time. We can, however, safely share an object between two threads if both only read it.

Thread safe C++ code, an example.

class myclass {
public:
    // A const method is generally safe to call
    // provided the class is thread safe and provided it does
    // not alter data members indirectly.
    int method1() const {
        return var_;
    }

    // A non-const method is dangerous as two threads calling
    // it at the same time can write to the same member.
    int method2() {
        // we need to use an atomic variable (std::atomic) or a
        // mutex lock (std::mutex, std::unique_lock) here for
        // safety.
        return var_++;
    }
private:
    // in modern C++, all data members must be private
    // otherwise it is impossible to guarantee data safety.
    // We also use the trailing underscore convention to avoid
    // clutter.
    int var_ = 0;
};

The example above shows the problems with data safety. Failure to observe these rules will result in race conditions occurring, where two threads write at the same time and the result depends on which one gets there first.

Atomics and Mutexes

std::mutex provides a simple mechanism to allow only one thread to access the data of a class at a time. We add a std::mutex object to our class and then use std::unique_lock to take a lock on the mutex. On some compilers this is very efficient and I would recommend it as a first resort in all classes that need protection from concurrency.
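As a minimal sketch (the class and its names are my own invention, not from any particular library), a mutex-protected counter might look like this:

#include <mutex>

class safe_counter {
public:
    // only one thread can hold the lock at a time, so the
    // increment cannot race with another increment or a read.
    void increment() {
        std::unique_lock<std::mutex> lock(mutex_);
        ++value_;
    }

    // const methods must also take the lock, so the mutex is
    // declared mutable to allow locking from a const context.
    int get() const {
        std::unique_lock<std::mutex> lock(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    int value_ = 0;
};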

std::atomic provides a mechanism to concurrently modify a single variable, for example as an atomic increment when reference counting. std::shared_ptr uses this mechanism to protect an object’s reference count from concurrent access. std::atomic also supports a raft of behaviour through the C++ memory model to make sure that loads and stores are properly ordered when passing data between threads. std::mutex is generally a lot safer to use, but can be upgraded to std::atomic in carefully selected areas when performance is likely to suffer.
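Here is a rough sketch of the reference-counting pattern, loosely in the spirit of what std::shared_ptr does internally; the class and the memory orderings are my own illustration:

#include <atomic>

class ref_counted {
public:
    // fetch_add is a single indivisible operation, so no mutex
    // is needed to protect the count itself.
    void add_ref() {
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    // returns true when the last reference has gone, at which
    // point the caller can safely destroy the object.
    bool release() {
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
private:
    std::atomic<int> count_{1};
};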

In the Java world, multithreaded methods are made safe using the synchronized keyword. This causes extra code to be generated on entry to and exit from a method so that only one thread can access the object at a time. The generated code is quite complex but can avoid the use of mutexes in most circumstances.

The costs of atomic operations can vary quite a lot between platforms. For example, on the PS3 the cost of an atomic increment was around 35-70 cycles, and mobile platforms are similar. On the x86 architecture it tends to be a lot higher, at around 400 cycles depending on the microarchitecture. x86 also strictly orders all its loads and stores in the scoreboarding stages and in the L2 cache, and this has an overhead. It does make the x86 architecture safer to use, however, and allows less-skilled coders to get away with many things that would not be possible on mobile devices.

You can also write code to test statistically whether race conditions are likely to occur. Some tools exist which do this automatically, but I prefer to add my own checks for potential race conditions, for example by setting a regular member variable to the current thread ID and checking that two threads do not enter non-const methods simultaneously. In games the overhead of many GNU-like tools such as gprof is likely to be too high to test the game effectively. I have not tested “thread sanitizer” in the LLVM suite, so answers on a postcard please.
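To illustrate the thread-ID trick, here is a hypothetical checker class of my own devising. I have made the owner field a std::atomic so the check itself cannot race:

#include <atomic>
#include <cassert>
#include <thread>

class write_checker {
public:
    // call on entry to every non-const method: claims the object
    // for the current thread and asserts nobody else is inside.
    void enter() {
        std::thread::id none{};
        bool ok = owner_.compare_exchange_strong(
            none, std::this_thread::get_id());
        assert(ok && "two threads in a non-const method at once");
        (void)ok; // silence the unused warning in release builds
    }

    // call on exit from the method: releases the claim.
    void leave() {
        owner_.store(std::thread::id{});
    }
private:
    std::atomic<std::thread::id> owner_{};
};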

The lambda

The great thing about modern C++ is the lambda function which enables us to embed mini-functions in our code that capture variables in the current scope automagically.

Lambdas are super-simple to generate. Note the semicolon after the closing } as the {} is part of an expression.

int a = 1;
auto mylambda = [a](int b) { return a + b; };

We can then call this lambda as a function (in fact it is an instance of a compiler-generated class with a call operator).

int x = mylambda(5);

This takes the value of a at the time the lambda was created (1) and adds it to the parameter b (5) giving x = 6.

In this case we capture the variable a by value, but by using [&a] we can capture it by reference, so if we do:

int a = 1;
auto mylambda = [&a](int b) { return a + b; };
a = 2;
int x = mylambda(5);

We will get x = 2 + 5 = 7 because a has changed.

The parallel for loop

Finally, the promised item, the parallel for loop.

This example is from the Gilgamesh molecules example.

#include <atomic>
#include <future>
#include <thread>
#include <vector>

template <class F>
void par_for(int begin, int end, F fn) {
  // shared counter: each task atomically claims the next index
  std::atomic<int> idx;
  idx = begin;

  int num_cpus = std::thread::hardware_concurrency();
  std::vector<std::future<void>> futures(num_cpus);
  for (int cpu = 0; cpu != num_cpus; ++cpu) {
    futures[cpu] = std::async(
      std::launch::async,
      [cpu, &idx, end, &fn]() {
        for (;;) {
          int i = idx++;
          if (i >= end) break;
          fn(i, cpu);
        }
      }
    );
  }
  for (int cpu = 0; cpu != num_cpus; ++cpu) {
    futures[cpu].get();
  }
}

This uses a number of constructs from the C++11 threads library. std::atomic, which we described earlier, means that the variable idx is treated atomically, so the expression idx++ uses special instructions to make sure that each thread gets a different, sequential value. Some threads will get values at or beyond end, but as long as we do not overflow idx these simply terminate the async they are running in.

std::async we have discussed before as the weapon of choice for running concurrent code. We would generally avoid std::thread as it has much larger setup costs, especially if we run our for loop more than once.

std::thread::hardware_concurrency() returns the number of concurrency resources your CPU has. This is not necessarily the same as the number of cores. On my AMD FX, for example, it returns 8, the number of genuine cores. On Intel i7 CPUs it also returns 8 even though there are only four cores, because the two “hyperthreads” on each core each count as a concurrency resource.

Finally, std::future<void> wraps the shared state that the get() function uses to synchronise with the asyncs.

The result is a replacement for for() that we can use all over the place in our code, provided we observe the rules above.

par_for(
    10, 20,
    [](int idx, int cpu) {
        printf("task %d running on cpu %d\n", idx, cpu);
    }
);

Note that we use printf and not std::cout in parallel code. iostreams gets a bit confused because its locking is around each element separated by <<, so output from different threads can interleave within a line, whereas with printf the lock is around the whole call, which keeps the lines intact. Game programmers often avoid using iostreams out of the box for this reason.

This will give some kind of result like:

task 11 running on cpu 0
task 10 running on cpu 1
...
task 19 running on cpu 7
task 18 running on cpu 6

Note that there is no guarantee of the order in which the tasks run, or of which CPU each task runs on.

Now if our task operates on separate parts of an array, for example, then no synchronisation primitives such as mutexes are needed in the lambda. We can also do some basic things like new and delete, because the allocator is internally synchronised, although I would recommend that no C++ programmer ever uses these outside of container libraries.

Also, any variables on our stack are perfectly safe to use provided they are created inside the lambda. A common mistake is to capture a std::vector into the lambda by accident and push to it! This causes some considerable headaches, I can tell you, so avoid the catch-all [&] capture for sanity’s sake. If you do want to push to a common vector, put a mutex around your push and push, say, a thousand items at a time for performance, or store the values by indexing into an already-sized array, as in the sketch below. Remember to capture your mutex by reference in the lambda; don’t create several independent mutexes by mistake!
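To make this concrete, here is a sketch using the par_for from above; the sizes and the squaring are hypothetical stand-ins for real work:

#include <mutex>
#include <vector>

void collect_example() {
    const int n = 1000;
    std::vector<int> out(n);  // pre-sized: each task owns slot idx
    std::vector<int> hits;    // shared: every push needs the lock
    std::mutex hits_mutex;

    par_for(0, n, [&out, &hits, &hits_mutex](int idx, int cpu) {
        int value = idx * idx; // stand-in for real work
        out[idx] = value;      // safe: no two tasks share an index

        if (value > 500000) {
            // the mutex is captured by reference, so every task
            // locks the same one.
            std::unique_lock<std::mutex> lock(hits_mutex);
            hits.push_back(idx);
        }
    });
}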

An extension to the parallel for is to create a vector of “contexts”, one per CPU, and pass these to the lambda. Each context is accessible to only one async and therefore does not need mutex protection. An example is to create an object pool in the context so that allocation can be done without a mutex.
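A minimal sketch of the context idea, again using the par_for above (the context struct and its members are purely illustrative):

#include <thread>
#include <vector>

struct context {
    std::vector<int> pool; // e.g. a per-CPU object pool or buffer
    long checksum = 0;
};

void context_example(int n) {
    int num_cpus = std::thread::hardware_concurrency();
    std::vector<context> contexts(num_cpus);

    par_for(0, n, [&contexts](int idx, int cpu) {
        // contexts[cpu] is only ever touched by the single async
        // pinned to this cpu index, so no mutex is required.
        contexts[cpu].checksum += idx;
    });

    // reduce the per-CPU results on the calling thread
    long total = 0;
    for (auto &ctx : contexts) total += ctx.checksum;
    (void)total;
}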

In the distant past we used to make local copies of objects on the stack to prevent accidental writes. This is still quite valid today, as creating copies, especially of aligned classes, can be very inexpensive. We can also use move semantics to our benefit here by moving objects to the thread context, modifying them safely and moving them back to general storage.
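For example (a sketch of my own, again assuming the par_for above), each task can move its element into a local, modify it privately and move it back:

#include <string>
#include <utility>
#include <vector>

// each task moves one string out of shared storage, works on its
// private copy, then moves the result back; no other task touches
// items[idx], so no lock is needed.
void process_all(std::vector<std::string> &items) {
    par_for(0, (int)items.size(), [&items](int idx, int cpu) {
        std::string s = std::move(items[idx]); // move to this task
        s += '!';                              // modify safely
        items[idx] = std::move(s);             // move back
    });
}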

We can also use thread local storage (TLS), although this is quite difficult if your program uses shared objects as each shared library needs its own TLS. I would strongly recommend using thread contexts instead.

If you have any questions, feel free to comment, or come on one of my training sessions in the City or elsewhere.

 
