Game Engine, C 1, P 7, Beginner multithreading

Difficulty of beginner implementation

I don't want to present myself as an experienced senior programmer. I do know some basic principles that are hard to put into words and that help me write better and faster code, but today I want to set them aside and write something beginner-friendly and simple, so it is easier for me to explain and easier for you to understand.

Implementation itself

benchmark_single.cpp

The single-threaded implementation is just a loop to check all the numbers.
#include "benchmark_platform.h"

#include <thread>
#include <atomic>

// Number of hardware threads; used as the stride between checked numbers.
static test_type core_count = (test_type)std::thread::hardware_concurrency();
// Number of primes found. Atomic so the same counter can later be shared by several threads.
static std::atomic<test_type> hit_count = 0;

// Checks every core_count-th number starting from offset, up to (but not including) max.
void test_increment(test_type offset, test_type max)
{
 for (test_type i = offset; i < max; i += core_count)
 {
  if (isPrimeNumber(i)) ++hit_count;
 }
}

test_type perform_test(test_type max)
{
 hit_count = 0;
 // Run the core_count interleaved sub-ranges one after another, all on this thread.
 for (test_type i = 2; i < core_count + 2; ++i)
 {
  test_increment(i, max);
 }
 return hit_count;
}
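benchmark_platform.h is not shown in this post. If you are following along without it, a minimal stand-in could look something like the sketch below; the test_type alias and the trial-division isPrimeNumber are my assumptions, not necessarily what the actual header contains.
// benchmark_platform.h - hypothetical minimal stand-in, not the original header
#pragma once

#include <cstdint>

using test_type = std::uint64_t; // assumed: a wide unsigned integer type

// Assumed: the simplest trial-division primality check.
inline bool isPrimeNumber(test_type n)
{
 if (n < 2) return false;
 for (test_type d = 2; d * d <= n; ++d)
 {
  if (n % d == 0) return false;
 }
 return true;
}

// Each benchmark_*.cpp implements this entry point.
test_type perform_test(test_type max);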

The atomic variable is not important here, but it will be used in the future, so I decided to show it now. It is very important for multi-threaded applications, because it is the fastest way to pass simple information between threads without using mutexes and similar mechanisms. Addition, subtraction, increment and so on each execute as a single atomic operation, so the data cannot be corrupted by a context switch happening in the middle of the operation.
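To illustrate, here is a minimal sketch (separate from the benchmark) of several threads incrementing one shared std::atomic counter. Every ++ is applied exactly once, with no locks involved.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
 std::atomic<long> counter = 0;
 std::vector<std::thread> threads;
 for (int t = 0; t < 4; ++t)
 {
  threads.emplace_back([&counter]
  {
   for (int i = 0; i < 1'000'000; ++i) ++counter; // one atomic read-modify-write per increment
  });
 }
 for (auto& t : threads) t.join();
 std::cout << counter << "\n"; // always prints 4000000
}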
I decided to break the loop into a number of pieces that corresponds to the number of virtual cores, or simultaneous threads, on my PC. That is, on my remarkable Intel Core i3, I can have 4 threads running at the same time. I hard-coded 20'000'000 numbers to check and got 1270607 hits. I didn't double-check whether this is correct, but my focus was to achieve the same input and output with the different libraries, which I did.

benchmark_multi.cpp

For a beginner's test of multithreading, I decided to go with std::shared_future<void> and std::async. They do the job and are, in my opinion, the easiest to use. They are well suited for UI work: when I wrote code with Dear ImGUI and NanoGUI, I used lambdas and futures all the time. But for a game engine core loop, this just won't do. You'll see why in the results.
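Before the benchmark code, a quick aside on what I mean by the lambda-plus-future pattern. This little sketch is not from the engine: std::async runs the lambda, potentially on another thread, and get() collects the result later.
#include <future>
#include <iostream>
#include <string>

int main()
{
 // Launch a lambda asynchronously; the returned future holds the eventual result.
 std::future<std::string> answer = std::async(std::launch::async, []
 {
  return std::string("loaded"); // pretend this is a slow operation (file load, etc.)
 });

 // The calling thread is free to do other work here...

 std::cout << answer.get() << "\n"; // get() blocks until the lambda has finished
}
Now, to the benchmark itself.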
#include "benchmark_platform.h"

#include <thread>
#include <atomic>
#include <future>
#include <list>
#include <algorithm>

static test_type core_count = (test_type)std::thread::hardware_concurrency();
static std::atomic<test_type> hit_count = 0;

// How many numbers each task checks before we wait and schedule the next batch.
const test_type task_count = 256;

// Checks every core_count-th number starting from offset, up to (but not including) max.
void test_increment(test_type offset, test_type max)
{
 for (test_type i = offset; i < max; i += core_count)
 {
  if (isPrimeNumber(i)) ++hit_count;
 }
}

test_type perform_test(test_type max)
{
 hit_count = 0;
 std::list<std::shared_future<void>> futures;
 test_type incr = core_count * task_count; // numbers covered by one batch of tasks
 for (test_type i = 2; i < max; i += incr)
 {
  futures.clear();
  // One task per hardware thread; task j checks i + j, i + j + core_count, ... within this batch.
  for (test_type j = 0; j < core_count; ++j)
  {
   test_type offset = i + j;
   // Note: the default launch policy is async | deferred; std::launch::async would force a new thread.
   futures.push_back(std::async(test_increment, offset, std::min(offset + incr, max)));
  }
  // Wait for the whole batch to finish before starting the next one.
  for (auto& f : futures)
  {
   f.get();
  }
 }
 return hit_count;
}

So, that's the code. We store the futures in a linked list, create them with asynchronous calls to test_increment, wait for them all to finish and move on to the next batch. Let's show the results and then discuss the code in more detail.

Results

The results were expected. We gained a performance boost from the multithreading approach, but as we will see in later posts, this is not the best boost we could achieve.
        Launch 1  Launch 2  Launch 3  Launch 4  Launch 5  Launch 6  Launch 7  Average
Single  18.0532   16.256    16.4315   16.4052   16.4346   16.0924   16.2453   16.3108
Multi   11.0562   9.60286   11.0142   11.0135   9.84951   9.2653    11.0359   10.2969

I decided to exclude the first launch from the average, because it differed quite a bit from the rest.
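I haven't shown the timing harness itself; the assumption here is that something as simple as std::chrono around perform_test, with the limit hard-coded to 20'000'000, is all that is needed. A minimal sketch of such a harness:
#include "benchmark_platform.h"

#include <chrono>
#include <iostream>

int main()
{
 const test_type max = 20'000'000; // same hard-coded limit as in the post

 auto start = std::chrono::steady_clock::now();
 test_type primes = perform_test(max);
 auto stop = std::chrono::steady_clock::now();

 std::chrono::duration<double> elapsed = stop - start;
 std::cout << primes << " primes in " << elapsed.count() << " s\n";
}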

Details

Now, let's talk about the details of the implementation and what could be improved. There is not much to say about the single-threaded code. I decided to do a nested loop in order to have at least some logic in it, because otherwise it would be just a simple for loop.

Multithreaded details

First of all, let's address the elephant in the room: why am I using the async and future combination? From my experience, people who have recently gotten into multithreaded programming tend to lean towards these. They are simple to use and behave the way you expect. They also support lambdas and exceptions.
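On the exception point: an exception thrown inside the asynchronous task is stored in the future and re-thrown when you call get(), so error handling stays in the calling code. A small illustration, again separate from the benchmark:
#include <future>
#include <iostream>
#include <stdexcept>

int main()
{
 auto task = std::async(std::launch::async, []() -> int
 {
  throw std::runtime_error("something went wrong in the worker");
  return 42; // never reached
 });

 try
 {
  int value = task.get(); // the stored exception is re-thrown here
  std::cout << value << "\n";
 }
 catch (const std::exception& e)
 {
  std::cout << "caught: " << e.what() << "\n";
 }
}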

As for the linked list, it isn't the most performant choice, as far as I know: it allocates memory for every node separately, and memory allocations are slow. I used it to show that the number of task pieces submitted to our multithreaded job system can vary quite a lot. I think the best solution here would be to allocate all the memory in advance and use it where needed, as a fixed-size array or a paged linked list.
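In this particular benchmark the batch size is known up front, so even a std::vector reserved once outside the loop would avoid the per-node allocations. A possible variant of perform_test (a sketch, not something I benchmarked; it reuses the globals and test_increment from above and needs #include <vector> instead of <list>):
test_type perform_test(test_type max)
{
 hit_count = 0;
 std::vector<std::shared_future<void>> futures;
 futures.reserve(core_count); // one allocation up front instead of one per list node
 test_type incr = core_count * task_count;
 for (test_type i = 2; i < max; i += incr)
 {
  futures.clear(); // keeps the reserved capacity, so no new allocations
  for (test_type j = 0; j < core_count; ++j)
  {
   test_type offset = i + j;
   futures.push_back(std::async(test_increment, offset, std::min(offset + incr, max)));
  }
  for (auto& f : futures)
  {
   f.get();
  }
 }
 return hit_count;
}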

Waiting for all tasks to finish. The whole point of a job system is to keep feeding it tasks asynchronously, so that a thread can pick up new tasks while other threads are still working on old ones. That is all great, and it is the design the chosen libraries follow, but it would take me a while longer to write a good task queue solution with task dispatching or task pooling, so I decided to keep it like this. It will also show the pros and cons of using such an approach.
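To give an idea of what such a task queue could look like, here is a very rough sketch of a worker pool with a mutex-protected queue and a condition variable. Real job systems are far more sophisticated (lock-free queues, work stealing, waiting on groups of jobs), and none of the names below come from the libraries used later in this series.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimpleJobQueue
{
public:
 explicit SimpleJobQueue(unsigned worker_count)
 {
  for (unsigned i = 0; i < worker_count; ++i)
  {
   workers.emplace_back([this] { worker_loop(); });
  }
 }

 ~SimpleJobQueue()
 {
  {
   std::lock_guard<std::mutex> lock(mutex);
   stopping = true;
  }
  wake.notify_all();
  for (auto& w : workers) w.join();
 }

 void submit(std::function<void()> job)
 {
  {
   std::lock_guard<std::mutex> lock(mutex);
   jobs.push(std::move(job));
  }
  wake.notify_one(); // one sleeping worker wakes up and picks the job up
 }

private:
 void worker_loop()
 {
  for (;;)
  {
   std::function<void()> job;
   {
    std::unique_lock<std::mutex> lock(mutex);
    wake.wait(lock, [this] { return stopping || !jobs.empty(); });
    if (stopping && jobs.empty()) return;
    job = std::move(jobs.front());
    jobs.pop();
   }
   job(); // run outside the lock so other workers can keep dequeuing
  }
 }

 std::vector<std::thread> workers;
 std::queue<std::function<void()>> jobs;
 std::mutex mutex;
 std::condition_variable wake;
 bool stopping = false;
};
Notice that this sketch has no way to wait for the submitted jobs to finish; adding that (a counter, or futures per job) is the kind of detail dedicated libraries take care of for you.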

Pros and Cons of futures. The pros are, of course, that the tasks are executed in parallel, which means we are magically gaining additional computing power, and the API is simple. The main drawback that I wanted to show is thread / context switching. It takes time to switch between threads, and if you create a lot of them, they will not only get less CPU time each but also spend a lot of time switching between one another.
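One more thing worth knowing about std::async in this context: with the default policy the implementation may run the task on a new thread or defer it until get() or wait() is called. If you want to be sure you actually get parallelism, you can request it explicitly. A short sketch:
#include <future>

void some_work() { /* stand-in for real work */ }

void launch_examples()
{
 // May run on another thread, or may be deferred until get()/wait() is called.
 auto maybe_parallel = std::async(some_work);

 // Runs on its own thread (or std::async throws std::system_error if it can't start one).
 auto parallel = std::async(std::launch::async, some_work);

 // Runs lazily, on whichever thread calls get()/wait().
 auto lazy = std::async(std::launch::deferred, some_work);

 maybe_parallel.get();
 parallel.get();
 lazy.get(); // some_work runs here, on the current thread
}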

Race condition and atomic. The last thing to talk about is the race condition. If you already know what this is, skip ahead. A race condition occurs when the outcome of some operations depends on the order in which they happen to execute. Let's look at a very simple processor on a lower (assembly) level. Incrementing a number num1 that lives in memory is really a sequence of commands: move num1 into a register, add 1 to that register, store the register back to num1's memory. What if two threads run this sequence on the same num1 at the same time? One command chain can interrupt the other. Let's see:
Thread1: Reg = num1. Thread2: Reg = num1. Thread1: Reg += 1. num1 = Reg. Thread2: Reg += 1. num1 = Reg.
What is wrong here? Both threads read the old value of num1 before either of them stored the new one, so num1 ends up increased by 1 instead of 2; one increment is simply lost. (Each thread keeps its own copy of the register contents across a context switch, so the real problem is the shared memory location, not the register itself.) This is exactly why hit_count in the benchmark code is atomic. To battle this, people use thread synchronization mechanisms, such as mutexes and locks. An alternative to that is variables with atomic operations. Each of them does its whole job in a single indivisible command, so it can't be interrupted halfway. This should be simpler and faster than mutexes, but the approach can't be applied everywhere.
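Here is the same situation in code, a small sketch separate from the benchmark: the plain int counter is intentionally racy and will usually end up below the expected value, while the mutex-guarded one is always correct, at the cost of locking.
#include <iostream>
#include <mutex>
#include <thread>

int main()
{
 int racy_counter = 0;    // broken on purpose: ++ on a plain int is a non-atomic read-modify-write
 int guarded_counter = 0; // protected by a mutex below
 std::mutex guard;

 auto work = [&]
 {
  for (int i = 0; i < 1'000'000; ++i)
  {
   ++racy_counter; // data race (undefined behavior); increments get lost

   std::lock_guard<std::mutex> lock(guard);
   ++guarded_counter; // always reaches exactly 2'000'000
  }
 };

 std::thread t1(work), t2(work);
 t1.join();
 t2.join();

 std::cout << "racy: " << racy_counter << ", guarded: " << guarded_counter << "\n";
}
An std::atomic<int> counter would behave like the guarded one, but without the lock.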

Conclusion

I decided to switch to a somewhat more informative format for this blog, because I felt that these concepts would be too complex for people completely new to programming, while experienced people wouldn't find anything of interest here. So I'm trying to do both: show some stuff to the "veterans" and explain a few bits to beginners. However, this is in no way a completely beginner-friendly blog.
To help you with studying, I suggest you become acquainted with how operating systems work and with other principles in adjacent fields. I recommend Modern Operating Systems by Andrew S. Tanenbaum. It was my guide during my studies: it explained a lot of things that I didn't understand at the time and made me appreciate the intricacies of OS design.
In the next blog post, I'll show you how to integrate the Sewing and Marl libraries and benchmark them. Click here to read the post.
