Game Engine, C 1, P 5: Multithreading

Multithreading

The power of Vulkan shows best in modern multithreaded applications. Otherwise, it is just a version of OpenGL that is harder to program for and doesn't bring many benefits. For this reason, I set out to write the best multithreaded execution system for my engine. But where do I start?

"Types" of multithreaded execution

First, we need to find out what types of multithreaded execution there are. Keep in mind that "multithreaded execution" is used here in a very broad sense. Don't reuse the terminology discussed here in serious communication, because it may be wrong.

Processes

The first thing to take a look at is process-based execution. Each task is assigned to a separate process, and the processes execute independently. Each process has at least a single thread assigned to it. Although this mostly comes down to using child processes, completely independent ones can also be created. If I were to draw an analogy, I would choose launching a separate program that does a specific task and terminates.

A process is, in simple terms, a program that can be executed to achieve a specific task. In most OSs each process has a priority, and processes are executed in an order influenced by this priority. Each process can have complex logic and multiple sub-parts, which we will discuss next. Processes have their own allocated memory and environment variables. Without specific means, it is difficult to read the memory of one process from another.

A good example of using child processes on Windows can be seen here. What about other operating systems? On Linux we can use fork and exec-family calls such as execl. On Android, there are different solutions, such as WorkManager and background services. Unfortunately, I didn't have time to look into process-based execution on other platforms.

In conclusion, there are ways to achieve process-based execution on the platforms that I tested. There are both pros and cons to this approach. Pros: processes run independently, and when coded in a specific way they can keep executing even when the main process crashes, which means a safer execution environment; they are scheduled by the OS process scheduler, which lets them keep running even when other parts of our program are blocked. Cons: the APIs differ between platforms and are sometimes very clunky; although processes run independently, efficient communication between them isn't easy to achieve; and they take a long time to create, initialize and destroy, so they are not suitable for in-frame execution.

What they are good at, however, is long tasks that span many seconds. File downloading, code compiling and batch processing are all good examples that can (but don't have to) use separate processes. This is why, although we have to discard this method of execution for now, we shouldn't forget about its existence.

Threads

Threads are the main building blocks on which our programs stand. If we don't want to dig deeper into CPU-specific features, this is the most bare-bones and simple way of running tasks in parallel. A thread acts as an entity within a process, waiting to be executed by the scheduler. A process can have multiple threads, which can be executed one after another or in parallel.

No more than one thread can be executed by one CPU core at a time. Each thread has thread-local storage that it can quickly access. Moreover, threads have their own call stack and registers. On most OSs, threads also have priorities, which influence their order of execution.

Let's move on to their usage. There is std::thread in the standard library, but it abstracts away some useful features that we might want to use. As for Windows, there is an excellent overview of threads and their usage on Microsoft Docs. On Linux, POSIX threads (pthreads) are used; they live in a separate library as well.

All in all, threads are the way to go for in-frame tasks. Pros: threads are low-level and fast, as they aren't huge abstractions over CPU functionality; if you need to speed up processing, you can use thread-local storage. Cons: for CPU-bound work there is no benefit in creating more threads than your CPU has cores; they will only slow each other down with constant context and register switches.

I want to use multiple threads for execution in my engine. However, I will surely have more than 4 tasks to execute (my CPU has 2 real cores plus 2 virtual ones), so I'll have to figure out how to speed things up further and divide all those tasks between the threads.

Thread Pools

Threads are great and performant. However, when we have many tasks to execute, it becomes impractical to constantly create and destroy threads just so they don't fight each other for CPU time. Is there a solution? Meet thread pools! The idea behind them is simple: there is a fixed number of threads in a pool. When you want to execute a task, you take a free thread from the pool and assign the task to it. When the task is finished, the thread returns to the pool.
[Image: thread pool diagram, from Wikipedia]
Thread pools are great for executing multiple tasks on a fixed set of threads. The speed may vary from implementation to implementation, but they are one of the most elegant ways of distributing tasks between threads. Most languages have thread pools in their standard libraries under the hood: they manage asynchronous calls, back worker threads, and so on.

On Windows, there is a Thread Pool API that already provides many pre-defined functions in classic WinAPI fashion. On Linux, there doesn't seem to be a ready-made API, but implementing something like that isn't hard.

To sum up, thread pools allow us to execute a lot of tasks on a fixed number of threads. Pros: the number of tasks can be much bigger than the thread count, and since the thread count matches the CPU core count, there is no oversubscription overhead. Cons: all tasks have to be in a uniform format, and it is hard to chain them together.

This solution suits a game engine well; there are actual toy engines that use it. However, I want to be able to chain tasks and make them wait for one another. This will help with synchronizing everything and making tasks execute in parallel, but in a controlled order.

Fibers

Let's start with 4 tasks: 2 of them calculate physics, one renders the scene to the screen, and the last one records user input to a list. As you can guess, we have to finish all of the physics before rendering. On the other hand, recording user input doesn't depend on the other tasks. We have 2 threads. Now, let's simulate their execution.

Both threads start executing physics tasks. One thread finishes its task sooner than the other and takes the rendering task. It can't proceed, because the other thread hasn't yet finished its physics work. What do we do now? We wait. What is waiting? Wasted CPU cycles. Can we prevent this? Yes! How? By switching to another task using fibers.

A fiber is a single unit of execution that can be manually scheduled within a thread. So, if you think of a thread as a box that stores a thread context, thread-local storage and so on, then the entity inside the box that actually executes commands, using the contents of the box, is the fiber. Generally, if you don't mess with a thread, it has a single execution fiber. A fiber carries a few bits of data of its own (fiber-local storage, a subset of registers, fiber data and a stack) when executing or moving between threads.

How can we use fibers to prevent our thread pool from waiting on tasks? The idea is to swap the waiting fiber within the thread for a different one. That way, when we have a pool of fibers in addition to the thread pool, we can keep using the same threads, but with different pieces of code to execute.

Now is the time to point out drawbacks and caveats. Most of the synchronization tools that we commonly use (mutexes, barriers, semaphores, etc.) rely on the thread context [don't quote me on that], so if we want to be able to swap out a fiber that is waiting in a thread, our best bet is to replace all of these with atomic locks and counters. It takes some getting used to, but this way we don't have to worry about deadlocking ourselves because of context switches.

What about the availability of fibers on different OSs? This is a tough question. There is no out-of-the-box support for fibers in C++20 as of now, so we have to stick to platform-dependent solutions. There is a Fiber API on Windows. There is no support for fibers in the Android NDK (I may be wrong here), and the Windows and Linux APIs differ quite a bit. What's more, their speed and accessibility depend on the CPU type and manufacturer.

Will I be able to overcome these limitations in the future? You'll have to read the next chapters to find out! You can do that here.
