Game Engine, C 1, P 8, Sewing and Marl

Info

You can take a look at the libraries here: Sewing, Marl.
Sewing is a bit simpler, focusing on C11 compatibility. It uses Boost.Context to manage fibers and builds on top of it with a one-time memory allocation scheme. It operates on a fixed-size task pool, with functions and arguments passed as raw pointers, which means that to use it you need additional structures holding all of the parameters for your task functions. Sewing is distributed under the Boost Software License 1.0.
Marl uses C++11 and has a few more features, like events and queues. It has both fixed-size and scalable pools, which allows for more flexibility, and it operates on std::function objects and lambdas. Marl is distributed under the Apache 2.0 License (which I use too).
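
To give a taste of those extra features, here is a minimal sketch of Marl's event API (marl/event.h). It follows the same pattern as the benchmark code further down, but treat it as an illustration under my assumptions about the API rather than authoritative usage:

#include "marl/defer.h"
#include "marl/event.h"
#include "marl/scheduler.h"
#include "marl/waitgroup.h"

int main()
{
    marl::Scheduler scheduler;
    scheduler.setWorkerThreadCount(4);
    scheduler.bind();
    defer(scheduler.unbind());

    marl::Event ready(marl::Event::Mode::Manual); // stays signalled once fired
    marl::WaitGroup done(1);

    marl::schedule([=] {
        defer(done.done());
        ready.wait();        // blocks this fiber, not the OS thread
        // ... work that depends on 'ready' goes here ...
    });

    ready.signal();          // wake every fiber waiting on the event
    done.wait();             // make sure the task finished before tearing down
}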

Original expectations 

Lambdas vs raw function pointers

During my studies at university, I was told time and time again that lambdas were worse than function pointers in many regards. Wrapped in std::function, they add a layer of indirection compared to raw function pointers, so they should be slower, right? In reality, that is often not the case. I'll go through the arguments, but you can read a bit more here and check a benchmark here.

Size

A raw function pointer is, obviously, a pointer. It takes up the usual pointer size for the system; for me that's 4 or 8 bytes depending on the compiler settings (32- or 64-bit). Lambdas are a bit more complicated. They are objects that can take up different amounts of space. If a lambda doesn't capture any variables, it takes up a single byte to represent its state, so it is very memory-efficient. If it does capture, its size is roughly the sum of the sizes of its captures (plus padding). The dynamic allocation only comes into play when you store the lambda in a std::function: lambdas small enough to fit into the function object's internal buffer are stored inline (the small-object optimization), while anything above that implementation-defined threshold falls back to dynamically-allocated memory.
#include <array>
#include <cstdint>

std::array<int32_t, 2> arr{};
auto t = [arr](int32_t toAdd) mutable // 'mutable' is needed to modify the by-value capture
{
    for (size_t i = 0; i < arr.size(); ++i)
    {
        arr[i] += toAdd;
    }
    return toAdd;
};

// conceptually, the compiler turns this into something like:

struct lambda1918247
{
private:
    std::array<int32_t, 2> arr;
public:
    lambda1918247(std::array<int32_t, 2> outer_arr)
        : arr(outer_arr)
    {}
    int32_t operator()(int32_t toAdd)
    {
        for (size_t i = 0; i < arr.size(); ++i)
        {
            arr[i] += toAdd;
        }
        return toAdd;
    }
};

The above should give you an example of how lambdas work.
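If you want to check those sizes yourself, here is a minimal, purely illustrative sketch (not part of the benchmark) that prints the size of a capture-less lambda, a capturing lambda and a std::function wrapper; the exact std::function size and its small-buffer threshold are implementation-defined:

#include <array>
#include <cstdint>
#include <cstdio>
#include <functional>

int main()
{
    auto empty = [] { return 42; };                 // no captures: an empty class
    std::array<int32_t, 2> arr{};
    auto capturing = [arr](int32_t toAdd) mutable   // captures 8 bytes by value
    {
        for (size_t i = 0; i < arr.size(); ++i) arr[i] += toAdd;
        return toAdd;
    };

    std::printf("capture-less lambda: %zu\n", sizeof(empty));      // 1
    std::printf("capturing lambda:    %zu\n", sizeof(capturing));  // ~sum of captures (8 here)
    // std::function has a fixed size of its own; small callables live in its
    // internal buffer, larger ones fall back to a heap allocation.
    std::printf("std::function:       %zu\n", sizeof(std::function<int32_t(int32_t)>));
}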

Speed

As you can see, a lambda creates an internal class / structure with your code inside it. When you declare a lambda, the compiler creates an instance of that internal type and populates it. When you call a lambda, you are calling the operator() of that object, not jumping through a raw function pointer, which means the compiler can inline the calls pretty efficiently. If you check the links I provided above, you can see that the speed of lambdas depends a lot on compiler optimizations: with a different set of instructions inside the function, lambdas can be either faster or slower than raw function pointers. What about the std::function class? Is it slower? It can wrap both lambdas and raw function pointers, so it is a bit hard to tell. The general consensus used to be: "it adds a layer of indirection over raw pointers, so it is a few milliseconds slower when doing millions of calls". Nowadays, however, I would vote for using std::function with lambdas for ease of use. Will they be slower? This benchmark will tell!
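
To make that comparison concrete, here is a rough, hypothetical micro-benchmark sketch of the three call styles. It is not the benchmark used later in this post, and the numbers it prints will swing heavily with your compiler and optimization settings:

#include <chrono>
#include <cstdio>
#include <functional>

static int add_one(int x) { return x + 1; }

// Times ten million calls through whatever callable it is given.
template <typename F>
long long time_calls(F&& f)
{
    auto start = std::chrono::steady_clock::now();
    volatile int acc = 0; // volatile keeps the loop from being optimized away entirely
    for (int i = 0; i < 10000000; ++i) acc = f(acc);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main()
{
    int (*raw)(int) = &add_one;                 // raw function pointer
    auto lambda = [](int x) { return x + 1; };  // capture-less lambda
    std::function<int(int)> wrapped = lambda;   // lambda behind std::function

    std::printf("raw pointer:   %lld ms\n", time_calls(raw));
    std::printf("lambda:        %lld ms\n", time_calls(lambda));
    std::printf("std::function: %lld ms\n", time_calls(wrapped));
}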

Comfort

OK, so it seems that lambdas are cool, small and fast. Should you always use them? The answer is actually "no". First of all, a lambda won't be faster than a direct function call; it can, however, be inlined better than a call through a function pointer, so it should still be used where possible. As for choosing between lambdas and function pointers, it depends on whether you need C compatibility, which compiler version you target, and so on. The main criteria, however, should be code readability, maintainability and cleanliness. If your code looks much worse with one or the other, think about moving away from it.

C Support

C support isn't really important for me. The engine I'm developing will be compiled entirely with a modern C++ compiler, and I don't intend to write C-compatible pieces of it any time soon (at least not the multithreaded task system), so I won't count it as either a pro or a con.

Feature set

Marl's feature set is bigger, so is it safe to assume that Marl is slower because of its more complex behavior? Not exactly. From what I glanced at, Marl uses assembly-level code to leverage the CPU's full potential, so it really comes down to how effectively JodiTheTigger managed to use the Boost library versus how well Google's staff wrote their own.

Benchmark

benchmark_sewing.cpp

This was an easy piece of code to write, as both libraries come with neat little beginner examples of how to use them. However, I feel like both libraries lack the documentation to fully showcase all of their features.
#include "benchmark_platform.h"
#include "sewing.h"

#include <thread>
#include <atomic>
#include <future>
#include <list>
#include <iostream>
#include <algorithm>

static test_type core_count = (test_type)std::thread::hardware_concurrency();
static std::atomic<test_type> hit_count = 0;

const test_type task_count = 256;

struct argument_t
{
    test_type offset;
    test_type max;
};

struct info_t
{
    argument_t* args;
    size_t job_count;
};

// Each task counts the primes in its interleaved sub-range (stride = core_count).
void test_increment(Sew_Procedure_Argument argument)
{
    argument_t* arg = (argument_t*)argument;
    for (test_type i = arg->offset; i < arg->max; i += core_count)
    {
        if (isPrimeNumber(i)) ++hit_count;
    }
}

// Sewing's "main fiber": builds one stitch (job) per argument block, then
// submits them all and waits for completion.
void main_procedure(struct Sewing* sewing, Sew_Thread*, size_t, Sew_Procedure_Argument arg)
{
    info_t* info = (info_t*)arg;
    Sew_Stitch* jobs = new Sew_Stitch[info->job_count];

    for (test_type i = 0; i < info->job_count; i++)
    {
        jobs[i].procedure = test_increment;
        jobs[i].argument = (Sew_Procedure_Argument)(info->args + i);
        jobs[i].name = "test_increment";
    }

    sew_stitches_and_wait(sewing, jobs, info->job_count);
    delete[] jobs;
}

test_type perform_test(test_type mmax)
{
    hit_count = 0;

    test_type job_count = mmax / (task_count - 1);
    argument_t* args = new argument_t[job_count];
    test_type incr = core_count * task_count;
    test_type index = 0;
    // Split [2, mmax) into blocks of core_count * task_count numbers; within each
    // block, worker j checks every core_count-th number starting at i + j.
    for (test_type i = 2; i < mmax; i += incr)
    {
        for (test_type j = 0; j < core_count; ++j)
        {
            test_type offset = i + j;
            (args + index)->offset = offset;
            (args + index)->max = offset + incr < mmax ? offset + incr : mmax;
            ++index;
        }
    }
    info_t info{ args, index };

    // The first call with NULL only reports how many bytes Sewing needs;
    // the second call actually runs main_procedure using the memory we hand it.
    size_t bytes = sew_it(NULL, 32 * 1024, 4, 10, 128, main_procedure, &info);
    uint8_t* sewing = new uint8_t[bytes];
    sew_it(sewing, 32 * 1024, 4, 10, 128, main_procedure, &info);
    delete[] sewing;

    delete[] args;
    return hit_count;
}

Nothing complicated here: ask the library how much memory it needs, allocate it in advance, let Sewing run, and free the resources afterwards.

benchmark_marl.cpp

I had a bit of trouble wrapping my head around Marl's function calls and preprocessor macros, but in the end I liked using it a bit more.
#include "benchmark_platform.h"

#include "marl/defer.h"
#include "marl/scheduler.h"
#include "marl/thread.h"
#include "marl/ticket.h"
#include "marl/waitgroup.h"

#include <atomic>
#include <cmath>
#include <thread>

static test_type core_count = (test_type)std::thread::hardware_concurrency();
static std::atomic<test_type> hit_count = 0;

const test_type task_count = 256;

struct argument_t
{
 test_type offset;
 test_type max;
};

void test_increment(argument_t* arg)
{
 for (test_type i = arg->offset; i < arg->max; i += core_count)
 {
  if (isPrimeNumber(i)) ++hit_count;
 }
}

test_type perform_test(test_type mmax)
{
 hit_count = 0;

 test_type job_count = mmax / (task_count - 1);
 argument_t* args = new argument_t[job_count];
 test_type incr = core_count * task_count;
 test_type index = 0;
 for (test_type i = 2; i < mmax; i += incr)
 {
  for (test_type j = 0; j < core_count; ++j)
  {
   test_type offset = i + j;
   (args + index)->offset = offset;
   (args + index)->max = offset + incr < mmax ? offset + incr : mmax;
   ++index;
  }
 }
 job_count = index;

 marl::Scheduler scheduler;
 scheduler.setWorkerThreadCount(core_count);
 scheduler.bind();
 defer(scheduler.unbind());

 marl::WaitGroup calcPrime(job_count);

 // One fiber task per argument block; each task marks the wait group as done when it finishes.
 for (test_type i = 0; i < job_count; i++) {
  marl::schedule([=] { 
   defer(calcPrime.done());
   test_increment(args + i);
   });
 }

 calcPrime.wait();

 delete[] args;

 return hit_count;
}

When using Marl, be careful not to forget to call done() and wait() on the wait group to keep the cogs spinning.

Results

The results are below (all times in seconds):
        Launch 1  Launch 2  Launch 3  Launch 4  Launch 5  Launch 6  Launch 7  Total
Single  18.0532   16.256    16.4315   16.4052   16.4346   16.0924   16.2453   16.3108
Multi   11.0562   9.60286   11.0142   11.0135   9.84951   9.2653    11.0359   10.2969
Sewing  8.72879   8.68421   8.94425   8.67256   8.68754   8.66127   8.66274   8.7188
Marl    6.55166   6.90798   6.38344   6.44576   6.34505   6.35757   6.38263   6.4704

As you can see, these libraries outperformed my implementations by a long shot, which should be expected given how naively my code was written. What I didn't expect, however, was that Marl would beat Sewing by more than 2 seconds! I haven't benchmarked the other two libraries yet, so the final comparison should be pretty interesting. I guess Marl's approach wasn't half-bad after all.

Next time I'll benchmark FiberTaskingLib and fiber-job-system. You can check it out here.
