Let’s consider a simple programming challenge: Summing all elements in a large array.

Now it stands to reason that this can be easily optimized with parallelism, especially for huge arrays with thousands or millions of elements. It also stands to reason that with parallelism, the processing time should be roughly the sequential time divided by the number of CPU cores.

As it turns out, this feat is not that easy to achieve. I’ll show you several ways to do this in parallel, how they improve or degrade performance, and all the little details that affect performance one way or the other.

For all benchmarks, I use the excellent BenchmarkDotNet library. It takes care of running several iterations of each benchmark, doing warm-up iterations, and so on. My PC has an Intel Core i7-7700HQ CPU at 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), with a 32-bit process.
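
The article doesn’t show the benchmark harness itself, but a typical BenchmarkDotNet entry point looks like this (a sketch, assuming the benchmark class is named ParallelForeach2, as its constructor below suggests):

using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        // Discovers the [Benchmark] methods on the class and runs each
        // of them with warm-up and multiple measured iterations.
        BenchmarkRunner.Run<ParallelForeach2>();
    }
}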

Establishing a Baseline

First, let’s see the straightforward solution:

private int ITEMS { get; set; } = 100000;

private int[] arr = null;

public ParallelForeach2()
{
    // Fill the array once with random values so every benchmark sums the same data
    arr = new int[ITEMS];
    var rnd = new Random();
    for (int i = 0; i < ITEMS; i++)
    {
        arr[i] = rnd.Next(1000);
    }
}

[Benchmark]
public long RegularIteration()
{
    long total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        total += arr[i];
    }

    return total;
}

The straightforward solution is as simple as it gets, right? Just iterate over the array and sum all the items.

Now, let’s try to beat it with some parallelism.

First attempt at a Parallel Solution

The basic beating-it strategy is as follows:

  • Create X threads (as many as I have CPU cores)
  • Each thread will run over a range of the array and add up the values
  • Adding to the same variable from multiple threads won’t work as-is; we’ll need a locking mechanism or we’ll get wrong results

Should be simple enough I guess. Here’s the code:

private object _lock = new object();

[Benchmark]
public long ThreadPoolWithLock()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                lock (_lock)
                {
                    total += arr[j];
                }
            }
        });
    }

    Task.WaitAll(tasks);
    return total;
}

Note that you have to use the localThread variable to “save” the value of iThread at that point in time. Otherwise, the lambda would capture iThread itself, which keeps changing as the for loop advances.
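
Here’s a minimal standalone sketch of that pitfall (not part of the benchmark):

var tasks = new Task[8];
for (int i = 0; i < 8; i++)
{
    // BUG: the lambda captures the variable i itself, not its current
    // value. By the time the tasks actually run, i is typically already
    // 8, so most tasks print 8 instead of 0..7.
    tasks[i] = Task.Run(() => Console.WriteLine(i));
}
Task.WaitAll(tasks);
// The fix: copy to a local first, e.g. var local = i; and capture local.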

Let’s use BenchmarkDotNet to compare the performance of the first parallel solution to the regular-iteration baseline (for 100,000 Items):

| Method             |       Mean |      Error |     StdDev |
|------------------- |-----------:|-----------:|-----------:|
| RegularIteration   |   161.6 us |   1.992 us |   1.765 us |
| ThreadPoolWithLock | 8,147.0 us | 148.396 us | 138.810 us |

The measuring unit ‘us’ stands for microseconds. 1,000 us = 1 millisecond.

Well, that didn’t do so well. It seems that either the overhead of creating threads or, more likely, the locking overhead is really hurting performance.

Maybe this can be improved with some optimizations.

Optimizing with Interlocked

The first possible optimization we can try is replacing the lock with the Interlocked helper class. Consider the expression total += something. It reads from total, adds the value, and then writes the result back to total. These are two separate operations on the total variable, so with multiple threads, a different thread might operate on total between the read and the write, corrupting the data. Interlocked.Add(ref total, something) guarantees that this happens as a single atomic operation (in-depth explanation here).
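
To make the lost-update problem concrete, here’s a small standalone sketch (not one of the benchmarks):

using System;
using System.Threading.Tasks;

class RaceDemo
{
    static long total = 0;

    static void Main()
    {
        // 8 tasks, each adding 1 to the shared total 100,000 times.
        var tasks = new Task[8];
        for (int t = 0; t < 8; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < 100000; i++)
                    total += 1; // unsynchronized read-modify-write
            });
        }
        Task.WaitAll(tasks);

        // Expected 800,000, but the printed value is usually lower
        // because concurrent read/add/write steps overwrite each other.
        Console.WriteLine(total);
    }
}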

Here’s the implementation:

[Benchmark]
public long ThreadPoolInterlocked()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;    
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                Interlocked.Add(ref total, arr[j]);
            }
        });
    }

    Task.WaitAll(tasks);
    return total;
}

Comparing it to the previous implementation with lock gives the following results (for 100,000 Items):

| Method                |        Mean |        Error |       StdDev |      Median |
|---------------------- |------------:|-------------:|-------------:|------------:|
| RegularIteration      |    159.3 us |     3.136 us |     4.694 us |    160.9 us |
| ThreadPoolWithLock    |  8,271.0 us |   157.944 us |   188.021 us |  8,310.9 us |
| ThreadPoolInterlocked | 26,994.1 us | 2,702.796 us | 7,489.435 us | 29,745.2 us |

Wow, it actually got much worse with Interlocked. It seems that the further I try to optimize, the worse the performance gets. I should probably stop while I can.

Actually, the reason this happens is quite sneaky. If you followed closely, you might have noticed that I’m running in a 32-bit process but using a long variable for the total sum. Since long is a 64-bit variable and the process is 32-bit, the Interlocked operation is very expensive. Let’s run the same benchmark in a 64-bit process:

| Method (64-bit)       |        Mean |      Error |     StdDev |
|---------------------- |------------:|-----------:|-----------:|
| RegularIteration      |    67.34 us |   1.297 us |   1.442 us |
| ThreadPoolWithLock    | 7,895.20 us | 156.038 us | 179.694 us |
| ThreadPoolInterlocked | 3,107.76 us |  20.343 us |  19.029 us |

In a 64-bit process, the Interlocked solution is more than twice as fast as the lock solution. The regular-iteration time dropped as well; it seems that when working with long variables, performance is much better in a 64-bit process. All the following benchmarks will run in a 64-bit process.

So using Interlocked helped a bit, but the result is still far worse than the regular iterative approach.

If you’re wondering at this point about the volatile keyword, it actually doesn’t help. The issue is that we both read and write from different threads. volatile guarantees visibility for individual reads or writes, but it doesn’t make a read-modify-write like += atomic. So using volatile without an additional lock will give wrong results. You can read more in Joseph Albahari’s explanation.

Optimizing with Parallel.For

Before going any further, let’s try the fancy Parallel.For feature that was added as part of the Task Parallel Library (TPL) in .NET 4.0. This is an optimized way to run for loops on multiple threads. The code is:

[Benchmark]
public long ParallelFor()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts, new ParallelOptions(), (iter) =>
    {
        for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
        {
            Interlocked.Add(ref total, arr[j]);
        }
    });
    return total;
}

Comparing it with the baseline and the previous interlocked implementation:

| Method                |        Mean |     Error |    StdDev |
|---------------------- |------------:|----------:|----------:|
| RegularIteration      |    69.43 us |  1.373 us |  1.969 us |
| ThreadPoolInterlocked | 3,117.44 us | 13.520 us | 12.646 us |
| ParallelFor           | 3,028.13 us | 55.719 us | 52.119 us |

Note that Parallel.For doesn’t promise to run all the iterations in parallel. For example, if you were to set the number of parts to 100 on a PC with only 8 cores, that would be impossible. It does optimize to run in parallel as much as possible. You can set the maximum degree of parallelism with an additional configuration, like this: Parallel.For(from, to, new ParallelOptions() { MaxDegreeOfParallelism = 4 }, ...
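
For completeness, here’s how the capped version could look in our benchmark class (a sketch, not one of the measured benchmarks):

public long ParallelForCapped()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    // Allow at most 4 of the 8 parts to execute concurrently
    var options = new ParallelOptions() { MaxDegreeOfParallelism = 4 };
    Parallel.For(0, parts, options, (iter) =>
    {
        for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
        {
            Interlocked.Add(ref total, arr[j]);
        }
    });
    return total;
}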

Alright, so while the syntax is much nicer with Parallel.For, it seems to be only slightly more efficient than manually queuing thread-pool work. And it’s still far less efficient than simply iterating over the array and summing the items.

Eliminating Suspects

The biggest suspect for hurting performance is the Interlocked.Add operation. Let’s just comment it out and see what changes. Instead of Interlocked.Add(ref total, arr[j]), I replaced it with a single line that reads from the array: x = arr[j]. Here are the results:

| Method                |     Mean |     Error |    StdDev |
|---------------------- |---------:|----------:|----------:|
| RegularIteration      | 57.76 us | 0.8526 us | 0.7976 us |
| ThreadPoolInterlocked | 34.21 us | 0.3934 us | 0.3487 us |
| ParallelFor           | 33.74 us | 0.6407 us | 0.7379 us |

Alright, now we’re getting somewhere. Without the Interlocked line of code, the parallel code runs almost 2 times faster than the baseline, with both Parallel.For and the thread-pool strategy.

Let’s do some brainstorming. How can we sum the elements of an array without adding an Interlocked expression?

In the specific case of summing array elements, this can easily be done with per-thread locals. We’ll keep a local sum per thread and add them up at the end (see the sketch below). Parallel.For even has helpful functionality for exactly that: there’s an overload with a parameter called localFinally.
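
The same idea works even with the plain thread-pool strategy. Here’s a sketch (not one of the article’s benchmarks) where each task keeps a local sum and touches the shared total only once:

public long ThreadPoolLocalSums()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            long localTotal = 0;
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                localTotal += arr[j]; // no synchronization in the hot loop
            }

            Interlocked.Add(ref total, localTotal); // one atomic add per task
        });
    }

    Task.WaitAll(tasks);
    return total;
}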

Parallel.For with local thread totals

The idea is that each iteration returns a value. An additional delegate, Action<TLocal> (the localFinally parameter), executes for each thread when it finishes its share of the iterations, with that thread’s final value as a parameter. In our case, there are 8 iterations, one per thread. Each iteration returns a localTotal, which is the sum of the elements in that part of the array, and the localFinally adds those up. Here’s the code:

[Benchmark]
public long ParallelForWithLocalFinally()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts,
        localInit: () => 0L, // Initializes each thread's "localTotal"
        body: (iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                localTotal += arr[j];
            }

            return localTotal;
        },
        // localFinally can run on several threads at once, so the one
        // remaining shared addition still has to be atomic
        localFinally: (localTotal) => Interlocked.Add(ref total, localTotal));
    return total;
}

Now when running this against the simple regular-iteration sum:

| Method                      |        Mean |      Error |     StdDev |
|---------------------------- |------------:|-----------:|-----------:|
| RegularIteration            |    66.12 us |  0.8337 us |  0.7391 us |
| ParallelFor                 | 3,058.63 us | 58.6600 us | 65.2005 us |
| ParallelForWithLocalFinally |    36.23 us |  0.6024 us |  0.5635 us |

Great success! We were able to cut processing time by about 50%. Not a speedup equal to the number of cores, but still, mission accomplished.

Going Deeper

There are still some unanswered questions. Why is the parallel version only 2 times faster, and not 8 times, like the number of my cores? What happens with bigger and smaller arrays? Does the CPU memory cache play any role in this?

Maybe some more benchmarks will help us understand. Let’s try with different numbers of items:

| Method                      |  ITEMS |      Mean |     Error |    StdDev |
|---------------------------- |-------:|----------:|----------:|----------:|
| RegularIteration            |   5000 |  3.312 us | 0.0642 us | 0.0789 us |
| ParallelForWithLocalFinally |   5000 |  9.630 us | 0.2453 us | 0.7234 us |
| RegularIteration            |  10000 |  8.157 us | 0.1588 us | 0.2121 us |
| ParallelForWithLocalFinally |  10000 | 12.705 us | 0.2319 us | 0.2055 us |
| RegularIteration            |  30000 |  19.76 us | 0.2826 us | 0.2505 us |
| ParallelForWithLocalFinally |  30000 |  16.90 us | 0.2695 us | 0.2521 us |
| RegularIteration            | 100000 |  65.61 us | 0.9373 us | 0.8768 us |
| ParallelForWithLocalFinally | 100000 |  35.58 us | 0.6241 us | 0.5838 us |
| RegularIteration            | 250000 | 163.55 us |  1.910 us |  1.693 us |
| ParallelForWithLocalFinally | 250000 |  75.91 us |  1.679 us |  1.866 us |

So in our scenario, at about 20,000 elements, the regular-iteration strategy is as effective as the parallel strategy. As the number of elements grows, the parallel strategy becomes more effective. At 250,000 elements, the parallel strategy is about 2.2 times faster than the regular-iteration strategy.

Before drawing any conclusions, let’s do another experiment. This time, instead of summing the elements as they are, we’ll convert each element to a string and then parse it back to an int. This operation is much heavier and should take more time.

[Benchmark]
public double RegularIteration()
{
    double total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        // A deliberately heavy per-element operation: int -> string -> int
        total += int.Parse(arr[i].ToString());
    }

    return total;
}
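
The parallel counterpart isn’t shown here; presumably it’s the same ParallelForWithLocalFinally with the parsing expression swapped into the body. A sketch, under that assumption (it reuses the _lock field from earlier):

[Benchmark]
public double ParallelForWithLocalFinally()
{
    double total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    Parallel.For(0, parts,
        localInit: () => 0.0,
        body: (iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                // The same heavy int -> string -> int round-trip
                localTotal += int.Parse(arr[j].ToString());
            }

            return localTotal;
        },
        // Interlocked.Add has no double overload, so guard the shared
        // total with a lock instead
        localFinally: (localTotal) => { lock (_lock) { total += localTotal; } });
    return total;
}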

The results this time are:

| Method                      |  ITEMS |        Mean |      Error |       StdDev |      Median |
|---------------------------- |-------:|------------:|-----------:|-------------:|------------:|
| RegularIteration            |   1000 |    153.9 us |   3.021 us |     3.928 us |    153.0 us |
| ParallelForWithLocalFinally |   1000 |    104.5 us |   3.240 us |     9.554 us |    107.6 us |
| RegularIteration            |   5000 |    781.1 us |  15.340 us |    25.203 us |    780.6 us |
| ParallelForWithLocalFinally |   5000 |    412.2 us |  12.422 us |    36.625 us |    410.3 us |
| RegularIteration            |  10000 |  1,551.9 us |  29.551 us |    34.031 us |  1,548.4 us |
| ParallelForWithLocalFinally |  10000 |    763.4 us |  21.766 us |    64.179 us |    769.7 us |
| RegularIteration            | 100000 | 17,349.3 us | 748.256 us | 2,206.251 us | 16,212.8 us |
| ParallelForWithLocalFinally | 100000 |  6,238.6 us | 123.227 us |   280.650 us |  6,261.2 us |

As you can see, in this case the parallel solution is as much as 3 times faster than the regular-iteration solution for 100,000 elements. And it becomes effective for much smaller arrays, with as few as 1,000 elements.

Conclusions and Summary

I admit that instead of taking the direct route from the challenge to the solution, we kind of took the long, scenic route. But I wanted you to see some of the performance issues along the way. Here are some of my conclusions from this journey:

  1. Optimization with parallelism can definitely improve performance, but it can easily hurt it as well. It depends on a lot of factors, and each case should be measured and checked.
  2. When the various threads need to coordinate with some kind of locking mechanism, performance can degrade significantly.
  3. As the last benchmark showed, each use case is different. In the simple benchmark, parallelism became effective above roughly 20,000 array elements. In the heavier scenario where we converted integers to strings and back, parallelism was effective even for 1,000 elements.
  4. Interlocked performance depends on process bitness. In other words, don’t use Interlocked on long variables in 32-bit processes.

Another thing to consider is the CPU cache. It’s hard to measure exactly how it affects performance, but it’s important to understand that caching happens under the hood, without our control or knowledge, and it tends to be more effective without parallelism. The fastest L1 cache is not shared between CPU cores. The mid-level L2 cache is private to each core on most modern designs (shared between a couple of cores on some older ones), and only the slowest L3 cache is shared between all the cores. Note that even the “slow” L3 cache is much faster than RAM.

So the article is at its end, and I still haven’t fully figured out why the optimization wasn’t more successful. I did learn a few things though, and I hope you did too. Feel free to share any insights in the comments section.