Let’s consider a simple programming challenge: Summing all elements in a large array.

Now it stands to reason that this can be easily optimized with parallelism, especially for huge arrays with thousands or millions of elements. It also stands to reason that the parallel processing time should be roughly the regular time divided by the number of CPU cores.

As it turns out, this feat is not that easy to achieve. I'll show you several ways to do it in parallel, how they improve or degrade performance, and all the little details that affect performance one way or the other.

For all benchmarks, I use the excellent BenchmarkDotNet library. It takes care of running warm-up iterations, repeating each benchmark many times, and so on. My machine has an Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), running a 32-bit process.
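
For reference, here's roughly how such a benchmark is wired up, as a minimal sketch (the class name ParallelForeach2 matches the code below; everything else is standard BenchmarkDotNet usage):

using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // Discovers all [Benchmark] methods in the class and runs them,
        // including warm-up and multiple measured iterations.
        BenchmarkRunner.Run<ParallelForeach2>();
    }
}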

Establishing a Baseline

First, let’s see the straightforward solution:

private int ITEMS { get; set; } = 100000;

private int[] arr = null;

public ParallelForeach2()
{
    arr = new int[ITEMS];
    var rnd = new Random();
    for (int i = 0; i < ITEMS; i++)
    {
        arr[i] = rnd.Next(1000);
    }

}

[Benchmark]
public long RegularIterationLocalArr()
{
    long total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        total += arr[i];
    }

    return total;
}

The straightforward solution is as simple as it gets, right? Just iterate over the array and sum all the items.

Now, let’s try to beat it with some parallelism.

First Attempt at a Parallel Solution

The basic strategy for beating it is as follows:

  • Create X threads (as many as I have CPU cores)
  • Each thread will run over a range of the array and add up the values
  • Adding to the same variable from multiple threads won't work as-is; we'll need a locking mechanism or we'll get wrong results

Should be simple enough, I guess. Here's the code:

private object _lock = new object();

[Benchmark]
public long ThreadPoolWithLock()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                lock (_lock)
                {
                    total += arr[j];
                }
            }
        });
    }

    Task.WaitAll(tasks);
    return total;
}

Note that you have to use the localThread variable to "save" the value of iThread at that point in time. Otherwise, iThread would be a captured variable that changes as the for loop advances, and the tasks would see the wrong values.
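
To illustrate the pitfall, here's a quick sketch (DoWork is a hypothetical stand-in for the task body):

// Broken: every lambda captures the same iThread variable, and by the
// time a task actually runs, the loop may have advanced it (even to 8).
for (int iThread = 0; iThread < threads; iThread++)
{
    tasks[iThread] = Task.Run(() => DoWork(iThread)); // may see a wrong value
}

// Fixed: copy the value into a variable scoped to this iteration.
for (int iThread = 0; iThread < threads; iThread++)
{
    var localThread = iThread;
    tasks[iThread] = Task.Run(() => DoWork(localThread)); // captures the copy
}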

Let's use BenchmarkDotNet to compare the performance of the first parallel solution to the regular-iteration baseline (for 100,000 items):

| Method             |       Mean |      Error |     StdDev |
|------------------- |-----------:|-----------:|-----------:|
| RegularIteration   |   161.6 us |   1.992 us |   1.765 us |
| ThreadPoolWithLock | 8,147.0 us | 148.396 us | 138.810 us |

The measuring unit 'us' stands for microseconds: 1,000 us = 1 millisecond.

Well, that didn't go so well. It seems like either the overhead of creating the tasks or, more likely, the locking overhead is really hurting performance.

Maybe this can be improved with some optimizations.

Optimizing with Interlocked

The first possible optimization is to replace the lock with the Interlocked helper class. Consider the expression total += something. It reads from total, adds the value, and then writes back to total: two separate operations on the total variable. With multiple threads, another thread might operate on total between the read and the write, corrupting the data. Interlocked.Add(ref total, something) guarantees that the whole read-add-write happens as a single atomic operation (in-depth explanation here).
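
In other words, total += arr[j] behaves like the following sketch, leaving a window for another thread between the read and the write:

// What "total += arr[j]" actually does - three steps, not atomic:
long temp = total;  // 1. read ('total' may change right after this)
temp += arr[j];     // 2. add
total = temp;       // 3. write (may overwrite another thread's update)

// What Interlocked.Add does - the same read-add-write as one atomic operation:
Interlocked.Add(ref total, arr[j]);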

Here’s the implementation:

[Benchmark]
public long ThreadPoolInterlocked()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;    
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                Interlocked.Add(ref total, arr[j]);
            }
        });
    }

    Task.WaitAll(tasks);
    return total;
}

Comparing it to the previous implementation with lock gives the following results (for 100,000 items):

| Method                |        Mean |        Error |       StdDev |      Median |
|---------------------- |------------:|-------------:|-------------:|------------:|
| RegularIteration      |    159.3 us |     3.136 us |     4.694 us |    160.9 us |
| ThreadPoolWithLock    |  8,271.0 us |   157.944 us |   188.021 us |  8,310.9 us |
| ThreadPoolInterlocked | 26,994.1 us | 2,702.796 us | 7,489.435 us | 29,745.2 us |

Wow, it actually got much worse with Interlocked. It seems that the further I try to optimize, the worse the performance gets. I should probably stop while I can.

Actually, the reason this happens is quite sneaky. If you followed closely, you might have noticed that I'm running in a 32-bit process but using a long variable for the total sum. Since long is a 64-bit variable and the process is 32-bit, the processor can't update it atomically with one simple instruction, so the Interlocked operation is very expensive. Let's run the same benchmark in a 64-bit process:

| Method (64-bit)       |        Mean |      Error |     StdDev |
|---------------------- |------------:|-----------:|-----------:|
| RegularIteration      |    67.34 us |   1.297 us |   1.442 us |
| ThreadPoolWithLock    | 7,895.20 us | 156.038 us | 179.694 us |
| ThreadPoolInterlocked | 3,107.76 us |  20.343 us |  19.029 us |

In a 64-bit process, the Interlocked solution is more than twice as fast as the lock solution, and the regular-iteration time is also reduced. It seems that when working with long variables, performance is much better in a 64-bit process. All benchmarks from here on run in a 64-bit process.

So using Interlocked helped a bit, but the result is still far worse than the regular iterative approach.

If you're wondering at this point about the volatile keyword, it doesn't help here. The issue is that we both read and write from different threads. volatile can help with plain reads or plain writes, but it doesn't make a read-modify-write like += atomic, so using volatile without an additional lock will give wrong results. You can read more in Joseph Albahari's explanation.
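
Here's a quick sketch of the lost-update problem that volatile alone can't prevent (hypothetical values, for illustration):

// Two threads can interleave like this even with volatile semantics:
long observed = Volatile.Read(ref total); // threads A and B both read 100
observed += 5;                            // both compute 105
Volatile.Write(ref total, observed);      // both write 105 - one +5 is lost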

Optimizing with Parallel.For

Before going any further, let's try the fancy Parallel.For feature that was added as part of the Task Parallel Library (TPL) in .NET 4.0. It's an optimized way to run for loops on multiple threads. The code is:

[Benchmark]
public long ParallelFor()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts, new ParallelOptions(), (iter) =>
    {
        for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
        {
            Interlocked.Add(ref total, arr[j]);
        }
    });
    return total;
}

Comparing it with the baseline and the previous interlocked implementation:

| Method                |        Mean |     Error |    StdDev |
|---------------------- |------------:|----------:|----------:|
| RegularIteration      |    69.43 us |  1.373 us |  1.969 us |
| ThreadPoolInterlocked | 3,117.44 us | 13.520 us | 12.646 us |
| ParallelFor           | 3,028.13 us | 55.719 us | 52.119 us |

Note that Parallel.For doesn't promise to run all the iterations in parallel. For example, if you were to set the number of parts to 100 on a PC with only 8 cores, that would be impossible. It does run things in parallel as much as it can. You can set the maximum degree of parallelism with an additional configuration, like this: Parallel.For(from, to, new ParallelOptions() { MaxDegreeOfParallelism = 4 }, ...
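
Expanded into a full call, that configuration might look like this (a sketch reusing the same loop body as above):

var options = new ParallelOptions() { MaxDegreeOfParallelism = 4 };
Parallel.For(0, parts, options, (iter) =>
{
    // At most 4 of these bodies execute concurrently.
    for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
    {
        Interlocked.Add(ref total, arr[j]);
    }
});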

Alright. While the syntax is much nicer with Parallel.For, it seems only slightly more efficient than manually queuing thread-pool work, and still far less efficient than simply iterating over the array and summing the items.

Eliminating Suspects

The biggest suspect for hurting performance is the Interlocked.Add operation. Let's remove it and see what changes. Instead of Interlocked.Add(ref total, arr[j]), I replaced it with a single line that just reads from the array: x = arr[j]. Here are the results:

| Method                |     Mean |     Error |    StdDev |
|---------------------- |---------:|----------:|----------:|
| RegularIteration      | 57.76 us | 0.8526 us | 0.7976 us |
| ThreadPoolInterlocked | 34.21 us | 0.3934 us | 0.3487 us |
| ParallelFor           | 33.74 us | 0.6407 us | 0.7379 us |

Alright, now we're getting somewhere. Without the Interlocked call, the parallel code runs almost 2 times faster than the baseline, with both Parallel.For and the thread-pool strategy.

Let’s do some brainstorming. How can we sum the elements of an array without adding an Interlocked expression?

In the specific case of summing array elements, this can easily be done with per-thread locals: we'll keep a separate sum for each thread and add them up at the end. Parallel.For even has a helpful overload for exactly that, with a parameter called localFinally.

Parallel.For with Local Thread Totals

The idea is that each thread maintains its own local value: localInit creates the initial value, each iteration of the body receives the current localTotal and returns an updated one, and an additional delegate, Action<TLocal> localFinally, executes once per thread with that thread's final value as a parameter. In our case, there are 8 iterations, roughly one per thread. Each iteration returns a localTotal representing the sum of the elements in its part of the array, and localFinally adds those up. Here's the code:

[Benchmark]
public long ParallelForWithLocalFinally()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts, 
        localInit:() => 0L, // Initializes the "localTotal"
        body:(iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                localTotal += arr[j];
            }

            return localTotal;
        },
        localFinally:(localTotal) => { total += localTotal; });
    return total;
}

Now when running this against the simple regular-iteration sum:

| Method                      |        Mean |      Error |     StdDev |
|---------------------------- |------------:|-----------:|-----------:|
| RegularIteration            |    66.12 us |  0.8337 us |  0.7391 us |
| ParallelFor                 | 3,058.63 us | 58.6600 us | 65.2005 us |
| ParallelForWithLocalFinally |    36.23 us |  0.6024 us |  0.5635 us |

Great success! We were able to cut processing time roughly in half. Not by as much as the number of cores, but still, mission accomplished.

Going Deeper

There are still some unanswered questions. Why is the parallel version 2 times faster, and not 8 times, like the number of my cores? What happens with bigger and smaller arrays? Does the CPU memory cache play any role in this?

Maybe some more benchmarks will help us understand. Let's try different numbers of items:

| Method                      |  ITEMS |      Mean |     Error |    StdDev |
|---------------------------- |-------:|----------:|----------:|----------:|
| RegularIteration            |   5000 |  3.312 us | 0.0642 us | 0.0789 us |
| ParallelForWithLocalFinally |   5000 |  9.630 us | 0.2453 us | 0.7234 us |
| RegularIteration            |  10000 |  8.157 us | 0.1588 us | 0.2121 us |
| ParallelForWithLocalFinally |  10000 | 12.705 us | 0.2319 us | 0.2055 us |
| RegularIteration            |  30000 |  19.76 us | 0.2826 us | 0.2505 us |
| ParallelForWithLocalFinally |  30000 |  16.90 us | 0.2695 us | 0.2521 us |
| RegularIteration            | 100000 |  65.61 us | 0.9373 us | 0.8768 us |
| ParallelForWithLocalFinally | 100000 |  35.58 us | 0.6241 us | 0.5838 us |
| RegularIteration            | 250000 | 163.55 us |  1.910 us |  1.693 us |
| ParallelForWithLocalFinally | 250000 |  75.91 us |  1.679 us |  1.866 us |

So in our scenario, at about 20,000 elements the regular-iteration strategy is as effective as the parallel strategy. As the number of elements grows, the parallel strategy becomes more effective. At 250,000 elements, it's about 2.2 times faster than regular iteration.

Before drawing any conclusions, let's do another experiment. This time, instead of summing the elements as they are, we'll convert each element to a string and then parse it back to an int. This operation is much heavier and should take more time.

[Benchmark]
public double RegularIteration()
{
    double total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        total += int.Parse(arr[i].ToString());
    }

    return total;
}
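
A sketch of the parallel counterpart, following the same localFinally pattern as before (with the heavier parsing inside the loop):

[Benchmark]
public double ParallelForWithLocalFinally()
{
    double total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    Parallel.For(0, parts,
        localInit: () => 0.0,
        body: (iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                // The heavier per-element work: int -> string -> int.
                localTotal += int.Parse(arr[j].ToString());
            }
            return localTotal;
        },
        localFinally: (localTotal) => { total += localTotal; });
    return total;
}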

The results this time are:

| Method                      |  ITEMS |        Mean |      Error |       StdDev |      Median |
|---------------------------- |-------:|------------:|-----------:|-------------:|------------:|
| RegularIteration            |   1000 |    153.9 us |   3.021 us |     3.928 us |    153.0 us |
| ParallelForWithLocalFinally |   1000 |    104.5 us |   3.240 us |     9.554 us |    107.6 us |
| RegularIteration            |   5000 |    781.1 us |  15.340 us |    25.203 us |    780.6 us |
| ParallelForWithLocalFinally |   5000 |    412.2 us |  12.422 us |    36.625 us |    410.3 us |
| RegularIteration            |  10000 |  1,551.9 us |  29.551 us |    34.031 us |  1,548.4 us |
| ParallelForWithLocalFinally |  10000 |    763.4 us |  21.766 us |    64.179 us |    769.7 us |
| RegularIteration            | 100000 | 17,349.3 us | 748.256 us | 2,206.251 us | 16,212.8 us |
| ParallelForWithLocalFinally | 100000 |  6,238.6 us | 123.227 us |   280.650 us |  6,261.2 us |

As you can see, in this case the parallel solution is as much as 3 times faster than the regular-iteration solution for 100,000 elements. And it starts being more effective much earlier, at as few as 1,000 elements.

Conclusions and Summary

I admit that instead of taking the direct route from the challenge to the solution, we took the long, scenic route. But I wanted you to see some of the performance issues along the way. Here are some of my conclusions from this journey:

  1. Optimization with parallelism can definitely improve performance, but it can just as easily hurt it. It depends on many factors, and each case should be measured and checked.
  2. When threads need to coordinate with each other through some kind of locking mechanism, performance can degrade significantly.
  3. As the last benchmark showed, each use case is different. In the plain-sum benchmark, parallelism became effective above roughly 20,000 array elements. In the heavier scenario, where we converted integers to strings and back, parallelism was effective even at 1,000 elements.
  4. Interlocked performance depends on process bitness. In other words, don't use Interlocked on long variables in 32-bit processes.

Another thing to consider is the CPU cache. It's hard to measure exactly how it affects performance, but it's important to understand that caching happens under the hood, without our control or knowledge, and it tends to be more effective without parallelism. The fastest L1 cache is not shared between CPU cores, the mid-level L2 cache is usually shared between 2 CPU cores, and only the slowest L3 cache is shared between all the cores. Note that even the "slow" L3 cache is much faster than RAM.

So the article is at an end, and I still didn't fully figure out why the optimization wasn't more successful. I did learn a few things along the way, and I hope you did too. Feel free to share any insights in the comments section.

Comments

Anonymous · 6 years ago

This is a great blog post--a colleague of mine just recently switched our inner loop to use Parallel.ForEach to great improvement and would have really benefited from this background. However, in your solution that uses a localFinally delegate, you seem to assume that the localFinally blocks are all synchronized (called within a lock or some other version of synchronization) by the framework:

   localFinally:(localTotal) => { total += localTotal; });

From the docs on Parallel.ForEach:
"The localFinally delegate is invoked once per thread to perform a final action on each task's local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables."

So unless I missed something, the benchmark data you're getting on that version of the code is cheating a little, as the code is not actually thread-safe.

Anonymous · 6 years ago

Hi Tobias, it looks like I made a mistake, thanks for pointing it out.

As far as the benchmarks go, this shouldn't make a difference. The localFinally delegate is executed once per thread, so an Interlocked expression (or a lock) would only be executed 8 times (negligible).

Anonymous · 6 years ago

How about having an array of totals? Each slot would collect a particular thread's total, and after the threads are done, you sum them up. That would avoid any thread-unsafety, I think.

Anonymous · 6 years ago

Also an option, should work
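
For illustration, a minimal sketch of that idea (the method name is hypothetical; it assumes the same arr and ITEMS fields as the post, plus System.Linq for Sum):

public long ThreadPoolPerThreadTotals()
{
    int threads = 8;
    var partSize = ITEMS / threads;
    long[] totals = new long[threads];
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            // Accumulate in a local first; writing totals[localThread] on
            // every iteration could cause false sharing between cores.
            long local = 0;
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                local += arr[j];
            }
            totals[localThread] = local;
        });
    }
    Task.WaitAll(tasks);
    return totals.Sum(); // single-threaded at the end, no locking needed
}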

Anonymous · 6 years ago

Here's what you want: https://github.com/gatewayp...

The ThreadBlock takes the data to process as a collection, and then has a set of Action and Func definitions that can be applied, such as a Continue With on each thread and Continue With for the whole set. If the thread code returns a value, that value is held in a ConcurrentDictionary that can be processed at the end.

And yes, it's fast.

Anonymous · 6 years ago

Hi,

Have you tried with SIMD (Vector.Add in .NET)?
https://docs.microsoft.com/...

Thanks,
Meziantou

Anonymous · 6 years ago

Nope, but looking at it now, it looks like it could've made a difference. It lets me do multiple operations on a single core, right? Anyway, I'll look into it more.

Anonymous · 6 years ago

The idea of using SIMD here intrigued me, so I decided to do some testing on my own. I posted my results here: https://thegreatco.com/post...

Anonymous · 6 years ago

Read it now, really interesting how SIMD improves performance. I wonder how much it would be when including the "setup" of both array construction and vector construction.

Anonymous · 6 years ago

How would this look if you used async to coordinate the threads rather than Parallel.For? (Many recommendations now are to just async CPU-intensive code to avoid blocking, and showing the overhead of that in a context like this might be useful.)

Anonymous · 6 years ago

Hi Andrew,
Async is more useful when you combine it with "await": you start one task and asynchronously wait for it to finish, and when it's finished, you start another. This case is a bit different, since we launch 8 tasks in parallel and then wait for them all to finish. That's why Task.WaitAll() or Parallel.For is more suitable here.
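
For comparison, an async version of the same fan-out might look like this sketch (the method name and structure are illustrative):

public async Task<long> SumAsync()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    var tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            long local = 0;
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                local += arr[j];
            }
            Interlocked.Add(ref total, local); // once per task - negligible
        });
    }
    await Task.WhenAll(tasks); // frees the calling thread instead of blocking it
    return total;
}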

Anonymous · 6 years ago

> The straightforward solution is as simple as it gets, right?

Not quite. If you save the array in a local variable and iterate using .Length instead of ITEMS, you can cut down the loop from 14 instructions to 9:

https://sharplab.io/#gist:5...

I haven't measured it, but that could affect the performance quite a bit.

Also, I would like to see a comparison with PLINQ, which handles all the parallelization for you. Using that, the code would be:

return arr.AsParallel().Sum();

Anonymous · 6 years ago

Hi svick,

Thanks for the suggestions:
I benchmarked and got (with 'int' totals and 32-bit process):
| Method | Mean | Error | StdDev |
|-------------------------- |----------:|----------:|-----------:|
| RegularIteration | 83.87 us | 1.6813 us | 3.2792 us |
| RegularIterationWithLocal | 77.50 us | 0.6651 us | 0.6221 us |
| RegularIterationWithPLINQ | 251.90 us | 5.0310 us | 14.5959 us |

Adding the local variable indeed saves about 8%

Anonymous · 6 years ago

What happens if, instead of partitioning the array that way, you do it in the following way:
thread 1 sums positions 1, 9, 17, ... (1 + number-of-threads * n),
thread 2 sums positions 2, 10, 18, ... (2 + number-of-threads * n), and so on?
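
For reference, the per-task body of such a strided split might look like this sketch. Note that each thread then touches nearly every cache line of the array instead of one contiguous chunk, which tends to hurt memory locality:

// Thread k sums elements k, k + threads, k + 2*threads, ...
tasks[localThread] = Task.Run(() =>
{
    long local = 0;
    for (int j = localThread; j < ITEMS; j += threads)
    {
        local += arr[j];
    }
    Interlocked.Add(ref total, local); // once per task
});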

Anonymous · 6 years ago

us is µs or microsecond.
My first attempt would have been what you did for your last attempt. Just nowhere near as elegant.
I used to work with game programmers doing business software. They have a different mindset.

Anonymous · 6 years ago

I have tried a SIMD version which is pretty fast and is still single threaded:

| Method | Mean | Error | StdDev |
|----------------------------- |----------:|----------:|----------:|
| RegularIterationLocalArr | 56.257 us | 0.1258 us | 0.1116 us |
| Regular_MultipleEntries | 45.188 us | 0.1584 us | 0.1323 us |
| Regular_MultipleEntries_AVX2 | 9.042 us | 0.0166 us | 0.0147 us |

This is nearly 6 times faster on a single core, which is not bad.
See https://mijailovic.net/2018... for how you can get even more perf out of this.

This uses .NET Core 3.0 Preview:

[Benchmark]
unsafe public long Regular_MultipleEntries_AVX2()
{
    long total = 0;
    Vector256<int> result = Vector256<int>.Zero;
    const int VectorSizeInInts = 8;

    Span<int> data = myArr;
    fixed (int* ptr = myArr)
    {
        int i = 0;
        for (; i < data.Length - VectorSizeInInts; i += VectorSizeInInts)
        {
            var current = Avx.LoadVector256(ptr + i);
            result = Avx2.Add(current, result);
        }

        var temp = stackalloc int[VectorSizeInInts];
        Avx.Store(temp, result);

        total = Sum(temp, VectorSizeInInts);
        total += Sum(ptr + i, data.Length - i);

        return total;
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe static long Sum(int* source, int length)
{
    long sum = 0;

    for (int i = 0; i < length; ++i)
    {
        sum += source[i];
    }

    return sum;
}

Anonymous · 6 years ago

Awesome! Thanks for this

Anonymous · 6 years ago

As a professor of mine used to say: If it doesn't have to be correct, I can make it as fast as I want it ;) Joking aside, the shown solutions have a few problems.

A few tests would show that if your item count is not a multiple of your number of parts (core count, I guess), your parallel algorithms are incorrect. Then there's also the fact that all the more complex algorithms forego the required synchronization when adding to total (e.g. in your localFinally expression), but good luck getting that to turn up with normal tests on a desktop x86 machine. (You also switch back to ints instead of longs in places, which probably influences the results as well.)

All easy to fix, but it goes to show how complex it is to correctly write parallel code even for something as trivial as summing up an array.

Anyhow, to the interesting part. The problem seems pretty simple: you're memory-bound and not CPU-bound, so throwing more cores at the problem only helps until you saturate the throughput of your memory controller. That's easy to test by using params for the core count. For my ancient i5 at home, it scales reasonably well from 1 to 2 cores, but then there's no difference at all between 3 and 4 cores. It'd be relatively easy to see what throughput we achieve and how close to the theoretical maximum we are (probably not more than 70%, I fear).

You could improve this by helping the CPU with the prefetcher and using SIMD, but the C# support for these things is rather lacking - the Unity people are working on some interesting approaches there though. If you wanted to do this for production code, this would be pretty much required for good performance.

I don't think there should be any problem with false sharing given the way the code is structured, but it's always something to keep in mind.

Anonymous · 6 years ago

Awesome, Daniel, thanks for the insights! I now realize the problems you mentioned, but like you said, they are easy to fix and don't affect the benchmark results.
See one of the other comments, where akraus1 offered a SIMD solution.

Anonymous · 6 years ago

Great and detailed research!

One point that might be confusing is that you say you are creating threads, but in your code you use Task.Run(), which basically schedules work on a ThreadPool, which might already contain threads ready for execution.

Anonymous · 6 years ago

Thanks. You're right, didn't mean to confuse there. I'll see about rephrasing

Anonymous · 6 years ago

Sorry if I sound like a grumpy uncle, but summing up locally and adding at the end was the only parallelism solution that came to mind when I read the title of this article. With all the fancy methods and jargon and a truckload of analysis for that, it felt like a waste of time.

Looks like technology is making us dumb.

Anonymous · 6 years ago

Well, as I said in the end, I took the long scenic route instead of the direct path :)
What I wanted to say is that even though you might "jump" straight to the correct solution, the benchmarks along the way can offer some insights into parallelism issues.

Anonymous · 6 years ago

Hi,
I think you are mixing things up: what works for you will not work for someone else, because the same code has different results on different hardware. Also, have you tried summing the array regularly and letting the CPU and compiler do the optimizations for you?
I have done something similar here: https://github.com/idormenc...
If you have time, try to compare my sum examples; try using arrays of doubles and decimals, and you will see that sometimes your optimization doesn't work.

PS: never try to estimate the efficiency of a code block by only looking at it.