How to Beat Array Iteration Performance with Parallelism in C# .NET

Beating Array Iteration Performance with Parallelism in C# .NET

Let’s consider a simple programming challenge: Summing all elements in a large array.

Now it stands to reason that this can be easily optimized by using parallelism. Especially for huge arrays with thousands or millions of elements. It also stands to reason that the processing time with parallelism should take as much as regular time divided by the number of CPU cores.

As it turns out, this feat is not that easy to achieve. I’ll show you several ways to do this in parallel, how they improve or degrade performance and all the little details that affect performance one way or the other.

For all Benchmarks, I use the excellent BenchmarkDotNet library. It takes care to run several iterations of each benchmark, do warm-up iterations and so on. My PC is Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), with a 32bit process.

Establishing a Baseline

First, let’s see the straightforward solution:

The straightforward solution is as simple as it gets, right? Just iterate over the array and sum all the items.

Now, let’s try to beat it with some parallelism.

First attempt at a Parallel Solution

The basic beating-it strategy is as follows:

  • Create X threads (as much as I have CPU Cores)
  • Each thread will run over a range in the array and add up the values
  • Adding to the same variable won’t work from multiple threads, we will need a locking mechanism or get wrong results.

Should be simple enough I guess. Here’s the code:

Note that you have to use a localThread variable to “save” the value of iThread at that point in time. Otherwise, it would be a captured variable that changes as the for loop advances.

Let’s use BenchmarkDotNet to compare the performance of the first parallel solution to the regular-iteration baseline (for 100,000 Items):

RegularIteration161.6 us1.992 us1.765 us
ThreadPoolWithLock8,147.0 us148.396 us138.810 us
The measuring unit ‘us’ stands for microseconds. 1000 us = 1 millisecond

Well, that didn’t do so well. Seems like either the overhead of creating threads or more likely the locking overhead is really hurting performance.

Maybe this can be improved with some optimizations.

Optimizing with Interlocked

The first possible optimization we can do is change the locked with the Interlocked helper class. Consider the expression total += something. It both reads from total, adds items, and then writes back to total. These are 2 operations on the total variable. So with multiple threads, a different thread might do an operation on total during the read and write, corrupting the data. Interlocked.Add(ref total, something) guarantees that this will be a single atomic operation (in-depth explanation here).

Here’s the implementation:

Comparing it to the previous implementation with lock gives the following results (for 100,000 Items):

RegularIteration159.3 us3.136 us4.694 us160.9 us
ThreadPoolWithLock8,271.0 us157.944 us188.021 us8,310.9 us
ThreadPoolInterlocked26,994.1 us2,702.796 us7,489.435 us29,745.2 us

Wow, it actually got much worse when using Interlocked. It seems that the further I try to optimize the worst the performance gets. I should probably stop while I can.

Actually, the reason why this happens is quite sneaky. If you followed closely, you might have noticed that I’m running in a 32-bit process but using a long variable for the total sum. Since long is a 64-bit size variable, and the process is 32-bit, the Interlocked operation is very expensive. Let’s run the same benchmark in a 64-bit process:

Method (64-bit)MeanErrorStdDev
RegularIteration67.34 us1.297 us1.442 us
ThreadPoolWithLock7,895.20 us156.038 us179.694 us
ThreadPoolInterlocked3,107.76 us20.343 us19.029 us

In a 64-bit process, the Interlocked solution is more than twice as fast as the lock solution. And the regular-iteration performance time is now also reduced. It seems that when working with long variables the performance is much better with 64-bit processes. All the next benchmarks from now on will be with a 64-bit process.

So using Interlocked helped a bit, but the result is still far worse than using regular iterative approach.

If you wonder at this point about the volatile keyword, it actually doesn’t help. The issue is that we both read and write from different threads. volatile might work just for read or just for write operations. So by using volatile without an additional lock will give the wrong result. You can read more in Joseph Albahari’s explaination.

Optimizing with Parallel.For

Before going any further, let’s try the fancy Parallel.For feature that was added as part of Task Parallel Library (TPL) in .NET 4.0. This is an optimized way to be able to run for loops on multiple threads. The code is:

Comparing it with the baseline and the previous interlocked implementation:

RegularIteration69.43 us1.373 us1.969 us
ThreadPoolInterlocked3,117.44 us13.520 us12.646 us
ParallelFor3,028.13 us55.719 us52.119 us
Note that Parallel.For doesn’t promise to run all the iterations in parallel. For example, if you were to set the number of parts to 100 and the PC only has 8 cores it’s impossible. It does optimize to run in parallel as much as possible. You can actually set the maximum number of “parallelism” with an additional configuration like this: Parallel.For(from, to, new ParallelOptions() {MaxDegreeOfParallelism = 4}, ...

Alright, so while the syntax is much nicer with Parallel.For, it seems to be just slightly more efficient than manually queuing thread-pool threads. And still far less efficient than just iterating over the array and summing up items.

Eliminating Suspects

The biggest suspect to hurt performance is the Interlocked.Add operation. Let’s just comment it out and see what changes. Instead of Interlocked.Add(ref total, arr[j]) I replaced it with one line that reads from the array x = arr[j]. Here are the results:

RegularIteration57.76 us0.8526 us0.7976 us
ThreadPoolInterlocked34.21 us0.3934 us0.3487 us
ParallelFor33.74 us0.6407 us0.7379 us

Alright, now we’re getting somewhere here. Without the Interlocked line of code, the parallel code runs almost 2 times faster with Parallel.For and the thread-pool strategy.

Let’s do some brainstorming. How can we sum the elements of an array without adding an Interlocked expression?

In the specific case of summing array elements, it can be easily done with per-thread locals. We’ll have a thread-sum for each thread and just add them up in the end. Parallel.For even has a helpful functionality for that. There’s an overload with a parameter called localFinally.

Parallel.For with local thread totals

The idea is that each iteration will return a value. An additional delegate Action will execute at the end of each iteration, with the value as a parameter. In our case, there are 8 iterations, 1 for each thread. Each iteration will return a localTotal which represents the sum total of elements in that part of the array. And the localFinally will sum those up. Here’s the code:

Now when running this against the simple regular-iteration sum:

RegularIteration66.12 us0.8337 us0.7391 us
ParallelFor3,058.63 us58.6600 us65.2005 us
ParallelForWithLocalFinally36.23 us0.6024 us0.5635 us

Great success! We were able to minimize processing time by about 50%. Not as much as the number of cores, but still, mission successful.

Going Deeper

There are still some questions unanswered. Why is the parallelism 2 times faster and not 8, like the number of my cores? What happens with bigger and smaller arrays? Does CPU memory cache play any role in this?

Maybe some more benchmarks will help us understand. Let’s try with a different amounts of items:

RegularIteration50003.312 us0.0642 us0.0789 us
ParallelForWithLocalFinally50009.630 us0.2453 us0.7234 us
RegularIteration100008.157 us0.1588 us0.2121 us
ParallelForWithLocalFinally1000012.705 us0.2319 us0.2055 us
RegularIteration3000019.76 us0.2826 us0.2505 us
ParallelForWithLocalFinally3000016.90 us0.2695 us0.2521 us
RegularIteration10000065.61 us0.9373 us0.8768 us
ParallelForWithLocalFinally10000035.58 us0.6241 us0.5838 us
RegularIteration250000163.55 us1.910 us1.693 us
ParallelForWithLocalFinally25000075.91 us1.679 us1.866 us

So in our scenario, at about 20,000 elements, the regular-iteration strategy is as effective as the parallel strategy. As the number of elements grow, the parallel strategy becomes more effective. At 250,000 elements, the parallel strategy is about 2.2 times faster than regular iteration strategy.

Before drawing any conclusions, let’s do another experiment. This time, instead of summing the elements as they are, we’ll change to string and then parse them back to int. This operation is much heavier and should take more time.

The results this time are:

RegularIteration1000153.9 us3.021 us3.928 us153.0 us
ParallelForWithLocalFinally1000104.5 us3.240 us9.554 us107.6 us
RegularIteration5000781.1 us15.340 us25.203 us780.6 us
ParallelForWithLocalFinally5000412.2 us12.422 us36.625 us410.3 us
RegularIteration100001,551.9 us29.551 us34.031 us1,548.4 us
ParallelForWithLocalFinally10000763.4 us21.766 us64.179 us769.7 us
RegularIteration10000017,349.3 us748.256 us2,206.251 us16,212.8 us
ParallelForWithLocalFinally1000006,238.6 us123.227 us280.650 us6,261.2 us

As you can see, in this case, the parallel solution is as much as 3 times faster than the regular-iteration solution for 100,000 elements. And it starts being more effective for smaller arrays, with as much as 1000 elements.

Conclusions and Summary

I admit that instead of taking the direct route from the challenge to the solution, we kind of took the long, scenic route. But I wanted you to see some of the performance issues along the way. Here are some of my conclusions from this journey:

  1. Optimization with Parallelism can definitely improve performance, but it can easily hurt it as well. It depends on a lot of factors and each case should be measured and checked.
  2. When the various threads need to rely on each other with some kind of locking mechanism, the performance can degrade significantly.
  3. As the last benchmark proved, each use-case is different. For example, in the regular benchmark, parallelism became effective on more than 20,000 array elements. However, in the more complex scenario where we parsed integers to string and back, parallelism was effective even for 1000 elements.
  4. Interlocked works well according to process bitness. In other words, don’t use Interlocked for long variables in 32-bit processes.

Another thing to consider is CPU Cache. It’s hard to measure how exactly it affects the performance, but it’s important to understand that caching happens under the hood without our control or knowledge and it’s more effective without parallelism. The fastest L1 cache is not shared between CPU cores. The mid-level L2 cache is usually shared between 2 CPU cores and only the slowest L3 cache is shared between all the cores. Note that even the “slow” L3 cache is much faster than RAM.

So the article is at an end and I still didn’t figure out why the optimization wasn’t more successful. I did learn a few things and I hope you did too. Feel free to share any insight in the comments section.


Enjoy the blog? I would love you to subscribe! Performance Optimizations in C#: 10 Best Practices (exclusive article)

Want to become an expert problem solver? Check out a chapter from my book Practical Debugging for .NET Developers

24 thoughts on “How to Beat Array Iteration Performance with Parallelism in C# .NET”

  1. This is a great blog post–a colleague of mine just recently switched our inner loop to use Parallel.ForEach to great improvement and would have really benefited from this background. However, in your solution that uses a localFinally delegate, you seem to assume that the localFinally blocks are all synchronized (called within a lock or some other version of synchronization) by the framework:

    localFinally:(localTotal) => { total += localTotal; });

    From the docs on Parallel.ForEach:
    “The localFinally delegate is invoked once per thread to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.”

    So unless I missed something, the benchmark data you’re getting on that version of the code is cheating a little, as the code is not actually thread-safe.

    1. Hi Tobias, it looks like I made a mistake, thanks for pointing it out.

      As far as the benchmarks go, this shouldn’t make a difference. The ‘localFinally’ is executed once per thread. So an Interlocked expression (or lock) will be executed 8 times, (negligible)

      1. How about to have an array of totals, each would collect a particular threads total and after threading is done – sum them up. That would avoid any thread unsafety I think.

        1. Here’s what you want:

          The ThreadBlock takes the data to process as a collection, and then has a set of Action and Func definitions that can be applied, such as a Continue With on each thread and Continue With for the whole set. If the thread code returns a value, that value is held in a ConcurrentDictionary that can be processed at the end.

          And yes, it’s fast.

    1. Nope, but looking now and looks like it could’ve made a difference. I can make multiple operations on a single core, right? Anyway I’ll look into it more

        1. Read it now, really interesting how SIMD improves performance. I wonder how much it would be when including the “setup” of both array construction and vector construction.

  2. How would this look if you used async to coordinate the threads target than parallel.for? (Many recommendations are now to just async cpu intensive code to avoid blocking, and showing the overhead of this in a context like this might be useful. )

    1. Hi Andrew,
      Async is more useful when you combine it with “await”. In other words, you start one task and asynchronously wait for it to finish. When finished you start another. This case is a bit different since we launch 8 tasks in parallel and then wait for them all to finish. That’s why Task.WaitAll() or Parallel.For is more suitable here

  3. I have tried a SIMD version which is pretty fast and is still single threaded:

    | Method | Mean | Error | StdDev |
    |—————————– |———-:|———-:|———-:|
    | RegularIterationLocalArr | 56.257 us | 0.1258 us | 0.1116 us |
    | Regular_MultipleEntries | 45.188 us | 0.1584 us | 0.1323 us |
    | Regular_MultipleEntries_AVX2 | 9.042 us | 0.0166 us | 0.0147 us |

    This is nearly 6 times faster on a single core which is not bad.
    See how you can get even more perf out of this.

    This uses .NET Core 3.0 Preview:

    unsafe public long Regular_MultipleEntries_AVX2()
    long total = 0;
    Vector256 result = Vector256.Zero;
    const int VectorSizeInInts = 8;

    Span data = myArr;
    fixed (int* ptr = myArr)
    int i = 0;
    for (; i < data.Length-VectorSizeInInts; i += VectorSizeInInts)
    var current = Avx.LoadVector256(ptr + i);
    result = Avx2.Add(current, result);

    var temp = stackalloc int[VectorSizeInInts];
    Avx.Store(temp, result);

    total = Sum(temp, VectorSizeInInts);
    total += Sum(ptr + i, data.Length – i);

    return total;

    private unsafe static long Sum(int* source, int length)
    long sum = 0;

    for (int i = 0; i < length; ++i)
    sum += source[i];

    return sum;

  4. As a professor of mine used to say: If it doesn’t have to be correct, I can make it as fast as I want it 😉 Joking aside, the shown solutions have a few problems.

    A few tests would show that if your item size is not a multiple of your parts (core count I guess) your parallel algorithms are incorrect. Then there’s also the fact that all the more complex algorithms forego the required synchronization when adding to total (e.g. in your localFinally expression), but good luck getting that to turn up with normal tests on a desktop x86 machine. (you also switch back to ints instead of longs partially which probably influences the result as well).

    All easy to fix, but it goes to show how complex it is to correctly write parallel code even for something as trivial as summing up an array.

    Anyhow, to the interesting part. The problem seems pretty simple: You’re memory and not CPU bound, so throwing more cores at the problem only helps until you maximize the throughput of your memory controller. That’s easy to test by using params for the core count. For my ancient i5 at home it scales reasonably well from 1 to 2 cores, but then there’s no difference at all between 3 and 4 cores. It’d be relatively easy to see what throughput we achieve and how close to the theoretical maximum we are (probably not more than 70% I fear)

    You could improve this by helping the CPU with the prefetcher and using SIMD, but the C# support for these things is rather lacking – the Unity people are working on some interesting approaches there though. If you wanted to do this for production code, this would be pretty much required for good performance.

    I don’t think there should be any problem with false sharing given the way the code is structured, but it’s always something to keep in mind.

    1. Awesome Daniel, thanks for the insights! I now realize the problems you mentioned but like you said they are easy to fix and don’t affect the benchmark results.
      See in one of the comments akraus1 offered a solution with SIMD

  5. Sorry if I sound like a grumpy uncle, but summing up locally and adding at the end was the only parallellism solution that came to mind when I read the title of this article. With all fancy methods and jargons and a truck load of analysis for that, was a waste of time.

    Looks like technology is making us dumb.

    1. Well, as I said in the end, I took the long scenic route instead of the direct path:)
      What I wanted to say was that even though you might “jump” straight to the correct solution, in the end, the benchmarks during the article can offer some insights to parallelism issues.

  6. Hi,
    I think you are mixing up things , what is working for you will not work for someone else because same code has different results on different hardware.
    Also have you tried to sum array regularly and let CPU and compiler do optimizations for you ?
    I have done something similarly here .
    If you have time try to compare my sum examples , try to use array of doubles and decimal and you will see that sometimes your optimization is not working.

    PS: never try to estimate efficiency of a code block only looking at it.

  7. > The straightforward solution is as simple as it gets, right?

    Not quite. If you save the array in a local variable and iterate using .Length instead of ITEMS, you can cut down the loop from 14 instructions to 9:

    I haven’t measured it, but that could affect the performance quite a bit.

    Also, I would like to see a comparison with PLINQ, which handles all the parallelization for you. Using that, the code would be:

    return arr.AsParallel().Sum();

    1. Hi svick,

      Thanks for the suggestions:
      I benchmarked and got (with ‘int’ totals and 32-bit process):
      | Method | Mean | Error | StdDev |
      |————————– |———-:|———-:|———–:|
      | RegularIteration | 83.87 us | 1.6813 us | 3.2792 us |
      | RegularIterationWithLocal | 77.50 us | 0.6651 us | 0.6221 us |
      | RegularIterationWithPLINQ | 251.90 us | 5.0310 us | 14.5959 us |

      Adding the local variable indeed saves about 8%

  8. what happens if instead of partitioning the arrangement in that way you do it in the following way:
    thread 1 sum position 1,9, 17 … (1 + number of threads * n)
    thread 2 sum 2,10,18, … (2 + number of threads * n) and so?

  9. Great and detailed research!

    One point that might be confusing is that you say you are creating threads, but in your code you use Task.Run(), which basically schedules work on a ThreadPool, which might already contain threads ready for execution.

Comments are closed.