Let’s consider a simple programming challenge: Summing all elements in a large array.

Now it stands to reason that this can be easily optimized by using parallelism. Especially for huge arrays with thousands or millions of elements. It also stands to reason that the processing time with parallelism should take as much as regular time divided by the number of CPU cores.

As it turns out, this feat is not that easy to achieve. I’ll show you several ways to do this in parallel, how they improve or degrade performance and all the little details that affect performance one way or the other.

For all Benchmarks, I use the excellent BenchmarkDotNet library. It takes care to run several iterations of each benchmark, do warm-up iterations and so on. My PC is Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), with a 32bit process.

Establishing a Baseline

First, let’s see the straightforward solution:

The straightforward solution is as simple as it gets, right? Just iterate over the array and sum all the items.

Now, let’s try to beat it with some parallelism.

First attempt at a Parallel Solution

The basic beating-it strategy is as follows:

  • Create X threads (as much as I have CPU Cores)
  • Each thread will run over a range in the array and add up the values
  • Adding to the same variable won’t work from multiple threads, we will need a locking mechanism or get wrong results.

Should be simple enough I guess. Here’s the code:

Note that you have to use a localThread variable to “save” the value of iThread at that point in time. Otherwise, it would be a captured variable that changes as the for loop advances.

Let’s use BenchmarkDotNet to compare the performance of the first parallel solution to the regular-iteration baseline (for 100,000 Items):

RegularIteration161.6 us1.992 us1.765 us
ThreadPoolWithLock8,147.0 us148.396 us138.810 us
The measuring unit ‘us’ stands for microseconds. 1000 us = 1 millisecond

Well, that didn’t do so well. Seems like either the overhead of creating threads or more likely the locking overhead is really hurting performance.

Maybe this can be improved with some optimizations.

Optimizing with Interlocked

The first possible optimization we can do is change the locked with the Interlocked helper class. Consider the expression total += something. It both reads from total, adds items, and then writes back to total. These are 2 operations on the total variable. So with multiple threads, a different thread might do an operation on total during the read and write, corrupting the data. Interlocked.Add(ref total, something) guarantees that this will be a single atomic operation (in-depth explanation here).

Here’s the implementation:

Comparing it to the previous implementation with lock gives the following results (for 100,000 Items):

RegularIteration159.3 us3.136 us4.694 us160.9 us
ThreadPoolWithLock8,271.0 us157.944 us188.021 us8,310.9 us
ThreadPoolInterlocked26,994.1 us2,702.796 us7,489.435 us29,745.2 us

Wow, it actually got much worse when using Interlocked. It seems that the further I try to optimize the worst the performance gets. I should probably stop while I can.

Actually, the reason why this happens is quite sneaky. If you followed closely, you might have noticed that I’m running in a 32-bit process but using a long variable for the total sum. Since long is a 64-bit size variable, and the process is 32-bit, the Interlocked operation is very expensive. Let’s run the same benchmark in a 64-bit process:

Method (64-bit)MeanErrorStdDev
RegularIteration67.34 us1.297 us1.442 us
ThreadPoolWithLock7,895.20 us156.038 us179.694 us
ThreadPoolInterlocked3,107.76 us20.343 us19.029 us

In a 64-bit process, the Interlocked solution is more than twice as fast as the lock solution. And the regular-iteration performance time is now also reduced. It seems that when working with long variables the performance is much better with 64-bit processes. All the next benchmarks from now on will be with a 64-bit process.

So using Interlocked helped a bit, but the result is still far worse than using regular iterative approach.

If you wonder at this point about the volatile keyword, it actually doesn’t help. The issue is that we both read and write from different threads. volatile might work just for read or just for write operations. So by using volatile without an additional lock will give the wrong result. You can read more in Joseph Albahari’s explaination.

Optimizing with Parallel.For

Before going any further, let’s try the fancy Parallel.For feature that was added as part of Task Parallel Library (TPL) in .NET 4.0. This is an optimized way to be able to run for loops on multiple threads. The code is:

Comparing it with the baseline and the previous interlocked implementation:

RegularIteration69.43 us1.373 us1.969 us
ThreadPoolInterlocked3,117.44 us13.520 us12.646 us
ParallelFor3,028.13 us55.719 us52.119 us
Note that Parallel.For doesn’t promise to run all the iterations in parallel. For example, if you were to set the number of parts to 100 and the PC only has 8 cores it’s impossible. It does optimize to run in parallel as much as possible. You can actually set the maximum number of “parallelism” with an additional configuration like this: Parallel.For(from, to, new ParallelOptions() {MaxDegreeOfParallelism = 4}, ...

Alright, so while the syntax is much nicer with Parallel.For, it seems to be just slightly more efficient than manually queuing thread-pool threads. And still far less efficient than just iterating over the array and summing up items.

Eliminating Suspects

The biggest suspect to hurt performance is the Interlocked.Add operation. Let’s just comment it out and see what changes. Instead of Interlocked.Add(ref total, arr[j]) I replaced it with one line that reads from the array x = arr[j]. Here are the results:

RegularIteration57.76 us0.8526 us0.7976 us
ThreadPoolInterlocked34.21 us0.3934 us0.3487 us
ParallelFor33.74 us0.6407 us0.7379 us

Alright, now we’re getting somewhere here. Without the Interlocked line of code, the parallel code runs almost 2 times faster with Parallel.For and the thread-pool strategy.

Let’s do some brainstorming. How can we sum the elements of an array without adding an Interlocked expression?

In the specific case of summing array elements, it can be easily done with per-thread locals. We’ll have a thread-sum for each thread and just add them up in the end. Parallel.For even has a helpful functionality for that. There’s an overload with a parameter called localFinally.

Parallel.For with local thread totals

The idea is that each iteration will return a value. An additional delegate Action will execute at the end of each iteration, with the value as a parameter. In our case, there are 8 iterations, 1 for each thread. Each iteration will return a localTotal which represents the sum total of elements in that part of the array. And the localFinally will sum those up. Here’s the code:

Now when running this against the simple regular-iteration sum:

RegularIteration66.12 us0.8337 us0.7391 us
ParallelFor3,058.63 us58.6600 us65.2005 us
ParallelForWithLocalFinally36.23 us0.6024 us0.5635 us

Great success! We were able to minimize processing time by about 50%. Not as much as the number of cores, but still, mission successful.

Going Deeper

There are still some questions unanswered. Why is the parallelism 2 times faster and not 8, like the number of my cores? What happens with bigger and smaller arrays? Does CPU memory cache play any role in this?

Maybe some more benchmarks will help us understand. Let’s try with a different amounts of items:

RegularIteration50003.312 us0.0642 us0.0789 us
ParallelForWithLocalFinally50009.630 us0.2453 us0.7234 us
RegularIteration100008.157 us0.1588 us0.2121 us
ParallelForWithLocalFinally1000012.705 us0.2319 us0.2055 us
RegularIteration3000019.76 us0.2826 us0.2505 us
ParallelForWithLocalFinally3000016.90 us0.2695 us0.2521 us
RegularIteration10000065.61 us0.9373 us0.8768 us
ParallelForWithLocalFinally10000035.58 us0.6241 us0.5838 us
RegularIteration250000163.55 us1.910 us1.693 us
ParallelForWithLocalFinally25000075.91 us1.679 us1.866 us

So in our scenario, at about 20,000 elements, the regular-iteration strategy is as effective as the parallel strategy. As the number of elements grow, the parallel strategy becomes more effective. At 250,000 elements, the parallel strategy is about 2.2 times faster than regular iteration strategy.

Before drawing any conclusions, let’s do another experiment. This time, instead of summing the elements as they are, we’ll change to string and then parse them back to int. This operation is much heavier and should take more time.

The results this time are:

RegularIteration1000153.9 us3.021 us3.928 us153.0 us
ParallelForWithLocalFinally1000104.5 us3.240 us9.554 us107.6 us
RegularIteration5000781.1 us15.340 us25.203 us780.6 us
ParallelForWithLocalFinally5000412.2 us12.422 us36.625 us410.3 us
RegularIteration100001,551.9 us29.551 us34.031 us1,548.4 us
ParallelForWithLocalFinally10000763.4 us21.766 us64.179 us769.7 us
RegularIteration10000017,349.3 us748.256 us2,206.251 us16,212.8 us
ParallelForWithLocalFinally1000006,238.6 us123.227 us280.650 us6,261.2 us

As you can see, in this case, the parallel solution is as much as 3 times faster than the regular-iteration solution for 100,000 elements. And it starts being more effective for smaller arrays, with as much as 1000 elements.

Conclusions and Summary

I admit that instead of taking the direct route from the challenge to the solution, we kind of took the long, scenic route. But I wanted you to see some of the performance issues along the way. Here are some of my conclusions from this journey:

  1. Optimization with Parallelism can definitely improve performance, but it can easily hurt it as well. It depends on a lot of factors and each case should be measured and checked.
  2. When the various threads need to rely on each other with some kind of locking mechanism, the performance can degrade significantly.
  3. As the last benchmark proved, each use-case is different. For example, in the regular benchmark, parallelism became effective on more than 20,000 array elements. However, in the more complex scenario where we parsed integers to string and back, parallelism was effective even for 1000 elements.
  4. Interlocked works well according to process bitness. In other words, don’t use Interlocked for long variables in 32-bit processes.

Another thing to consider is CPU Cache. It’s hard to measure how exactly it affects the performance, but it’s important to understand that caching happens under the hood without our control or knowledge and it’s more effective without parallelism. The fastest L1 cache is not shared between CPU cores. The mid-level L2 cache is usually shared between 2 CPU cores and only the slowest L3 cache is shared between all the cores. Note that even the “slow” L3 cache is much faster than RAM.

So the article is at an end and I still didn’t figure out why the optimization wasn’t more successful. I did learn a few things and I hope you did too. Feel free to share any insight in the comments section.

Subscribe to get post updates and: Performance Optimizations in C# .NET: 10 Best Practices