Let’s consider a simple programming challenge: Summing all elements in a large array.
Now it stands to reason that this can be easily optimized with parallelism, especially for huge arrays with thousands or millions of elements. It also stands to reason that with parallelism, the processing time should be roughly the sequential time divided by the number of CPU cores.
As it turns out, this feat is not that easy to achieve. I’ll show you several ways to do this in parallel, how they improve or degrade performance, and all the little details that affect performance one way or the other.
For all benchmarks, I use the excellent BenchmarkDotNet library. It takes care of running several iterations of each benchmark, performing warm-up iterations, and so on. My PC is an Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), with a 32-bit process.
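For reference, here’s a minimal harness for running these benchmarks (the Program wrapper is my own sketch, not measured code; ParallelForeach2 is the benchmark class whose setup appears below):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // BenchmarkDotNet discovers every [Benchmark] method in the class
        // and takes care of warm-up, iterations, and statistics.
        BenchmarkRunner.Run<ParallelForeach2>();
    }
}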
Establishing a Baseline
First, let’s see the straightforward solution:
private int ITEMS { get; set; } = 100000;
private int[] arr = null;
private int parts = 8;  // number of chunks for the parallel benchmarks
private int partSize;   // elements per chunk

public ParallelForeach2()
{
    // Fill the array with random values once, before any benchmark runs
    arr = new int[ITEMS];
    var rnd = new Random();
    for (int i = 0; i < ITEMS; i++)
    {
        arr[i] = rnd.Next(1000);
    }
    partSize = ITEMS / parts;
}
[Benchmark]
public long RegularIteration()
{
    long total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        total += arr[i];
    }
    return total;
}
The straightforward solution is as simple as it gets, right? Just iterate over the array and sum all the items.
Now, let’s try to beat it with some parallelism.
First Attempt at a Parallel Solution
The basic strategy for beating the baseline is as follows:
- Create X threads (as many as I have CPU cores)
- Each thread runs over a range of the array and adds up the values
- Adding to the same variable from multiple threads requires a locking mechanism; without it, we’ll get wrong results.
Should be simple enough I guess. Here’s the code:
private object _lock = new object();

[Benchmark]
public long ThreadPoolWithLock()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            // Each task sums its own slice of the array
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                // The shared total must be protected from concurrent access
                lock (_lock)
                {
                    total += arr[j];
                }
            }
        });
    }
    Task.WaitAll(tasks);
    return total;
}
Note that you have to use a localThread variable to “save” the value of iThread at that point in time. Otherwise, it would be a captured variable that changes as the for loop advances.
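If you haven’t run into this pitfall before, here’s a stand-alone sketch of my own (not part of the benchmarks) that shows the difference:

// All lambdas capture the SAME iThread variable, which keeps changing.
// By the time the tasks run, the loop has usually finished and iThread is 8.
var tasks = new Task[8];
for (int iThread = 0; iThread < 8; iThread++)
{
    tasks[iThread] = Task.Run(() => Console.WriteLine(iThread)); // often prints 8 eight times
}
Task.WaitAll(tasks);

// With a local copy, each lambda captures its own snapshot:
for (int iThread = 0; iThread < 8; iThread++)
{
    var localThread = iThread;
    tasks[iThread] = Task.Run(() => Console.WriteLine(localThread)); // prints 0..7 in some order
}
Task.WaitAll(tasks);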
Let’s use BenchmarkDotNet to compare the performance of the first parallel solution to the regular-iteration baseline (for 100,000 Items):
Method | Mean | Error | StdDev |
---|---|---|---|
RegularIteration | 161.6 us | 1.992 us | 1.765 us |
ThreadPoolWithLock | 8,147.0 us | 148.396 us | 138.810 us |
The measuring unit ‘us’ stands for microseconds: 1000 us = 1 millisecond.
Well, that didn’t do so well. It seems like either the overhead of creating threads or, more likely, the locking overhead is really hurting performance.
Maybe this can be improved with some optimizations.
Optimizing with Interlocked
The first possible optimization we can do is to replace the lock with the Interlocked helper class. Consider the expression total += something. It reads from total, adds the value, and then writes back to total; that’s 2 separate operations on the total variable. So with multiple threads, a different thread might operate on total between the read and the write, corrupting the data. Interlocked.Add(ref total, something) guarantees that this will be a single atomic operation (in-depth explanation here).
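As a quick stand-alone illustration (my own sketch, not part of the benchmarks), here’s how a plain += loses updates under contention while Interlocked.Add doesn’t:

using System;
using System.Threading;
using System.Threading.Tasks;

class RaceDemo
{
    static long _plain = 0;
    static long _atomic = 0;

    static void Main()
    {
        // 4 tasks, each adding 1 a million times
        var tasks = new Task[4];
        for (int t = 0; t < 4; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < 1000000; i++)
                {
                    _plain += 1;                     // read-modify-write: updates can be lost
                    Interlocked.Add(ref _atomic, 1); // atomic: never loses an update
                }
            });
        }
        Task.WaitAll(tasks);
        Console.WriteLine($"plain:  {_plain}");  // very likely less than 4,000,000
        Console.WriteLine($"atomic: {_atomic}"); // always exactly 4,000,000
    }
}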
Here’s the implementation:
[Benchmark]
public long ThreadPoolInterlocked()
{
    long total = 0;
    int threads = 8;
    var partSize = ITEMS / threads;
    Task[] tasks = new Task[threads];
    for (int iThread = 0; iThread < threads; iThread++)
    {
        var localThread = iThread;
        tasks[localThread] = Task.Run(() =>
        {
            for (int j = localThread * partSize; j < (localThread + 1) * partSize; j++)
            {
                // Atomic add replaces the lock
                Interlocked.Add(ref total, arr[j]);
            }
        });
    }
    Task.WaitAll(tasks);
    return total;
}
Comparing it to the previous implementation with lock gives the following results (for 100,000 Items):
Method | Mean | Error | StdDev | Median |
---|---|---|---|---|
RegularIteration | 159.3 us | 3.136 us | 4.694 us | 160.9 us |
ThreadPoolWithLock | 8,271.0 us | 157.944 us | 188.021 us | 8,310.9 us |
ThreadPoolInterlocked | 26,994.1 us | 2,702.796 us | 7,489.435 us | 29,745.2 us |
Wow, it actually got much worse when using Interlocked. It seems that the further I try to optimize, the worse the performance gets. I should probably stop while I can.
Actually, the reason why this happens is quite sneaky. If you followed closely, you might have noticed that I’m running in a 32-bit process but using a long variable for the total sum. Since long is a 64-bit variable and the process is 32-bit, the Interlocked operation is very expensive. Let’s run the same benchmark in a 64-bit process:
Method (64-bit) | Mean | Error | StdDev |
---|---|---|---|
RegularIteration | 67.34 us | 1.297 us | 1.442 us |
ThreadPoolWithLock | 7,895.20 us | 156.038 us | 179.694 us |
ThreadPoolInterlocked | 3,107.76 us | 20.343 us | 19.029 us |
In a 64-bit process, the Interlocked solution is more than twice as fast as the lock solution. And the regular-iteration time is also reduced. It seems that when working with long variables, performance is much better in 64-bit processes. All the benchmarks from now on will run in a 64-bit process.
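As an aside (my own addition), if you’re not sure what bitness your process is running at, the runtime can tell you:

// True in a 64-bit process, false in a 32-bit one
Console.WriteLine(Environment.Is64BitProcess);
// And whether the OS itself is 64-bit
Console.WriteLine(Environment.Is64BitOperatingSystem);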
So using Interlocked helped a bit, but the result is still far worse than the regular iterative approach.
If you’re wondering at this point about the volatile keyword, it actually doesn’t help. The issue is that we both read and write from different threads. volatile might work for read-only or write-only scenarios, but using it without an additional lock here will give the wrong result. You can read more in Joseph Albahari’s explanation.
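To make that concrete, here’s a small sketch of my own, using an int field since C# doesn’t even allow the volatile modifier on long. The += is still a separate read and a separate write, so updates get lost despite volatile:

using System.Threading.Tasks;

class VolatileDemo
{
    // volatile guarantees visibility of individual reads and writes,
    // but it does NOT make the read-modify-write of += atomic.
    private volatile int _total = 0;

    public int SumInParallel(int[] arr)
    {
        var tasks = new Task[4];
        int partSize = arr.Length / 4;
        for (int t = 0; t < 4; t++)
        {
            var local = t;
            tasks[local] = Task.Run(() =>
            {
                for (int j = local * partSize; j < (local + 1) * partSize; j++)
                {
                    _total += arr[j]; // still racy: another thread can write between our read and write
                }
            });
        }
        Task.WaitAll(tasks);
        return _total; // will usually be smaller than the real sum
    }
}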
Optimizing with Parallel.For
Before going any further, let’s try the fancy Parallel.For feature that was added as part of the Task Parallel Library (TPL) in .NET 4.0. This is an optimized way to run for loops on multiple threads. The code is:
[Benchmark]
public long ParallelFor()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts, new ParallelOptions(), (iter) =>
    {
        for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
        {
            Interlocked.Add(ref total, arr[j]);
        }
    });
    return total;
}
Comparing it with the baseline and the previous interlocked implementation:
Method | Mean | Error | StdDev |
---|---|---|---|
RegularIteration | 69.43 us | 1.373 us | 1.969 us |
ThreadPoolInterlocked | 3,117.44 us | 13.520 us | 12.646 us |
ParallelFor | 3,028.13 us | 55.719 us | 52.119 us |
Note that Parallel.For doesn’t promise to run all the iterations in parallel. For example, if you were to set the number of parts to 100 on a PC with only 8 cores, that would be impossible. It does, however, optimize to run in parallel as much as possible. You can also cap the degree of parallelism with an additional configuration, like this: Parallel.For(from, to, new ParallelOptions() { MaxDegreeOfParallelism = 4 }, ...).
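Spelled out in full, that configuration looks like this (my own sketch, reusing total, parts, partSize, and arr from the benchmark above):

// Cap the loop at 4 concurrent workers, regardless of core count
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.For(0, parts, options, iter =>
{
    for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
    {
        Interlocked.Add(ref total, arr[j]);
    }
});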
Alright, so while the syntax is much nicer with Parallel.For, it seems to be just slightly more efficient than manually queuing thread-pool threads. And still far less efficient than just iterating over the array and summing up the items.
Eliminating Suspects
The biggest suspect for hurting performance is the Interlocked.Add operation. Let’s just comment it out and see what changes. Instead of Interlocked.Add(ref total, arr[j]), I replaced it with a single line that reads from the array: x = arr[j].
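Reconstructed from that description, the loop body becomes roughly this:

// Same partitioning as before, but no shared state is touched
for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
{
    var x = arr[j]; // read-only: just load the element
}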
Here are the results:
Method | Mean | Error | StdDev |
---|---|---|---|
RegularIteration | 57.76 us | 0.8526 us | 0.7976 us |
ThreadPoolInterlocked | 34.21 us | 0.3934 us | 0.3487 us |
ParallelFor | 33.74 us | 0.6407 us | 0.7379 us |
Alright, now we’re getting somewhere. Without the Interlocked line of code, the parallel code runs almost 2 times faster than the baseline, with both Parallel.For and the thread-pool strategy.
Let’s do some brainstorming. How can we sum the elements of an array without an Interlocked expression?
In the specific case of summing array elements, this can easily be done with per-thread totals. We’ll keep a local sum for each thread and just add them up at the end. Parallel.For even has helpful functionality for exactly that: an overload with a parameter called localFinally.
Parallel.For with Local Thread Totals
The idea is that each iteration will return a value. An additional delegate, Action&lt;TLocal&gt;, will execute at the end of each iteration with that value as a parameter. In our case, there are 8 iterations, 1 for each thread. Each iteration will return a localTotal, which represents the sum of the elements in that part of the array, and the localFinally delegate will add those up. Here’s the code:
[Benchmark]
public long ParallelForWithLocalFinally()
{
    long total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    var parallel = Parallel.For(0, parts,
        localInit: () => 0L, // initializes the per-thread "localTotal"
        body: (iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                localTotal += arr[j]; // no shared state in the hot loop
            }
            return localTotal;
        },
        // localFinally can run concurrently on several threads,
        // so the final merge still has to be atomic
        localFinally: (localTotal) => Interlocked.Add(ref total, localTotal));
    return total;
}
Now when running this against the simple regular-iteration sum:
Method | Mean | Error | StdDev |
---|---|---|---|
RegularIteration | 66.12 us | 0.8337 us | 0.7391 us |
ParallelFor | 3,058.63 us | 58.6600 us | 65.2005 us |
ParallelForWithLocalFinally | 36.23 us | 0.6024 us | 0.5635 us |
Great success! We were able to cut processing time by about 50%. Not by a factor of the core count, but still, mission accomplished.
Going Deeper
There are still some questions unanswered. Why is the parallelism 2 times faster and not 8, like the number of my cores? What happens with bigger and smaller arrays? Does CPU memory cache play any role in this?
Maybe some more benchmarks will help us understand. Let’s try with different numbers of items:
Method | ITEMS | Mean | Error | StdDev |
---|---|---|---|---|
RegularIteration | 5000 | 3.312 us | 0.0642 us | 0.0789 us |
ParallelForWithLocalFinally | 5000 | 9.630 us | 0.2453 us | 0.7234 us |
RegularIteration | 10000 | 8.157 us | 0.1588 us | 0.2121 us |
ParallelForWithLocalFinally | 10000 | 12.705 us | 0.2319 us | 0.2055 us |
RegularIteration | 30000 | 19.76 us | 0.2826 us | 0.2505 us |
ParallelForWithLocalFinally | 30000 | 16.90 us | 0.2695 us | 0.2521 us |
RegularIteration | 100000 | 65.61 us | 0.9373 us | 0.8768 us |
ParallelForWithLocalFinally | 100000 | 35.58 us | 0.6241 us | 0.5838 us |
RegularIteration | 250000 | 163.55 us | 1.910 us | 1.693 us |
ParallelForWithLocalFinally | 250000 | 75.91 us | 1.679 us | 1.866 us |
So in our scenario, at about 20,000 elements the regular-iteration strategy is as effective as the parallel strategy. As the number of elements grows, the parallel strategy becomes more effective. At 250,000 elements, it’s about 2.2 times faster than the regular-iteration strategy.
Before drawing any conclusions, let’s do another experiment. This time, instead of summing the elements as they are, we’ll convert each element to a string and then parse it back to an int. This operation is much heavier and should take more time. Here’s the modified regular-iteration method:
[Benchmark]
public double RegularIteration()
{
    double total = 0;
    for (int i = 0; i < ITEMS; i++)
    {
        // Much heavier per-element work: int -> string -> int
        total += int.Parse(arr[i].ToString());
    }
    return total;
}
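The parallel counterpart for this experiment isn’t shown here, but based on the local-totals benchmark above it presumably looks something like this (my own sketch; note that Interlocked.Add has no double overload, so the merge step reuses the _lock object from earlier):

[Benchmark]
public double ParallelForWithLocalFinally()
{
    double total = 0;
    int parts = 8;
    int partSize = ITEMS / parts;
    Parallel.For(0, parts,
        localInit: () => 0.0,
        body: (iter, state, localTotal) =>
        {
            for (int j = iter * partSize; j < (iter + 1) * partSize; j++)
            {
                localTotal += int.Parse(arr[j].ToString()); // the heavier per-element work
            }
            return localTotal;
        },
        localFinally: (localTotal) => { lock (_lock) { total += localTotal; } });
    return total;
}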
The results this time are:
Method | ITEMS | Mean | Error | StdDev | Median |
---|---|---|---|---|---|
RegularIteration | 1000 | 153.9 us | 3.021 us | 3.928 us | 153.0 us |
ParallelForWithLocalFinally | 1000 | 104.5 us | 3.240 us | 9.554 us | 107.6 us |
RegularIteration | 5000 | 781.1 us | 15.340 us | 25.203 us | 780.6 us |
ParallelForWithLocalFinally | 5000 | 412.2 us | 12.422 us | 36.625 us | 410.3 us |
RegularIteration | 10000 | 1,551.9 us | 29.551 us | 34.031 us | 1,548.4 us |
ParallelForWithLocalFinally | 10000 | 763.4 us | 21.766 us | 64.179 us | 769.7 us |
RegularIteration | 100000 | 17,349.3 us | 748.256 us | 2,206.251 us | 16,212.8 us |
ParallelForWithLocalFinally | 100000 | 6,238.6 us | 123.227 us | 280.650 us | 6,261.2 us |
As you can see, in this case the parallel solution is as much as 3 times faster than the regular-iteration solution for 100,000 elements. And it starts being more effective for smaller arrays, with as few as 1,000 elements.
Conclusions and Summary
I admit that instead of taking the direct route from the challenge to the solution, I took you through all my failed attempts along the way. Here’s what we learned:
- Optimizing with parallelism can definitely improve performance, but it can easily hurt it as well. It depends on a lot of factors, and each case should be measured and checked.
- When the various threads need to rely on each other with some kind of locking mechanism, the performance can degrade significantly.
- As the last benchmark proved, each use-case is different. For example, in the regular benchmark, parallelism became effective on more than 20,000 array elements. However, in the more complex scenario where we parsed integers to string and back, parallelism was effective even for 1000 elements.
- Interlocked performance depends on process bitness. In other words, don’t use Interlocked on long variables in 32-bit processes.
- Another thing to consider is the CPU cache. It’s hard to measure exactly how it affects performance, but it’s important to understand that caching happens under the hood, without our control or knowledge, and it’s more effective without parallelism. The fastest L1 cache is private to each CPU core, the mid-level L2 cache is typically also per-core (shared between pairs of cores on some designs), and only the slowest L3 cache is shared between all the cores. Note that even the “slow” L3 cache is much faster than RAM.
So the article is at an end, and I still haven’t fully figured out why the optimization wasn’t more successful. I did learn a few things, though, and I hope you did too. Feel free to share any insights in the comments section.