Performance Showdown of Producer/Consumer (Job Queues) Implementations in C# .NET


I recently wrote 3 blog posts ([1] [2] [3]) on different Producer/Consumer (Job Queues) implementations. There are a lot of great ways to implement Job Queues in C#, but which should you choose? Which one is better, faster, and more versatile?

In this article, I want to get to the point where you can make a confident decision on which implementation to choose. That means checking performance and comparing customization options.

The implementations we covered were:

  • Blocking collection Queue (Part 1)
  • Thread-pool on demand (aka no-dedicated-thread-queue) (Part 1)
  • System.Threading.Channels (Part 2)
  • Reactive Extensions (Part 2)
  • TPL Dataflow (Part 3)

And we’re going to do the following tests:

  • Compare performance of single job to completion
  • Compare performance of 100,000 jobs-to-completion
  • Compare available customizations

To make matters simple, I’ll use a basic implementation of each type, with a single thread handling the jobs.

The Code

This code is for the simplest implementation of each type:

BlockingCollection Queue:
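A minimal sketch of this implementation (class and member names are my assumptions, in the spirit of Part 1): a single dedicated thread consumes jobs from a BlockingCollection until the queue is completed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public class BlockingCollectionQueue
{
    private readonly BlockingCollection<Action> _jobs = new BlockingCollection<Action>();

    public BlockingCollectionQueue()
    {
        var thread = new Thread(OnHandlerStart) { IsBackground = true };
        thread.Start();
    }

    public void Enqueue(Action job) => _jobs.Add(job);

    public void Stop() => _jobs.CompleteAdding();

    private void OnHandlerStart()
    {
        // GetConsumingEnumerable blocks until a job is available
        // and ends once CompleteAdding has been called.
        foreach (var job in _jobs.GetConsumingEnumerable())
        {
            job.Invoke();
        }
    }
}
```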

Thread-pool on demand (aka no-dedicated-thread-queue):
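A sketch of the no-dedicated-thread approach (names assumed): a thread-pool delegate is queued only when there is work, and it releases the pool thread once the queue drains.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class NoDedicatedThreadQueue
{
    private readonly Queue<Action> _jobs = new Queue<Action>();
    private bool _delegateQueuedOrRunning;

    public void Enqueue(Action job)
    {
        lock (_jobs)
        {
            _jobs.Enqueue(job);
            // Schedule a drainer only if none is queued or running.
            if (!_delegateQueuedOrRunning)
            {
                _delegateQueuedOrRunning = true;
                ThreadPool.QueueUserWorkItem(ProcessQueuedItems);
            }
        }
    }

    private void ProcessQueuedItems(object state)
    {
        while (true)
        {
            Action job;
            lock (_jobs)
            {
                if (_jobs.Count == 0)
                {
                    _delegateQueuedOrRunning = false;
                    return; // nothing left; give the pool thread back
                }
                job = _jobs.Dequeue();
            }
            job.Invoke();
        }
    }
}
```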

Reactive Extensions (Rx):
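A sketch using the System.Reactive NuGet package (names assumed): a Subject<Action> is the producer side, and jobs are observed one at a time on a single EventLoopScheduler thread.

```csharp
using System;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Reactive.Subjects;

public class RxQueue : IDisposable
{
    private readonly Subject<Action> _jobs = new Subject<Action>();
    private readonly IDisposable _subscription;

    public RxQueue()
    {
        _subscription = _jobs
            .ObserveOn(new EventLoopScheduler()) // one dedicated worker thread
            .Subscribe(job => job.Invoke());
    }

    public void Enqueue(Action job) => _jobs.OnNext(job);

    public void Dispose()
    {
        _jobs.OnCompleted();
        _subscription.Dispose();
    }
}
```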

System.Threading.Channels Queue:
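A sketch of the Channels version (names assumed), incorporating the SingleReader and TryRead corrections credited below: one reader task drains an unbounded channel.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public class ChannelsQueue
{
    private readonly ChannelWriter<Action> _writer;

    public ChannelsQueue()
    {
        var channel = Channel.CreateUnbounded<Action>(
            new UnboundedChannelOptions { SingleReader = true });
        var reader = channel.Reader;
        _writer = channel.Writer;

        Task.Run(async () =>
        {
            // WaitToReadAsync completes with false once the writer is completed.
            while (await reader.WaitToReadAsync())
            {
                while (reader.TryRead(out var job))
                {
                    job.Invoke();
                }
            }
        });
    }

    public void Enqueue(Action job) => _writer.TryWrite(job);

    public void Stop() => _writer.Complete();
}
```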

TPL Dataflow Queue:
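A sketch using the System.Threading.Tasks.Dataflow package (names assumed): an ActionBlock is itself a job queue, here constrained to a single worker.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public class TPLDataflowQueue
{
    private readonly ActionBlock<Action> _jobs;

    public TPLDataflowQueue()
    {
        // MaxDegreeOfParallelism = 1 makes this a single-consumer queue.
        _jobs = new ActionBlock<Action>(job => job.Invoke(),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
    }

    public void Enqueue(Action job) => _jobs.Post(job);

    public void Stop() => _jobs.Complete();
}
```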

First Benchmark: Time to getting a single job done

The first thing I want to measure is initializing the Job Queue, enqueuing one job, waiting for it to finish, and completing the queue. It’s easy to do with the following code:
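A sketch of what that benchmark can look like with BenchmarkDotNet (the IJobQueue interface and method names are my assumptions, not necessarily the original code):

```csharp
using System;
using System.Threading;
using BenchmarkDotNet.Attributes;

// Assumed common interface over the five implementations.
public interface IJobQueue
{
    void Enqueue(Action job);
    void Stop();
}

public class SingleJobBenchmarks
{
    // One [Benchmark] method per implementation, e.g.:
    // [Benchmark] public void BlockingCollectionQueue() => DoOneJob(new BlockingCollectionQueue());

    public static void DoOneJob(IJobQueue jobQueue)
    {
        var autoResetEvent = new AutoResetEvent(false);
        jobQueue.Enqueue(() => autoResetEvent.Set());
        autoResetEvent.WaitOne(); // block until the single job has run
        jobQueue.Stop();
    }
}
```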

For all Benchmarks, I use the excellent BenchmarkDotNet library. My PC is: Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores. The host is .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.8.3745.0.

The last method, DoOneJob, is the interesting one. I use an AutoResetEvent to signal that the job was done and to stop the job queue.

The results are:

                 Method |       Mean |     Error |    StdDev |
----------------------- |-----------:|----------:|----------:|
BlockingCollectionQueue | 215.295 us | 4.1643 us | 5.4148 us |
 NoDedicatedThreadQueue |   7.536 us | 0.1458 us | 0.1432 us |
                RxQueue | 204.700 us | 4.0370 us | 5.6594 us |
          ChannelsQueue |  18.655 us | 2.0949 us | 1.8571 us |
       TPLDataflowQueue |  18.773 us | 0.4318 us | 1.2730 us |
The measuring unit ‘us’ stands for microseconds. 1000 us = 1 millisecond
Thanks to Azik and rendlelabs for correcting my System.Threading.Channels implementation.

As you can see, NoDedicatedThreadQueue is the fastest, which is no wonder because it does the bare minimum.

The second and third fastest are ChannelsQueue and TPLDataflowQueue, about 11 times faster than the remaining implementations.

The most important thing to note here is that creating new Job Queues usually happens rarely, maybe once in an application lifespan, so 200 microseconds (1/5 of one millisecond) is not much.

Second Benchmark: Getting 100,000 jobs done

Initialization can happen only once, so the real test is to see if there’s any substantial difference when dealing with high-frequency jobs.

Testing this benchmark can be done in a similar manner as before with the following code:
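A sketch of that benchmark (again with assumed names, including the IJobQueue interface): enqueue 100,000 empty jobs and let the last one signal completion, so the measurement covers the whole batch.

```csharp
using System;
using System.Threading;
using BenchmarkDotNet.Attributes;

// Assumed common interface (repeated here for completeness).
public interface IJobQueue
{
    void Enqueue(Action job);
    void Stop();
}

public class ManyJobsBenchmarks
{
    private const int JobsCount = 100_000;

    // One [Benchmark] method per implementation, e.g.:
    // [Benchmark] public void ChannelsQueue() => DoManyJobs(new ChannelsQueue());

    public static void DoManyJobs(IJobQueue jobQueue)
    {
        var autoResetEvent = new AutoResetEvent(false);
        for (int i = 0; i < JobsCount - 1; i++)
        {
            jobQueue.Enqueue(() => { });
        }
        jobQueue.Enqueue(() => autoResetEvent.Set()); // the last job signals done
        autoResetEvent.WaitOne();
        jobQueue.Stop();
    }
}
```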

The results for 100,000 jobs were:

                 Method |      Mean |     Error |    StdDev |
----------------------- |----------:|----------:|----------:|
BlockingCollectionQueue | 23.045 ms | 0.5046 ms | 0.4473 ms |
 NoDedicatedThreadQueue |  7.770 ms | 0.1553 ms | 0.1964 ms |
                RxQueue | 10.478 ms | 0.2053 ms | 0.3430 ms |
          ChannelsQueue |  5.661 ms | 0.9099 ms | 2.6687 ms |
       TPLDataflowQueue |  6.924 ms | 0.1334 ms | 0.1310 ms |

System.Threading.Channels is in first place with 5.7 milliseconds. TPL Dataflow is (surprisingly) in second place with 6.9 milliseconds, beating No-Dedicated-Thread-Queue (7.8 ms) by about 11%.

BlockingCollection is the slowest with 23 milliseconds, about 4 times slower than Channels.

In many cases, these performance differences will not matter because the Job Queue time will be negligible in comparison to the job execution time. However, this can be important when you’re dealing with high-frequency short execution jobs.

Showdown Summary

Summing things up from the benchmarks, here’s a visualization:

The fastest overall implementations turned out to be System.Threading.Channels, no-dedicated-thread-queue, and TPL Dataflow.

Performance is not always the most important factor, though. Perhaps more important than speed is whether an implementation natively supports (with relative ease) the customizations you might want for your specific application. Here are some common Job Queue variations:

  • Handling jobs in multiple threads, instead of just one thread
  • Prioritizing jobs
  • Having different handlers for different types of job (publisher/subscriber)
  • Limiting Job Queue capacity (Bound capacity)
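As an example of the first variation, multi-threaded handling is just a configuration option in some implementations. A sketch with TPL Dataflow (the class name and shape are mine, not from the original posts):

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public class ParallelJobQueue
{
    private readonly ActionBlock<Action> _jobs;

    public ParallelJobQueue(int maxParallelism)
    {
        // The only change from the single-threaded version is this option:
        // MaxDegreeOfParallelism controls how many jobs run concurrently.
        _jobs = new ActionBlock<Action>(job => job.Invoke(),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallelism });
    }

    public void Enqueue(Action job) => _jobs.Post(job);

    public void Stop() => _jobs.Complete();
}
```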

No single implementation supports every customization. Not with reasonable effort, anyway. That’s why choosing an implementation will always have to be done according to your needs. Here’s a summary of which supports what:

Producer consumer customization table
  • * Priority Queue is possible by combining with BlockingCollection or by having a finite number of priority levels.
  • ** Publisher/Subscriber is possible by adding a casting wrapper around each Job.

To see how I constructed this table, you can read the original articles (Part 1, Part 2, and Part 3).

As you can see, there’s no clear winner when it comes to customization. So the decision on which producer/consumer implementation to choose is always “It depends”.

This is it for my Job Queue series, hope you enjoyed it. Any feedback in the comments section is welcome. I’ll probably write similar posts with other patterns like the Pipeline pattern in the near future, so stay tuned. Cheers.



7 thoughts on “Performance Showdown of Producer/Consumer (Job Queues) Implementations in C# .NET”

  1. There are multiple issues with your benchmark but the most important one is that the TPL Dataflow’s ActionBlock isn’t faster than Threading.Channels, it’s your test that is wrong:

    The ChannelsQueue’s code:

    1. public async void Enqueue(Action job). Why did you put async in there? If you simply remove it, it will double performance.

    2. You should be doing this:
    while (_reader.TryRead(out var job))
    instead of var job = await reader.ReadAsync();

    3. There is no need to use the LongRunning option. In fact, it adds a 500ms delay and affects your metrics because the IJobQueue implementation’s creation is inside your benchmark method. It should be created just once to allow the warmup to actually warm up.

    4. You need to pass new UnboundedChannelOptions() { SingleReader = true } to your channel creation, as this is what TPL Dataflow does with MaxDegreeOfParallelism set to 1 by default.

    5. Your test won’t work with MaxDegreeOfParallelism greater than 1 because you rely on processing order. You should be using a CountdownEvent, or simply inc/dec a value and poll it, to make it work.

    6. Why are you using the 32-bit LegacyJIT? Why not run the tests on different versions at least?

    Hope this helps.

  2. And in your “OneJob” test you aren’t measuring latency but object creation time, hence the thread-pool-based ones take nothing compared to the ones that need to create a thread (~200us overhead in your case)

    Rx is genuinely slow though. 🙂 You just need to add Buffer with some big number to ensure it actually queues messages rather than simply passing them through for fairness, but then it will be even slower.

  3. Maybe you can also add Hangfire to this test?
    I wanted to do so, but I’m struggling with it; it gets stuck at “OverheadJitting”

Comments are closed.