Server performance problems can happen for many different reasons: memory issues, slow database requests, and too few machines are just some of them. I've witnessed my fair share of problems and learned a few tricks along the way. In this article, I'll tell you about 10 types of issues that can cause performance problems in your server. That's not to say I've categorized all possible problem types, but these might give you some ideas and nudge you in the right direction next time you're digging into performance matters.
Here they are, in no particular order:
1. Slow database calls
A fast interaction with your database is probably the single most important thing for good performance. At least in most applications. Unfortunately, there are lots of things that can go wrong, and even innocent-looking implementations can cause problems. Here are some issues that can originate slow database requests and ruin your application’s performance:
- Bad indexing strategy
- Bad schema design
- Work done on the server instead of on the database. Like this:
```csharp
// Good
var adults = dbContext.Users.Where(user => user.Age >= 18);
var count = adults.Count();

// Bad
var adults = dbContext.Users.Where(user => user.Age >= 18).ToList();
var count = adults.Count;
```
The difference is the call to `.ToList()`, which tells Entity Framework to execute the query right then. As a result, all users are retrieved from the database and then counted in the server process. In contrast, in the first case, the SQL query will include a `COUNT` operation, and the work will be done on the database side.
- Database is far away from the server. It’s best to place the database geographically close to your servers, optimally in the same data center. Why pay the ping duration price on every request?
- Using your database in a way it wasn't built for. Not all databases are the same. Some are ideal for key-value storage, others are great for transactions, others still are perfect for storing logs. Use the best database for your needs. For example, a document-based database like MongoDB isn't great with `JOIN` operations, and using them can hurt performance. But it's great for storing documents with a lot of data. So one solution might be to duplicate information between documents instead of using a `JOIN` (use with care).
- The database doesn't have enough resources. While scaling your servers is obvious, don't forget that databases need to scale as well. In Azure SQL Server, for example, you'll have to keep an eye on DTUs (Database Transaction Units). In other databases, you'll want to keep an eye on storage, RAM, network, and CPU. In my experience, you don't always get a nice and clear alert when getting close to the limit. Things just start to get slow, leaving you wondering what the heck is happening.
- Inefficient queries are always a possibility. If you’re using Entity Framework, the generated SQL isn’t always optimal.
- Needing to re-establish the connection every time. If your DB connections aren’t properly pooled, you can find yourself re-establishing connections for each query.
- Consider stored procedures when a complicated query takes a lot of time. Use with care.
- Bad sharding strategy. Take care to group related data in the same shard, or you risk querying multiple shards in a single request.
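To illustrate the duplication idea from the MongoDB bullet above, here's a minimal sketch (the `Customer` and order shapes are hypothetical, not from any real schema): instead of storing only a customer ID and joining at query time, the denormalized order document carries a copy of the fields it needs for display.

```csharp
// Normalized: reading an order's customer name requires a second lookup (a "join").
class Customer { public string Id; public string Name; }
class OrderNormalized { public string Id; public string CustomerId; }

// Denormalized: the order duplicates the customer's name, so a single
// document read is enough. The cost: changing a customer's name now
// means updating every order that embeds it (hence "use with care").
class OrderDenormalized
{
    public string Id;
    public string CustomerId;   // still kept, for updates
    public string CustomerName; // duplicated for read performance
}
```

This is the classic read-vs-write tradeoff: reads get cheaper, writes and consistency get harder.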
The hardest part of solving these problems is to identify them in the first place. There are many tools to see how your requests perform in production. Usually, the database itself will be able to show slow queries, scaling problems, network reaching its limits, etc. APM solutions like Application Insights show this very well. It’s also pretty simple to add request execution time to your logs and build queries and automation around that.
2. Memory Pressure
One of the most common offenders in high-throughput servers is memory pressure. In this state, the garbage collector doesn't keep up with memory allocations and deallocations. When the GC is under pressure, your server spends more time performing garbage collection and less time executing code.
This state can happen in several cases. The most common is when your memory capacity runs out. As you approach the memory limit, the garbage collector panics and initiates frequent full GC collections (those are the expensive ones). But why does this happen in the first place? Why is your memory getting close to its limit? The reason is usually poor cache management or a memory leak. This is pretty easy to diagnose with a memory profiler: capture a memory snapshot and check what's eating up all the bytes.
The important thing is to realize that you have a memory problem in the first place. The easiest way to find that out is with performance counters.
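Before reaching for a full profiler, .NET itself exposes some of these numbers. A minimal sketch (plain console code, not from the original article):

```csharp
using System;

class GcStats
{
    static void Main()
    {
        // Allocate something so the numbers are non-trivial.
        var junk = new byte[10_000_000];

        // Total managed heap size (forcing a collection for an accurate figure).
        long totalBytes = GC.GetTotalMemory(forceFullCollection: true);

        // How many collections each generation has had. A rapidly growing
        // gen-2 count is a classic symptom of GC pressure.
        Console.WriteLine($"Heap: {totalBytes} bytes");
        Console.WriteLine($"Gen0: {GC.CollectionCount(0)}, Gen1: {GC.CollectionCount(1)}, Gen2: {GC.CollectionCount(2)}");

        GC.KeepAlive(junk);
    }
}
```

Logging these periodically, or watching the equivalent performance counters, is often enough to tell you that a memory problem exists before you dig in with a profiler.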
3. No caching
Caching can be a great optimization technique. The canonical example is that when a client sends a request, the server can save the result in cache. When the client sends the same request again (might be a different client or the same one), the server doesn’t need to query the database again or do any sort of calculation to get the result. It just retrieves it from cache.
A simple example of this is when you search for something on Google. If it's a common search, it's probably asked for many times each day. There's no point in redoing whatever magic Google does to get the first page with the same 10 results. It can be retrieved from the cache.
The tradeoff is that cache adds complexity. For one thing, you need to invalidate that cache every once in a while. In the case of Google search, consider that when searching for the news, you can’t return the same result forever. Another issue with cache is that if not managed correctly, it can bloat and cause memory problems.
If you’re using ASP.NET, there are excellent cache implementations that do most of the work for you.
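For illustration, here's a tiny self-rolled cache with absolute expiration. It's a sketch, not one of the ASP.NET implementations mentioned above, which you should prefer in real code:

```csharp
using System;
using System.Collections.Concurrent;

class SimpleCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, (TValue Value, DateTime Expires)> _entries = new();
    private readonly TimeSpan _ttl;

    public SimpleCache(TimeSpan ttl) => _ttl = ttl;

    // Returns the cached value, or computes and stores it on a miss or expiry.
    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        if (_entries.TryGetValue(key, out var entry) && entry.Expires > DateTime.UtcNow)
            return entry.Value;

        var value = factory(key);
        _entries[key] = (value, DateTime.UtcNow + _ttl);
        return value;
    }
}
```

Note that expired entries are never evicted, only overwritten, so this dictionary grows without bound. That's exactly the "bloat and cause memory problems" failure mode described above, and it's why production caches need size limits and eviction policies.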
4. Non-optimal GC mode
The .NET garbage collector has two different modes: Workstation GC mode and Server GC mode. The former is optimized for a quick response with minimal resource usage and the latter for high throughput.
The .NET runtime sets the mode by default to Workstation GC in desktop apps and Server GC in servers. This default is almost always best. In the case of a server, the GC will use much more machine resources but will be able to handle a bigger throughput. In other words, the process will have more threads dedicated to garbage collection and it will be able to deallocate more bytes per second.
For whatever reason, your server may be running in Workstation mode, and changing to Server mode will improve performance. In rare cases, you might want to set a server's GC mode to Workstation, which may be reasonable if you want the server to consume fewer machine resources (CPU & RAM).
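You can verify which mode a process is actually running in at runtime. A small sketch:

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // True when the process runs with Server GC. If you need to change
        // the mode in a .NET (Core) project, you can set
        //   <ServerGarbageCollection>true</ServerGarbageCollection>
        // in the .csproj, or "System.GC.Server": true in runtimeconfig.json.
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
    }
}
```

Checking `GCSettings.IsServerGC` in a startup log line is a cheap way to catch a misconfigured deployment.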
5. Unnecessary client requests
Sometimes, it’s possible to significantly reduce the number of client requests. Reduce that number, and you can have fewer server machines or just less load for the existing ones. Here are a few ways to do that:
- Throttling – Consider the auto-complete mechanism when searching in Google. When you start typing letters, Google shows a drop-down with the most common searches starting with those letters. To get those auto-complete values, Google has to retrieve them from a server. So let's say you're typing "Tabs vs spaces". Google could send a request on every keystroke: on "T", "Ta", "Tab", and so on. But it doesn't need to. It can implement a simple throttling mechanism that waits until you stop typing for 500 milliseconds and then sends a single request.
- Client-side caching – Continuing with our Google search auto-complete example, there are a lot of searches that start with the same words, e.g. "Why is", "Should I", "Where are", etc. Instead of sending requests for those every time, Google might save the most common auto-complete results on the client side in advance, saving unnecessary requests.
- Batching – Let’s assume for a second that Google spies on user activity to take advantage of personalized data (preposterous?). When using Gmail, it might want to send telemetry data every time you’re reading an email and hovering with the mouse on a certain word. Google can send a request for every such occurrence, but it will be more efficient to save a bunch of those occurrences and then send them in a single request.
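The throttling idea above can be sketched as a small debouncer (a hypothetical helper, written in C# for consistency with the rest of this article, though a browser would do this in JavaScript): each new call resets the delay, and the action only fires once the input has been quiet for the given interval.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Debouncer
{
    private readonly TimeSpan _delay;
    private CancellationTokenSource _cts = new();

    public Debouncer(TimeSpan delay) => _delay = delay;

    // Each call cancels the previously scheduled action, so in a burst of
    // calls only the last one (after _delay of quiet time) actually runs.
    // Note: a sketch; not hardened for concurrent callers.
    public void Run(Action action)
    {
        _cts.Cancel();
        _cts = new CancellationTokenSource();
        Task.Delay(_delay, _cts.Token).ContinueWith(
            t => { if (!t.IsCanceled) action(); });
    }
}
```

Wiring the keystroke handler to `debouncer.Run(() => SendAutoCompleteRequest(text))` collapses a burst of keystrokes into one request.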
6. Request hangs
In certain conditions, requests become hung. That is, you send a request but never receive a response. Or rather you eventually receive a timeout response. This can happen, for example, if there’s a deadlock in the code that handles the request. Or if you have some kind of an infinite loop, which is called a CPU-bound hang. Or if you’re waiting indefinitely for something that never comes—like a message from a queue, a long database response, or a call to another service.
Under the hood, when a request hang happens, it hangs one or more threads. But the application will keep functioning, serving new requests with other threads. Assuming the hang reproduces on additional requests, more threads will hang over time. The consequences depend on the cause of the hang. If it's a CPU-bound hang, like an infinite loop, the CPU cores will max out pretty quickly, which will make the system crawl, resulting in very slow requests. Eventually, IIS will start returning 503 error responses (Service Unavailable). If the cause is a deadlock, it will gradually lead to memory and performance issues as well, and eventually to the same result: very slow requests and 503 errors.
So request hangs can be pretty devastating to your server’s performance if they keep happening.
The solution to this is to solve the core cause of the problem. You’ll have to first identify that there are indeed hangs and then take steps to debug those hangs.
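One defensive measure, while you track down the root cause, is to never wait indefinitely on an external dependency. A minimal sketch of wrapping an awaited call with a deadline (the wrapped service call here is hypothetical):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TimeoutExample
{
    static async Task Main()
    {
        // A call that completes well within the deadline.
        int result = await WithTimeout(Task.FromResult(42), TimeSpan.FromSeconds(1));
        Console.WriteLine($"Got {result}");
    }

    // Races the work against a timer, so the request fails fast instead of hanging.
    static async Task<T> WithTimeout<T>(Task<T> work, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource();
        var finished = await Task.WhenAny(work, Task.Delay(timeout, cts.Token));
        if (finished != work)
            throw new TimeoutException("The operation did not complete in time.");
        cts.Cancel();      // stop the pending timer task
        return await work; // propagate the result (or the task's own exception)
    }
}
```

A timeout converts a silent hang into a visible, loggable failure, which makes the underlying problem much easier to find. Many .NET APIs also accept a `CancellationToken` directly, which is preferable when available because it actually stops the work.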
7. Server crashes
Like hangs, crashes can manifest as a performance problem.
When does an ASP.NET server crash, though? When a regular exception happens during a request, the application won't crash. The server returns a 500 error response, and everything continues as usual. But a crash might happen if an exception occurs outside of a request context, like in a thread you started yourself. Other than that, there are catastrophic exceptions like `ExecutionEngineException` and, my favorite, `StackOverflowException`. Those will crash the process no matter how many `catch` clauses you place.
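The "thread you started yourself" case deserves emphasis. Here's a sketch (not from the original article) showing that a try/catch inside the thread's own body is all that stands between an exception there and a process crash; ASP.NET's per-request error handling won't see it.

```csharp
using System;
using System.Threading;

class BackgroundThreadSafety
{
    static void Main()
    {
        var thread = new Thread(() =>
        {
            try
            {
                throw new InvalidOperationException("boom");
            }
            catch (Exception ex)
            {
                // Without this catch, the unhandled exception would
                // terminate the whole process, not just this thread.
                Console.WriteLine($"Logged instead of crashing: {ex.Message}");
            }
        });
        thread.Start();
        thread.Join();
        Console.WriteLine("Process still alive.");
    }
}
```

The same reasoning applies to unawaited tasks and timer callbacks: anything running outside a request needs its own top-level exception handling.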
When an ASP.NET application that's hosted in IIS crashes, the server will be down temporarily. IIS will perform an application pool recycle, which restarts your server and returns it to business as usual. The effect for the client will be temporarily slow requests or 503 errors.
Depending on your application, one crash might not be the end of the world, but repeated crashes will make the server very slow, with the real cause masquerading as a performance problem. The solution, of course, is to deal with the root cause of the crash.
8. Forgetting to scale
This problem is pretty obvious but I’ll mention it nevertheless. As your application usage starts to grow, you have to consider how to handle a bigger throughput.
The solution is to scale of course. There are two ways you can do that—vertical scaling (aka scaling up) and horizontal scaling (aka scaling out). Vertical scaling means adding more power to your machines like more CPU and RAM, while horizontal scaling means adding more machines.
Cloud providers usually offer some kind of easy automatic scaling, which is worth considering.
9. Major functionality around every request
It’s pretty common to decorate your requests with additional functionality. These might come in the form of ASP.NET Core middleware or Action filters. The functionality might be telemetry, authorization, adding response headers, or something else entirely. Take extra notice of these pieces of code because they are executed for every request.
Here’s an example of something I experienced myself. The server in question included a middleware that would check on each request if the user had a valid license. This involved sending a request to Identity Server and a database call. So each request had to wait for those responses, adding a bunch of time and adding more load on both the Identity Server and the database. The solution was a simple cache mechanism that kept the license information in-memory for a day.
If you have similar functionality, caching might be an option. Another option may be to do things in batches, e.g. send telemetry once every 1,000 events instead of on every single one. Or place messages in a queue, making this functionality asynchronous.
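The batching option can be sketched like this (the flush target is a hypothetical callback; a real implementation would also flush on a timer and on shutdown so a slow trickle of events isn't lost):

```csharp
using System;
using System.Collections.Generic;

class TelemetryBatcher
{
    private readonly List<string> _buffer = new();
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<string>> _flush;
    private readonly object _gate = new();

    public TelemetryBatcher(int batchSize, Action<IReadOnlyList<string>> flush)
    {
        _batchSize = batchSize;
        _flush = flush;
    }

    // Buffers events in memory; only every _batchSize-th call pays the
    // cost of a flush (e.g. a network round trip).
    public void Track(string evt)
    {
        List<string>? toSend = null;
        lock (_gate)
        {
            _buffer.Add(evt);
            if (_buffer.Count >= _batchSize)
            {
                toSend = new List<string>(_buffer);
                _buffer.Clear();
            }
        }
        if (toSend != null) _flush(toSend); // flush outside the lock
    }
}
```

With a batch size of 1,000, the per-request cost drops to an in-memory list append, which is exactly the kind of cheap per-request decoration this section argues for.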
10. Synchronous vs Asynchronous
Whenever your server sends a request to some service and needs to wait for a response, there’s a risk. What if that other service is busy, handling a big queue of requests? What if it has a performance problem that you have to transitively suffer as well?
The basic pattern to deal with this is to change the synchronous call to asynchronous. This is usually done with a queue service like Kafka or RabbitMQ (Azure has queue services as well). Instead of sending a request and waiting for a response, you would send a message to such a queue. The other service will pull these messages and handle them.
What if you need a response, though? Then instead of waiting for one, the other service will send a message with the response to the same queue. You’ll be pulling messages from the queue as well, and when the response arrives you can handle it as needed, outside of the context of the original request. If you need to send the response to a client, you can use push notification with something like SignalR.
The nice thing about this pattern is that the system components never actively wait for services. Everything is handled asynchronously instead of synchronously. Another advantage is that services can be much more loosely coupled with each other.
The drawback is that this is much more complicated. Instead of a simple request to a service you need to introduce a queue, pushing and pulling messages, and dealing with things like service crashes when a message was pulled but not yet handled.
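In-process, the shape of this pattern can be sketched with System.Threading.Channels standing in for Kafka or RabbitMQ (the message strings here are made up for illustration):

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class QueuePatternSketch
{
    static async Task Main()
    {
        // The channel plays the role of the message queue.
        var queue = Channel.CreateUnbounded<string>();

        // Consumer: the "other service" pulls messages and handles them,
        // at its own pace, outside the context of any request.
        var consumer = Task.Run(async () =>
        {
            await foreach (var msg in queue.Reader.ReadAllAsync())
                Console.WriteLine($"Handled: {msg}");
        });

        // Producer: instead of calling the service and waiting, just enqueue.
        await queue.Writer.WriteAsync("resize-image:42");
        await queue.Writer.WriteAsync("send-email:43");
        queue.Writer.Complete(); // no more messages

        await consumer;
    }
}
```

The producer returns as soon as the message is enqueued; nothing actively waits on the consumer. A real message broker adds what a channel can't: durability across crashes and delivery across process boundaries.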
Many things can mess up your server's performance, and there's a lot of room for error. I think there are no tricks or shortcuts to building a fast and robust system. You need careful planning, experienced engineers, and a big buffer of time for the things that will go wrong. And they will, which brings up the next most important skill: debugging those problems. Detecting the issue and finding the root cause is usually 90% of the work. In the case of server problems, many tools can help, like performance counters, APM tools, performance profilers, and others. If you want to find out more, you can check out my book, Practical Debugging for .NET Developers, which explains how to troubleshoot performance problems.
That’s it for now, cheers.
Want to become an expert problem solver? Check out a chapter from my book Practical Debugging for .NET Developers