9 Best Practices to Safely Deploy and Keep Your Application Healthy at Scale

Disclaimer: This blog post was written by a human, with no AI-generated text.

An application’s code base is a living entity. It keeps growing, changing, and adapting. There’s always a new feature to add, more bugs to solve, and new bugs that are created as a result. As the teams grow, the code changes more often and there are ever more features, more issues, and more bugs. The bigger your application gets and the more frequently you ship, the less feasible thorough manual testing becomes.

So how do you make sure your application stays healthy? How can you be confident that a deployment didn’t break something? The answer is we start to automate. We introduce tests, continuous deployment gateways, add dashboards, monitor logs, and so on. But what’s the cookbook for this process exactly? What are the best practices the big companies are using to make sure everything works well? This is what we’ll talk about in this blog post. We’ll see 9 best practices to be able to rapidly deploy changes and keep the application at the highest quality.

1. Use Feature Flags

When you’re a small startup with no customers, adding new features won’t interrupt anyone. As you get bigger, you’ll hopefully get some paying clients that you’ll want to keep. A feature flag is a mechanism that lets you turn off a piece of functionality in production if something goes wrong: a virtual toggle that you can switch.

if (IsFeatureFlagEnabled("MyNewFeature", defaultValue: false))
{
	// ...
}

The actual feature flag values might live in some remote database or service that you control. You can develop it yourself or use a management tool, like LaunchDarkly, that does exactly that: it stores your feature flag values, provides an API for your services to get them, and offers a front-end to easily change them, along with other useful things.

You can place feature flags on big changes or small ones. A feature can be protected by a “high level” feature flag that turns off everything and by “low level” flags that control smaller parts of the feature. If you want to be extra safe, you can introduce a policy that every code change should be protected by a feature flag.
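As a rough sketch of the “high level” and “low level” idea, here is a tiny flag lookup in Python. The `FLAGS` dictionary, the flag names, and `is_feature_enabled` are all hypothetical; the dictionary stands in for whatever remote database or service actually holds your values.

```python
# Hypothetical in-memory flag store; in production the values would come
# from a remote service or database that you can change at runtime.
FLAGS = {
    "checkout": True,                # high-level flag for the whole feature
    "checkout.express_pay": False,   # low-level flag for one part of it
}


def is_feature_enabled(name: str, default: bool = False) -> bool:
    """Return True only if the flag and all of its parent flags are on.

    Unknown flags (including unknown parents) fall back to `default`.
    """
    parts = name.split(".")
    # Check "checkout" before "checkout.express_pay", and so on.
    for depth in range(1, len(parts) + 1):
        key = ".".join(parts[:depth])
        if not FLAGS.get(key, default):
            return False
    return True


if is_feature_enabled("checkout.express_pay"):
    print("express pay is on")
else:
    print("express pay is off")
```

With this layout, switching off `checkout` disables everything under it, while `checkout.express_pay` can be toggled on its own.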

Feature flags allow you to develop code in the dark more easily. Instead of having a side branch that a development team keeps separate for months, work closely with your “main” branch, but keep the functionality turned off with a feature flag.

You’ll probably want to remove feature flags from the code over time, before your app becomes bloated with conditional statements.

2. Add telemetry to detect regressions

A key part of knowing your app works as it should is to have observability of what’s going on. Some glimpse under the hood to make sure there aren’t any fires or suspicious smells. You’ll want to know as soon as possible if something doesn’t work well. This kind of monitoring ability comes in many forms:

Application error logs - Add error logs generously throughout the application. If there’s an exception or unexpected behavior, go ahead and log it. A spike in the error log count quickly shows there’s a problem, and the log messages themselves often point to the root cause.
Execution times telemetry - One of the best indicators that an application is working well is its execution times. Those might be operation execution times, request execution times, or anything else that you can measure. They are easy to monitor and you can quickly detect anomalies like hanging requests and slow performance. Faster than usual execution times can also indicate something went wrong. You can log execution times yourself with application logs or use performance counters and application performance monitoring tools (e.g., Azure Monitor).
Crash telemetry - Crashes are bad and you want to know about them as soon as possible. A robust system to report crashes is going to go a long way. You can develop your own watchdog services or use crash-reporting tools like Raygun.
Build application health dashboards - Dashboards are great because, well, a picture is worth a thousand words. A simple dashboard can show if something went wrong at a glance. You can display request execution times, CPU & memory usage, fail rate, crash rate, etc. The more the merrier. If you can visually see something spiking or dipping, you’ve successfully detected a problem that you can fix before it severely impacts your business. There are great tools to build dashboards, including Azure Data Explorer, Grafana, and Datadog.
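If you go the do-it-yourself route for execution times and error logs, a minimal Python sketch might look like the following. The `timed_operation` helper and the `durations` dictionary are hypothetical; the dictionary stands in for a real telemetry sink.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("telemetry")


@contextmanager
def timed_operation(name: str, durations: dict):
    """Measure how long the wrapped block takes and record it.

    `durations` is a plain dict standing in for a real telemetry sink.
    """
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Log the failure with its stack trace, then let it propagate.
        logger.exception("operation %s failed", name)
        raise
    finally:
        # Runs on success and on failure, so every attempt is measured.
        elapsed_ms = (time.perf_counter() - start) * 1000
        durations.setdefault(name, []).append(elapsed_ms)
        logger.info("operation %s took %.1f ms", name, elapsed_ms)


durations = {}
with timed_operation("load_user", durations):
    time.sleep(0.01)  # placeholder for real work
```

Recording the duration in the `finally` block means failed operations are timed too, which helps when a slowdown and an error spike arrive together.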

3. Add telemetry alerts when something goes wrong

Adding logs and dashboards is great, but relying on someone to always be looking at them is destined to fail. The best way for observability to work is to automate anomaly detection. Set up your dashboards to send an alert notification when something goes wrong. If there’s a major slowdown in request durations, you’ll want to know about it ASAP. The same goes for an increase in error logs and process crashes. Pretty much anything you’re bothering to monitor is worth automating for notifications in case of a problem. Most tools that provide reporting and dashboard visualizations also have alert functionality, including Kibana, Azure Monitor, and Datadog.
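The monitoring tools handle this for you, but the underlying idea can be sketched in a few lines of Python. The `should_alert` function, the threshold factor of 2, and the sample numbers are made up for illustration; real alert rules are usually configured in the tool rather than hand-written.

```python
from statistics import mean


def should_alert(baseline: list[float], recent: list[float], factor: float = 2.0) -> bool:
    """Return True when the recent average exceeds `factor` times the baseline average."""
    if not baseline or not recent:
        return False  # not enough data to decide
    return mean(recent) > factor * mean(baseline)


baseline_ms = [110, 95, 102, 98, 105]   # typical request durations (made-up data)
recent_ms = [240, 260, 310]             # durations after a deployment (made-up data)

if should_alert(baseline_ms, recent_ms):
    print("ALERT: request durations more than doubled")
```

The same comparison works for error counts or crash rates. A production rule would likely also check for values that are suspiciously low.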

4. On-call engineers

Adding telemetry, customer feedback, and dashboards is not very useful if nobody’s looking at them. As companies grow, they usually start some kind of on-call rotation policy. It’s as simple as assigning weekly or monthly shifts among your engineers. During a shift, the on-call engineer is responsible for signing up for any relevant alerts and notifications, actively watching application health dashboards, and responding to anomalies. Usually, the on-call engineer won’t be the one who fixes the problem, but rather the one who assigns it to the relevant developer and mitigates it by turning off a feature flag or maybe restarting a server.

You might take the alternative route of permanent reliability engineers who act as watchdogs. I prefer the rotation approach, both to get the developers closer to the production front lines and to have the people who monitor the app’s health be the same people who fix the problems. It might add to their sense of responsibility.

5. Add an easy way for customers to provide feedback

Your customers, besides paying your bills, can be great QA engineers. Add an accessible way in your application to provide feedback. Make sure you can correlate the relevant application logs to the user’s feedback ticket. Besides free quality assurance, enabling a way to give feedback is a great experience for the customer.

Oh, and once you have a pipeline to get customer feedback, make sure to add a feedback counter as telemetry, create a dashboard, set up an alert, and have the on-call engineer receive notifications in case of a spike.

6. Use canary releases

Once your application grows enough, you can no longer afford to deploy changes to all your customers at once, no matter how well protected they are by feature flags and great telemetry. The risk is too big. To minimize this risk, it’s standard practice to use canary releases. Instead of pushing new code to everyone, you push it gradually through rings of customers. First, new changes are deployed to the first ring, which is a small subset of users, maybe an internal dogfooding environment. Then, the code is pushed to the second ring, which is a bigger subset of your users. Every time the code is deployed to a larger ring, you should monitor the application logs, dashboards, and customer feedback to make sure nothing broke. If something did break, you can catch it in time with minimal effect on your customers.
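One possible way to decide which ring a user belongs to is to hash the user ID into a stable bucket. This is only a sketch: the ring names, the ring sizes, and the helper functions below are hypothetical, and many teams define rings by environment or tenant instead.

```python
import hashlib

# Hypothetical rings: cumulative share of users covered once the ring is reached.
RINGS = [
    ("ring0-internal", 0.01),
    ("ring1-early", 0.10),
    ("ring2-everyone", 1.00),
]


def user_bucket(user_id: str) -> float:
    """Map a user ID to a stable number in [0, 1), in steps of 0.0001."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 10_000


def ring_for(user_id: str) -> str:
    """Return the first ring whose cumulative share covers the user's bucket."""
    bucket = user_bucket(user_id)
    for name, cumulative_share in RINGS:
        if bucket < cumulative_share:
            return name
    return RINGS[-1][0]


def gets_new_build(user_id: str, deployed_up_to: str) -> bool:
    """True if the user's ring is at or before the ring currently deployed."""
    order = [name for name, _ in RINGS]
    return order.index(ring_for(user_id)) <= order.index(deployed_up_to)


print(ring_for("user-42"), gets_new_build("user-42", "ring2-everyone"))
```

Because the bucket depends only on the user ID, a user stays in the same ring from one deployment to the next, which keeps their experience consistent while a release rolls out.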

7. Add experiments and A/B testing

When adding some change, like a new feature or a UX change, an A/B test is great to make sure the change is positive and didn’t introduce any problems. Add a mechanism to split your users into 2 or more groups and run different code for them. Monitor both halves of the experiment: the control group and the treatment group (or group A and group B). Are there any increases in error logs? How about spikes in customer complaints? Check your dashboards: are there anomalies?

Besides the underlying safety in such experiments, they are great for deducing which changes are problematic, as long as you have a good way to correlate telemetry to experiment groups. It’s as easy as logging all active experiments on session start and then correlating those experiments with problematic sessions.

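// ProblematicEvent is a placeholder for the event ID you are investigating.
LogsTable
| where EventID == ProblematicEvent
| distinct SessionID
// Keep only sessions that took part in the experiment.
| join kind=inner (ExperimentTable
    | where (Experiment == "newFeature-treatment" or Experiment == "newFeature-control")) on SessionID
// Count problematic sessions per experiment group.
| summarize count() by Experiment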

Read more about using Kusto Query Language in Maximizing the power of logs as your application scales.
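On the application side, splitting users into groups and building the labels to log at session start could look roughly like this in Python. The `assign_group` and `active_experiments` helpers are hypothetical, and the label format simply mirrors the "newFeature-treatment" style used in the query above.

```python
import hashlib


def assign_group(user_id: str, experiment: str, groups=("control", "treatment")) -> str:
    """Deterministically assign a user to one of the experiment groups.

    Hashing the experiment name together with the user ID keeps the
    assignment stable across sessions and independent between experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return groups[int.from_bytes(digest[:8], "big") % len(groups)]


def active_experiments(user_id: str, experiments: list[str]) -> list[str]:
    """Build labels such as "newFeature-treatment" to log at session start."""
    return [f"{name}-{assign_group(user_id, name)}" for name in experiments]


# Log these labels together with the session ID when the session starts.
print(active_experiments("user-42", ["newFeature"]))
```

Hashing instead of picking a random group at each session start means a user sees the same variant every time, so their telemetry stays attributable to one group.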

8. Do passive tests

Passive tests are a great way to make sure your changes behave correctly in a big environment that you don’t fully understand. Consider that when you make some optimization or a refactor in a huge application, you can’t predict with full confidence that you didn’t break something. Sure, you’ve got your test suite, but you never know for sure it covers all cases. Instead, you can do a passive test. Make whatever optimization or change you need in the dark, so that the new code runs passively alongside the original implementation. Run both the old way (actively) and the new way (passively) and then compare the results in logs. Roll it all the way to production and make sure the behavior matches between the original and changed code. Once you see it matches perfectly, you can make the switch from the old to the new with confidence.
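Here is a minimal Python sketch of that idea, assuming both implementations are free of side effects so that running them twice is harmless. The `run_with_passive_check` wrapper and the two `total_*` functions are hypothetical examples, not a prescribed design.

```python
import logging

logger = logging.getLogger("passive_test")


def run_with_passive_check(old_impl, new_impl, *args, **kwargs):
    """Return the old implementation's result; run the new one only to compare."""
    expected = old_impl(*args, **kwargs)
    try:
        candidate = new_impl(*args, **kwargs)
        if candidate != expected:
            logger.warning(
                "passive mismatch for args=%r: old=%r new=%r", args, expected, candidate
            )
    except Exception:
        # A failure in the passive path must never affect the caller.
        logger.exception("passive implementation raised for args=%r", args)
    return expected


def total_old(prices):
    """Original implementation, still the one callers rely on."""
    total = 0
    for price in prices:
        total += price
    return total


def total_new(prices):
    """Refactored implementation, exercised only passively for now."""
    return sum(prices)


print(run_with_passive_check(total_old, total_new, [5, 10, 20]))  # prints 35
```

Callers always get the old result, so a bug in the new code shows up only as a warning in the logs. Once the mismatch count stays at zero in production, you can swap the two.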

9. Test

I left tests for last because they are kind of obvious, but still, they are important and need to be mentioned.

One of the most important mechanisms to make sure the next deployment didn’t break anything significant is a test suite. The first step for that safety net is to have tests. This should be standard practice these days, but every once in a while I still encounter projects without them. Once you have tests, make sure to enforce them, which usually means adding a policy in pull requests that makes sure all tests pass.

When you have a continuous integration process, it’s time to consider your test strategy. Are you aiming for specific code coverage? Do all new features require some amount of tests? It’s hard to enforce these kinds of rules automatically, but it’s still healthy to define your team’s policy and try to follow it yourself and as a code reviewer.

Besides unit tests, I like having a decent amount of integration tests and end-to-end tests. Those won’t pinpoint the root cause when a test fails, but they are great at detecting bugs and making sure your system is working as a whole. Unit tests tend to break easily when you refactor or change behavior, even though the system works as expected, whereas end-to-end tests usually remain intact if you didn’t introduce bugs. I’d go as far as to suggest removing some or most unit tests after you’ve finished development and leaving only integration and E2E tests. Or at least be flexible enough to allow removing unit tests over time in favor of those wider scope tests.

Finishing up

We talked about 9 best practices to deploy an application safely. Once you have practices that provide confidence that your app works well and that your clients are happy, there’s no need to worry. Now all that remains is to scale your application and make enough money that all you have to worry about is charts and dashboards. So go write some code. Cheers.