I think we can all agree that logs are pretty damn important. Since the beginning of time (well, of software development), logs have offered an invaluable service: insight into the black box that is your program. They can tell a story of the program’s usage, report all of its errors, and give a glimpse of its performance under the hood. Somehow, all the remote debuggers and observability tools in the world weren’t quite able to replace logs. But I’m here to ask if there’s even more you can do with your logs. Are you sure you’re using them to their maximum potential?

Once your application scales from single users to hundreds, and then to tens of thousands, the ways to get information about your application’s performance change drastically. Relying on customer feedback gives you less insight since there are too many customers to ask. Any sort of manual work, like reading production logs by hand or manually testing a production environment, becomes unreliable. Everything comes down to automated processes, and logs are one of the best tools for that. They are still the easiest source of information you have from a production environment.

In this article, we’ll go over 6 types of information you can get by correctly utilizing logs. All of them are very different but ultimately provide observability into your application’s performance and usage. I’ll share some tips from my experience and best practices in logging, including structured logging and how powerful query languages can be.

1. Debugging

The most common use of logs is probably debugging and troubleshooting. If you have an issue in production, the first thing you can do is look at your logs. If your customer complains about something, logs might be the only information you have. The same goes for bugs that don’t reproduce consistently.

You do have to make some effort to be able to debug with logs effectively. Here are some basic tips:

  • Make sure to log all errors. Those should include exceptions, failed HTTP requests, logical problems, asserts, and crashes. Crashes are especially important but might be hard to log. Some runtimes do provide a way to register for unhandled exceptions and send logs just before a crash (see the sketch after this list). If not, you might have to create a monitoring service that waits for your app to crash and logs the exception details.
  • Use session IDs. Those are invaluable for going over the sequence of events or correlating logs to a customer complaint. Each log message should contain a session ID property. And while you’re at it, add a way for your customer to report a problem from within the app. The report button should, you guessed it, send a log with a complaint message and the session ID.
  • Use correlation IDs for requests. Those are different from session IDs. A session stays the same from the time your user starts using the application until they close it or stop; it describes a period of usage. A correlation ID, on the other hand, is unique for each request to the server, but it stays the same when the server calls another service, and for all following requests made within the context of the original request.
  • Have a unified logging system across all your services and the client. You don’t want to have to merge logs from different sources every time you’re trying to figure out an issue. While you’re at it, make sure the client and all server-side services log the session ID and correlation ID.
  • Log important state changes in the application or request. This will help later when trying to figure out the sequence of events while going over the logs. Usually, you’ll get those logs as a side effect of logging for telemetry.
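For example, here’s a minimal sketch of last-chance crash logging in .NET, using the runtime’s unhandled-exception hook (the _logger and its flush behavior stand in for whatever logging library you use):

AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
{
  var exception = (Exception)args.ExceptionObject;
  _logger.LogCritical(exception, "Unhandled exception, the process is about to crash");
  // if the logging library buffers entries, flush it here, or this log may die with the process
};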
Tip: To make sense of logs at scale, you’ll have to use structured logging. That’s when you log a bunch of key/value fields instead of a plain text string. It’s the only sane way to analyze logs at scale. Those fields let you attach data to each log entry that you can later filter on, sum, join, or whatever: the session ID, browser type, server name, an execution time metric, etc.
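For example, with a structured logging library like Serilog, you can push the session and correlation IDs onto the log context so every entry written in the scope carries them as fields (a sketch; it assumes the pipeline was configured with Enrich.FromLogContext(), and sessionId, correlationId, and request are whatever your app has at hand):

// every log written inside these scopes carries both IDs as structured fields
using (LogContext.PushProperty("SessionID", sessionId))
using (LogContext.PushProperty("CorrelationID", correlationId))
{
  _logger.Information("Handling request {Url}", request.Url);
}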

Here’s a Kusto query to get all logs from a single session.

Logs | where SessionID == "f6727475-0e5c-4bfc-8fab-6a6242a94198"

We’ll see later how to use more interesting queries to get invaluable insights from logs.

2. Telemetry

As applications become bigger and bigger, figuring out what happens to our millions of users becomes very hard. We need ways to gather information and somehow aggregate all that data into something cohesive. That’s why we need telemetry. Telemetry is data on what features our users use, how long they’re using them, how many users we even have, how often users use button A instead of button B, etc.

The term telemetry is sometimes used for any data you gather from production, including logs, performance metrics, usage statistics, and so on. That’s a bit confusing, so in this article I’ll use telemetry to mean just data on how consumers use your application or service.

You can imagine how invaluable this kind of information is to a company. If your users aren’t using some feature, you might need to make it more discoverable or remove it. If one item is more popular than another, you might want to promote it. If users are searching for something but have to scroll quite a bit to find it, the user experience is bad, and you’ll want to improve it.

Logs are perfect for that kind of analysis. Here’s a simple example of a query that shows feature usage. Entry “qb2ase” is logged when feature A is used.

// percent of sessions in the last day in which feature A was used
Logs
| where Time > ago(1d) // since 24 hours ago
| summarize UsagePercent = 100.0 * dcountif(SessionID, EntryID == "qb2ase") / dcount(SessionID)
// unique session IDs containing the entry, divided by all unique session IDs in the time range

Tip: Assign a unique ID to each log entry as a structured field. You can automate this, for example with an IDE snippet that generates the ID or a build-time check that the IDs stay unique. This will make your life much easier when querying and analyzing logs.

These examples are simple, but you can get very sophisticated insights with logs and a query language, like the sequence of user actions on a certain screen and how common each combination is, or how much time passes between user actions.
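For instance, here’s a sketch of a Kusto query that finds which entries most often follow feature A within a session, using next() to peek at the following row once the rows are ordered:

// which entry most often follows feature A ("qb2ase") within the same session
Logs
| where Time > ago(1d)
| order by SessionID asc, Time asc
| extend NextEntry = next(EntryID), NextSession = next(SessionID)
| where EntryID == "qb2ase" and SessionID == NextSession
| summarize count() by NextEntry
| top 10 by count_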

3. Performance & Metrics

Logs are great for measuring performance. They can provide the most reliable information because the data comes from a real-world environment.

With some effort, you can measure anything with logs, including function execution times, database request durations, HTTP request durations, UI render time, event handling time, etc.

There are APM solutions like Application Insights or Datadog that visualize all your requests, show the interactions between microservices, and work right out of the box. Those solutions are great, but in many ways there’s just more you can do with the raw data. And frankly, when your log infrastructure is mature enough, you might not need those other solutions.

Using logs for performance measurements is pretty straightforward. Here’s how you might measure a method execution time in C#:

public void MyFunc()
{
  var stopwatch = Stopwatch.StartNew();
  try
  {
    // ...
  }
  finally
  {
    stopwatch.Stop();
    // "afr7fst" is this entry's unique ID; this assumes a logger extension that takes
    // the entry ID as its first argument and logs Duration as a structured field
    _logger.LogInformation("afr7fst", "MyFunc time: {Duration}", stopwatch.ElapsedMilliseconds);
  }
}

After this method has run for some time in production, you can do something like this with Kusto:

// Find MyFunc execution time in the last 7 days, per day
Logs
| where Time > ago(7d)
| where EntryID == "afr7fst"
| extend Duration = tolong(Properties.Duration) // cast the dynamic field before aggregating
| summarize count(), percentiles(Duration, 50, 75, 95) by bin(Time, 1d)

The result might be:

Time                 count_  percentile_Duration_50  percentile_Duration_75  percentile_Duration_95
2021-03-06 00:00:00  125458  10                      20                      65
2021-03-07 00:00:00  256482  11                      22                      40
2021-03-08 00:00:00  206843  11                      18                      28
2021-03-09 00:00:00  199584  11                      26                      33

Looking at the 95th percentile, for example, we can see that on the first day, 95% of calls to that method ended within 65 milliseconds. Since there were 125,458 calls in total, about 6300 calls took longer than 65ms.

Tip: You can add these kinds of stopwatches in a middleware or a wrapper that encompasses every type of action, like any HTTP request or any call to the database. Make sure to add data about the action to the log, like the request URL or the database query (sanitized, of course).
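As a sketch, a timing middleware in ASP.NET Core might look like this (the message template and property names are mine, not a prescribed convention):

using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// times every HTTP request and logs the duration as a structured field
public class RequestTimingMiddleware
{
  private readonly RequestDelegate _next;
  private readonly ILogger<RequestTimingMiddleware> _logger;

  public RequestTimingMiddleware(RequestDelegate next, ILogger<RequestTimingMiddleware> logger)
  {
    _next = next;
    _logger = logger;
  }

  public async Task InvokeAsync(HttpContext context)
  {
    var stopwatch = Stopwatch.StartNew();
    try
    {
      await _next(context);
    }
    finally
    {
      stopwatch.Stop();
      // log the path rather than the full URL to keep query-string data out of the logs
      _logger.LogInformation("HTTP {Method} {Path}: {Duration} ms",
        context.Request.Method, context.Request.Path, stopwatch.ElapsedMilliseconds);
    }
  }
}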

4. Alerts

As your application scales, alerts play an ever-increasing role. They can automatically find issues in the wild jungle of a deployed application. Without them, you’ll have to rely on customer complaints, manual QA, or, god forbid, software developer intuition.

So when would you want to get an alert? A few examples are when your application starts crashing a lot, when there are more server errors than usual, when memory or CPU reach some dangerous threshold, or when you start having an increasing number of error logs. Once you log crashes, server errors, and memory & CPU spikes, you can easily build alerts from those. In fact, logs are probably the best tool for that because you’ll want to alert on regular ERROR-level logs just like you alert on CPU spikes and crashes.

Creating an alert should be pretty simple. Each bad occurrence should be logged as an ERROR or FATAL entry. Poll for all ERROR+ logs every hour or so, and if there’s a spike in comparison to the last day/week/month, send out an alert.
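Here’s a sketch of such a check in Kusto, assuming each entry carries a Level field:

// is the last hour's error count more than 3x the hourly average over the past week?
let lastHour = toscalar(Logs | where Time > ago(1h) and Level in ("ERROR", "FATAL") | count);
let hourlyAverage = toscalar(Logs | where Time > ago(7d) and Level in ("ERROR", "FATAL") | count) / (7.0 * 24.0);
print ShouldAlert = lastHour > 3 * hourlyAverage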

Those alerts are most useful when accompanied by blame information, like which commit caused the issue or the date and time when the issue started happening. It’s even better if you’re deploying your experiments to a fraction of your customers, in which case you’ll be able to build automation that deduces which experiment caused the dangerous behavior.

5. Experiments & A/B testing

Let’s say you want to try out a new feature or a performance optimization. As it goes with software, there’s no way to tell how it will do without data from production. And logs are perfect for getting that data.

To do A/B testing with logs, create a feature flag that includes both options, and randomly assign that feature flag to sessions or users.
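A minimal sketch of that assignment in C# (the group names match the query below; the hashing scheme is my own, chosen because string.GetHashCode() is randomized per process in .NET):

using System.Security.Cryptography;
using System.Text;

// deterministically assign a session to a group, so the same session always gets the same variant
static string AssignGroup(string sessionId)
{
  using var md5 = MD5.Create();
  byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sessionId + ":MyOptimization"));
  return (hash[0] & 1) == 0 ? "MyOptimization-experiment" : "MyOptimization-control";
}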

Let’s take a simple experiment where I’m trying to introduce an optimization that should improve performance. I’m already logging the execution time that this experiment should improve in the log entry afr7fst. Here’s how I would check the result of my experiment in Kusto:

let startTime = ago(7d); // past week
let controlGroup = "MyOptimization-control";
let experimentGroup = "MyOptimization-experiment";
Logs
| where Time > startTime
| where EntryID == "afr7fst"
| join (Experiments | where ExperimentName == controlGroup or ExperimentName == experimentGroup) on SessionID
| extend Duration = tolong(Properties.Duration) // cast the dynamic field before aggregating
| summarize count(), percentiles(Duration, 50, 75, 95) by ExperimentName

The result might be something like this:

ExperimentName             count_   percentile_Duration_50  percentile_Duration_75  percentile_Duration_95
MyOptimization-control     107,941  5                       12                      38
MyOptimization-experiment  105,678  4                       7                       9

There you go, the optimization turned out to be a great success.

Note that I used join with a table called Experiments. The idea is that you create a table containing a mapping of all sessions to the experiments running in those sessions. This allows you to cross-reference logs with those experiments and find out information about performance optimizations, A/B tests, memory usage, new feature rollouts, etc.
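In Kusto, such a table might be created with a schema like this (a sketch; the exact columns are up to you):

// one row per (session, experiment) assignment
.create table Experiments (SessionID: string, ExperimentName: string, AssignedTime: datetime)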

6. Testing

I wasn’t sure whether to include this part because it smells just a bit.

Logs provide a very convenient way to add assertions to end-to-end tests that you would otherwise be hard-pressed to make. When you’re doing end-to-end tests, getting information from the black boxes you’re testing is hard. To assert that the program is running as you expect, you’d have to change the code in places to output data just for the test’s sake. Logs give you a convenient way to avoid doing anything complicated: as part of the test, you can consume the logs and assert that the logs you expect are printed and the logs you don’t expect aren’t.
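As a sketch, such a test might look like this (InMemoryLogSink and TestApp are hypothetical stand-ins for whatever your test harness provides; the assertions are xUnit):

[Fact]
public void Checkout_LogsPaymentAndNoErrors()
{
  var sink = new InMemoryLogSink();     // hypothetical collector of structured log entries
  using var app = TestApp.Start(sink);  // hypothetical end-to-end test harness

  app.RunCheckoutScenario();

  // "pay123" is a made-up entry ID for the log the scenario should produce
  Assert.Contains(sink.Entries, e => e.EntryID == "pay123");
  Assert.DoesNotContain(sink.Entries, e => e.Level == "ERROR");
}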

While very convenient, this makes the tests depend tightly on the logs. Developers don’t usually assume logs affect any functionality, and feel free to change the log contents or remove them altogether, or to stop using one function in favor of another without duplicating its log entries. I had that situation myself when I removed a couple of logs and got a dozen failed tests. It took me a long time to figure out the logs were at fault.

Still, if you have a strong suite of end-to-end tests, I would say it is worth it. Developers will adjust as we always do.

Finishing up

I hope I convinced you that logs have much more potential than just debugging. They provide powerful observability features like telemetry, metrics, and alerts, and combining those with experiments gives you invaluable insight into the application.

Realizing that potential does require some effort. You’ll need to use structured logging, integrate with an infrastructure that allows powerful querying, develop automation for alerts, and introduce a mechanism for experiments. But it’s a small price to pay for the huge benefits to be had.

Note that some of the mentioned abilities can be replaced by out-of-the-box solutions like APMs and production debuggers. These solutions aren’t exactly an alternative; they’re complementary to logs. And even if you do use them, you’ll probably want to invest in your logging infrastructure as well.