Flying Slower is Better Than Flying Blind
If your code doesn’t work, no one cares how quickly it doesn’t work
Proper instrumentation is the only way to add 9's to your reliability, understand your customers, and reduce the misery of on-call. Data removes the guesswork from solving problems, but only if it's done correctly.
Telemetry is cheap
Before getting into the details, let’s set the record straight. You can’t afford not to instrument your service. Yes, the databases involved are expensive. Some of your compute time on production nodes will be spent as additional overhead. But this is one of the world’s truest false economies.
A recurring theme in these posts is to beat two facts into people's heads:
- There is no free lunch. Opportunity cost exists always and everywhere.
- Developer time is extremely expensive. That’s true when writing code and when operating it as a service.
If you don't properly instrument your service, you will invariably waste far more money in developer time than the instrumentation costs. More tickets get created, fewer of them heal or route via automation, and more time is spent scratching heads, trying to infer things, and running experiments.
Telemetry Musts
These are the table stakes for running any production service of importance or scale.
Use structured logs
It isn't 1982 anymore. Human-centric full sentences with interpolated values are a miserable way to capture logs. Modern telemetry is first and foremost designed to be machine readable, because in healthy environments we don't interact with logs through text files. They are ingested into a database that can be queried. With human-facing logs, complex queries need to be written to extract values, which is tedious and computationally expensive. Moreover, small changes to the log text can break those queries, and having all the data embedded in strings prevents the database from indexing the data as well as it otherwise could.
Don’t write logs like this:
[2024-08-28 15:05:32.273] [Info] Got request #4222 GET /foo/bar/35
[2024-08-28 15:05:32.282] [Warning] 35 is not a valid product ID in request #4222!
[2024-08-28 15:05:32.295] [Info] Returning 404 Not Found for request #4222. Took 22ms to process
Write logs like this (pretty formatted here for readability):
{
  "name": "Server.ProcessRequest",
  "level": "info",
  "id": "7d49222c-36dd-4eae-98e1-f3e58928390a",
  "duration": 22,
  "attributes": {
    "id": 4222,
    "method": "GET",
    "path": "/foo/bar/35",
    "responseStatusCode": 404
  }
}
{
  "message": "Not a valid product ID",
  "level": "warn",
  "span": "7d49222c-36dd-4eae-98e1-f3e58928390a",
  "attributes": {
    "productId": 35
  }
}
Distinguish between events and spans
In the example above, we distinguished between events and spans.
Spans track an operation over time: something with a start time and an end time. Events are instantaneous logs that occur within the context of a span and are linked to it. Processing a request from a client, reading a file from disk, running a database query: these are all spans. In the database query span, if you encounter bad data as you iterate over the records, that becomes an event.
In general, a span has a name, a duration, and custom attributes. I like to name my spans by the pattern:
Scenario.Action
Examples:
Server.ProcessRequest
Database.Query
If the code base is large with many teams, consider prefixing a namespace for that group / sub-component to disambiguate and avoid naming collisions:
Namespace.Scenario.Action
Examples:
Finance.Server.ProcessRequest
Warehouse.Database.Query
An event generally is a message with custom attributes. Some people will still do string interpolation and put the attributes in the event message, but this practice should be discouraged. String interpolation in event messages:
- Wastes resources and is redundant (time spent generating the string, data recorded twice)
- Makes it harder to query by message because content is dynamic
- Encourages the outdated mindset of thinking about logs as simple text messages
Rolling your own format is better than pure text logs, but consider using OpenTelemetry, which will get you compatibility with a massive instrumentation ecosystem from client libraries to monitoring agents, databases, monitoring suites, and much more.
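To make this concrete, here's a minimal sketch using Rust's tracing crate with tracing-subscriber's JSON formatter. The span name, fields, and the is_valid_product helper are illustrative, not a prescribed schema:

use tracing::{info_span, warn};

// Stand-in for a real catalog lookup; hypothetical helper.
fn is_valid_product(_id: u64) -> bool {
    false
}

fn process_request(method: &str, path: &str, product_id: u64) {
    // The span covers the whole operation. Variable data goes in
    // structured fields, not interpolated into a message string.
    let span = info_span!("Server.ProcessRequest", method, path);
    let _guard = span.enter();

    if !is_valid_product(product_id) {
        // An instantaneous event, automatically linked to the
        // enclosing span, with the offending value as a field.
        warn!(product_id, "Not a valid product ID");
    }
}

fn main() {
    // Emit one JSON object per span/event (requires tracing-subscriber's
    // "json" feature).
    tracing_subscriber::fmt().json().init();
    process_request("GET", "/foo/bar/35", 35);
}

Crates like tracing-opentelemetry can then export the same instrumentation to any OpenTelemetry-compatible backend without touching the call sites.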
Use metrics
Metrics are numeric measurements with associated metadata. The point of metrics is that they're extremely efficient to emit and to process. Metrics can be ingested by real-time monitoring at scale far more easily, which enables automation to trigger incidents, automatic service heals, and more.
// Indicate a request was processed, hence the 1.
emit("Server.ProcessRequest", 1, status_code = 200, latency_ms = 12, length = 4593, resource = "/foo/bar", node_id = "704f9ab0-31bf-4318-8767-a39eb2d2d9b3");
// At this moment, the process is using 89MB of memory.
emit("Exe.MemoryUsageMb", 89, node_id = "704f9ab0-31bf-4318-8767-a39eb2d2d9b3");
Use uniform tables
Don’t create tables for different event types. Have a single table for spans, a single table for events. This means you never need to update code / add new instrumentation harnesses. You can use standardized abstractions like Rust’s tracing. Most importantly, how you emit telemetry is decoupled from how you process and view / query it.
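As a sketch, a row in a uniform span table might have a shape like this (hypothetical column names; real systems vary):

use std::collections::HashMap;

// One row schema for every span type. The shared columns are fixed;
// everything type-specific lives in the free-form attributes map, so
// new span types never require a schema or pipeline change.
struct SpanRow {
    name: String,                        // e.g. "Server.ProcessRequest"
    id: String,                          // span ID
    start_time_unix_ms: u64,
    duration_ms: u64,
    attributes: HashMap<String, String>, // type-specific fields
}

Because the attributes column is free-form, emitting a new span type is just a code change at the call site; the storage and query layers never need to know.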
Let's say that you have some common data to work with, like HTTP requests. It may be worthwhile to have a dedicated table for these where the records are pre-parsed and the attributes put into dedicated columns. See if you really need it first. Telemetry databases are extremely fast, and unless you have vast amounts of this kind of data, they will be able to parse it on the fly as needed.
I work on a service where that isn't the case, and we have hundreds of terabytes of telemetry at any given point. For high-use, high-scale span types like HTTP requests, we "cook" those records into a separate table, using features in the database to automatically build this second table as records are processed. If we need to stop doing that, retain it for less time, or add another cooked table for another span type, all we have to do is update the config in the database; no changes to our service are necessary.
Have a central database
It still isn’t 1982 anymore. If you’re manually scrolling through text files, you are participating in a mass delusion and likely have an unhealthy organizational culture. This type and degree of Luddism is sure to permeate everything that team does.
Log all calls to external services
This includes external 1st party services, i.e. services owned by other teams at your own company! You need to be able to distinguish between your code and other people's, and to tell what's actually broken. Instrumenting the seams in software is important; there are inherent structural difficulties in bridging those gaps. When instrumentation is lacking, incidents bounce from team to team because no one can prove whose component is actually at fault.
If you properly instrument these seams, you can have seamless transfers. Moreover, if your platform supports distributed tracing, where each step in the chain exchanges tracking IDs to follow a process end to end, you can leverage automated fault analysis and directly route failures to the owners of the failing component.
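As a sketch of what "exchanging tracking IDs" can look like, here's a hand-rolled version of the W3C Trace Context traceparent header. In practice you would let an OpenTelemetry propagator do this for you; the IDs below are illustrative:

// Format: version "00", 16-byte trace ID, 8-byte parent span ID, flags.
// The trailing "01" marks the trace as sampled.
fn traceparent(trace_id: u128, parent_span_id: u64) -> String {
    format!("00-{trace_id:032x}-{parent_span_id:016x}-01")
}

fn main() {
    let header = traceparent(0x7d49222c_36dd_4eae_98e1_f3e58928390a, 0x4222);
    // Attach this to the outgoing request; the downstream service parses
    // it and parents its spans under ours, linking the trace end to end.
    println!("traceparent: {header}");
}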
Telemetry Mindset
Log “interesting” events
Telemetry is like comments. A lot of it is redundant garbage. Before logging an event, ask yourself if it's interesting. Is this something you'll ever want to consider in a live site situation? Is it useful relative to its frequency? For instance, some events are occasionally useful, but the failures they would catch are so rare that logging every occurrence would drown out the rest of your telemetry with noise.
If the frequency isn't causing a noise issue, log all outcomes of the interesting event, not only failures. Logging only failures paints a misleading picture; it should be reserved for topics where event volumes and success rates are both so high that noise would otherwise become a problem.
Finally, ask yourself: does this really need to be a dedicated log message, or is it better recorded as another property of the parent span? Leveraging fields / attributes / dimensions / properties, or whatever your framework calls them, is an effective way to improve discoverability and cut down on noise.
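For example, with a recent version of Rust's tracing crate, an outcome can be recorded as a field on the parent span instead of emitting a separate event. The span name and cache_hit field are illustrative; note that tracing requires fields to be declared up front, even if empty:

use tracing::{field::Empty, info_span};

fn query_products() {
    // Declare cache_hit as an empty field so it can be filled in later.
    let span = info_span!("Database.Query", table = "products", cache_hit = Empty);
    let _guard = span.enter();

    // ... perform the lookup ...
    let hit = true; // stand-in for the real outcome

    // Record the outcome on the span itself: no extra event, and the
    // value is queryable as a dimension of Database.Query.
    span.record("cache_hit", hit);
}

fn main() {
    tracing_subscriber::fmt().json().init();
    query_products();
}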
Instrumentation is an iterative process
Telemetry isn’t something you do just once. When you write new code, instrument where you think will make sense. Err on the side of a little too much. In production, see what works and what’s missing. Clean up where things weren’t helpful and fill in the gaps you missed the first time around.
Flying slower is better than flying blind
Doing everything outlined here is going to add overhead. Spooky, I know. There's string manipulation, lots of memory allocation, data serialization. Moreover, it's 100% cost: none of it directly contributes to the system's goals. But there's simply no substitute for being able to see what your system did when something went wrong. At a large enough scale it's hard to even know when something has initially gone wrong and is starting to spread.
You will encounter bugs. Unexpected inputs, integration hell, some weird issue caused by your hosting platform. Hopefully you aren't encountering many self-evident bugs in production; if you are, your testing needs some serious work. What even moderate-to-good pre-release testing can't always catch are the issues that tend to be rare and/or transient. You're unlikely to be able to repro them on your dev box until you already understand what went wrong.
Computers are really, really, really fast. There's no debate to be had here; the right choice is obvious.