
Not All Heartbeats Are Created Equal

Measurements are always proxies

There’s a common adage that once a metric becomes a target, it ceases to have any value as a measurement. That’s not what we’re here to discuss today, but it shares an underlying concept. Much of what we value is the result of transitive relationships. We don’t value little green pieces of paper or ever-increasing numbers of zeros in our checking accounts; we value the things money grants us access to.

So too is our relationship with performance and quality metrics. It may be tempting to think otherwise. In software, our metrics are often very tangible. The P50 and P99 latency times on our services aren’t just scores; they’re marked determinants of the experience a user will have with our products. Even so, the phenomenon emerges. If all your metrics are excellent, say the P99 on loading your web page is 5ms, what value is there in continued improvement? When we decrease an imperceptible amount of time to an even less perceptible amount of time, we’ve crossed purely into the realm of the scoreboard.

What we value is always indirect. The better experience, the reputational gain, the approval of those who decide how much money we make this year.

Value free measurements

This is pretty banal, though. We’re all aware of this transitive nature of value; that’s why phrases like “gaming the system” exist. But it’s important to be mindful of this familiar territory when we consider value-free measurements.

What does it mean for a measurement to be “value-free”? It means that there’s no moralized or otherwise intangible value to the final measurement. There’s no connotation, prestige, or other emergent benefit. Take, for instance, the heartbeat. The heartbeat is a simple ping into the void: an acknowledgement that the service is here and doing things. It may include helpful information about the identity of the instance or its version number and capabilities.

There’s no subjective, let alone inherent, value to be found here. We aren’t proud of our heartbeats; they can’t be compared and ranked. This isn’t true of uptime, but a heartbeat isn’t uptime. It’s a (poor) option for measuring uptime through aggregation, but the instantaneous measurement is itself devoid of value.

We may not be able to compare ourselves against each other with heartbeats, but we can compare different heartbeat designs.

What are we actually measuring?

The point of the heartbeat, setting aside any instance metadata it includes, is to advertise that the system is live and in a specific state. Generally heartbeats are simple “I’m good” pings, but you can also have them periodically advertise that the instance is in a faulted state. Regardless of which model we choose, the same question arises: how do we determine what’s a good state?

Like all metrics, our heartbeat is a proxy. It’s a proxy for “the agent is good”. Even if unhealthy heartbeats are a part of our model, the application still needs to be healthy enough to emit the metric, as well as able to accurately distinguish between its healthy and faulted states.
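To make that concrete, here’s a minimal sketch in Python of a heartbeat loop that self-reports a healthy or faulted state. The emit function is a hypothetical stand-in for whatever telemetry pipeline is in play, and the self-check is deliberately left as a placeholder:

```python
import threading
import time

def emit(metric: str, **dimensions) -> None:
    """Hypothetical stand-in for publishing into a telemetry pipeline."""
    print(metric, dimensions)

def check_health() -> bool:
    """Whatever internal checks the service can actually perform on itself."""
    return True  # placeholder

def heartbeat_loop(interval_seconds: float = 30.0) -> None:
    while True:
        # The process must be well enough to keep this loop running, and the
        # self-check must correctly distinguish healthy from faulted.
        state = "healthy" if check_health() else "faulted"
        emit("heartbeat", state=state)
        time.sleep(interval_seconds)

threading.Thread(target=heartbeat_loop, daemon=True).start()
```

Both halves are part of the proxy: if the loop stops running or the self-check lies, the heartbeat stops telling you what you think it’s telling you.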

Web servers

An eminently common piece of software that emits heartbeats is the web server. Whether you’re thread-based, thread-pool-based, worker-model, or asynchronous, it’s common to have a periodic operation on a polling thread (or task) that emits heartbeats. For the web server to be healthy, it needs to be accessible over the network and capable of providing valid (non-5XX status code) responses to clients.
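To illustrate the shape of that arrangement, here’s another Python sketch (purely illustrative, not any particular production server). Notice how little the heartbeat path shares with the request-handling path:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def emit(metric: str, **dimensions) -> None:
    """Hypothetical telemetry publish."""
    print(metric, dimensions)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The request-handling path: this is what clients (and real uptime) depend on.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def heartbeat_loop():
    while True:
        # This only proves the polling thread is alive and able to publish.
        # A deadlock confined to the request handlers would not stop it.
        emit("heartbeat", state="healthy")
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()
ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```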

For the heartbeat to be healthy, the polling thread just needs to be live and capable of publishing. This discrepancy led to a lot of confusion in a product I work on. Our web server was running, emitting heartbeats, and even emitting telemetry about other background tasks. We could also see that it had been handling requests from all provisioned clients up until roughly the same point in time. The request volume would fall off a cliff for a while and eventually come back.

This had all the classic signs of a networking failure, which was a semi-common incident for us to encounter. Only after much consternation and back-and-forth did we eventually learn that an obscure regression in code we didn’t work on would cause a deadlock, but only in the request-handling portion of our web server, and only when the number of clients grew sufficiently high to cause a hash table to resize.

It is worth pointing out that, in this model, if we had been using an asynchronous architecture and scheduled the heartbeat as a periodic task on the shared runtime, it too would have halted from the deadlock. However, that’s specific to this particular failure mode; there are other reasons not to consume the shared runtime for this task, and other failures would still go undetected in this regime.

Outside-in and Inside-out monitoring

When we measure the state of a system, we have two options:

  • Outside-in: where some external piece of software probes the system. This could be through the operating system, by trying to actually consume the service as a fake user, and more.
  • Inside-out: where the software itself takes stock of its situation and reports out to the world. Heartbeats are an example of this.

Generally we want a mix of the two. Each is capable of answering questions the other isn’t. Inside-out can’t confirm that the infrastructure networking is configured correctly. Outside-in from another network can’t assert the resource utilization of the system (otherwise you’d have a security vulnerability on your hands).

Heartbeats are a form of inside-out. Even here we have a diversity of choices.

Monitoring uptime on our new web server

As a part of the Azure Boost effort, my team is currently replacing a couple of legacy code bases and consolidating them into a single new web server. In the early days we were deciding what to copy over and what to drop, as well as which platform options were newly available to us or that we simply hadn’t considered before.

Inside-out

This was pretty straightforward. The new service has a background thread that periodically emits heartbeats and we emit metrics for other important operations. Every external service call has a metric associated with it. As we covered in Flying Slower is Better Than Flying Blind, it’s critical to know how data and actions flow through a system end to end. We need to be able to isolate failures to specific links in the chain.

In our case, we emit a metric (a counter) for every external interaction. The metric is identified according to the scenario (receiving a request, receiving a notification, calling an external data store, etc.) and includes data about performance and outcome: success or failure / status code, size, latency or processing time, and identity or purpose context when applicable.
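A rough sketch of that shape in Python (the metric name, dimensions, and emit_counter helper are illustrative, not our actual schema):

```python
import time
from contextlib import contextmanager

def emit_counter(metric: str, **dimensions) -> None:
    """Hypothetical publish into the centralized telemetry pipeline."""
    print(metric, dimensions)

@contextmanager
def track_interaction(scenario: str, **context):
    """Wrap an external interaction so outcome and latency are always recorded."""
    start = time.monotonic()
    outcome = "success"
    try:
        yield
    except Exception:
        outcome = "failure"
        raise
    finally:
        emit_counter(
            "external_interaction",
            scenario=scenario,  # e.g. "receive_request", "call_data_store"
            outcome=outcome,
            latency_ms=int((time.monotonic() - start) * 1000),
            **context,  # identity or purpose context when applicable
        )

# Usage: every link in the chain gets its own metric.
with track_interaction("call_data_store", purpose="fetch_vm_metadata"):
    pass  # the actual call would go here
```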

Our heartbeats advertise the node metadata (node id, cluster, datacenter, region) and the instance’s version number. For reasons we’ll explore more below, the heartbeat alone can’t actually be used to determine that the instance has gone down. Instead, we have to author monitors against multiple data streams to infer that there should be a heartbeat where one is missing.
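As a sketch of how that inference could be expressed (over two hypothetical per-node data streams, not our actual monitor definitions):

```python
from datetime import datetime, timedelta

def find_missing_heartbeats(last_request_by_node: dict,
                            last_heartbeat_by_node: dict,
                            now: datetime,
                            max_gap: timedelta = timedelta(minutes=5)) -> list:
    """Cross-stream inference: recent request traffic proves an instance should
    exist, so a node that served requests recently but hasn't heartbeated lately
    is suspected of being down (or of having lost its telemetry path)."""
    suspects = []
    for node_id, last_request in last_request_by_node.items():
        if now - last_request > max_gap * 2:
            continue  # no recent evidence that the instance should be up
        last_beat = last_heartbeat_by_node.get(node_id)
        if last_beat is None or now - last_beat > max_gap:
            suspects.append(node_id)
    return suspects
```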

Outside-in

Our situation is a bit… atypical. My web service actually runs as part of the physical node an Azure VM is hosted on, meaning that we have one instance per server, and that instance is responsible for servicing the VMs currently hosted on that same node and only those VMs. In fact, for security reasons the network topology is such that each instance of our service is only accessible from within that physical server.

This makes outside-in monitoring challenging. We can’t create a server somewhere else in the data center or have off-the-shelf software probe our service over the internet, because our service doesn’t actually exist on the internet. This visibility issue is common to all services on these nodes, so we actually have an extra service on every node to solve the problem. We don’t have just one thing probing the service as a whole; there’s a watchdog on each node responsible for monitoring all the services on that node.

There are several ways it can do this, and the two that were being discussed were:

  1. Define a web request that will periodically be sent to the web server expecting a specific response
  2. Define a log file that the service writes heartbeats to, which the watchdog will periodically check

I was reviewing the proposal, which suggested that we do both. Naturally, I needed to object on both fronts to bolster any accusations of being a contrarian.

The Heartbeat File

The heartbeat file serves no purpose in our use case. First of all, we already have a heartbeat that we publish into our centralized telemetry. Secondly, this isn’t really outside-in. Our service still has to write the heartbeat; the watchdog just converts that signal into data submitted to our centralized telemetry. By this definition, all monitoring is outside-in, because an external service will monitor the fleet-level data looking for discrepancies to fire alerts off of.

In fact it’s really just a worse version, because there are more points of failure in publishing the data. The real purpose of this platform feature is that each service owner can’t easily know how many instances of their service should be up. Azure is massive; there are many types of nodes, and nodes can get recycled and acquire new identities in the process. It’s not impossible to hook into the fleet management data, but it’s a pain in the ass and overly complicated.

Instead, the watchdog is basically a cheat for the problem we described above: determining how many heartbeats there should be. In our case that still isn’t useful, though, because we’re already leveraging this capability through the watchdog’s other check at the request level. If the service is responding to requests, it’s also emitting heartbeats. Plus, the heartbeat is a less direct measure: customers care whether their requests work, not whether our internal heartbeat is healthy.

It’s purely redundant, therefore, and we’re obligated to reject the feature on opportunity-cost grounds.

The Web Requests

I didn’t disagree with the suggestion that we monitor via web requests; it is a web server, after all. What did strike me as unnecessary was the suggestion that we add a dedicated /health endpoint to the API for this purpose. There were several issues with this idea:

  • If the endpoint does anything useful, it’s leaking internal data.
    • You could devise a way that only the internal watchdog can access it, but this is overly complicated, error-prone, and redundant. The agent is obviously going to have telemetry. There’s no reason it can’t just directly report its own health state; we don’t need the watchdog to ask us to self-report and publish on our behalf.
  • If it’s not emitting metadata, then it’s redundant. We already have existing endpoints we could ping instead.
    • In fact, that’s precisely what we do. Our API has an endpoint that lists all supported API versions, so we just query that endpoint to establish health (there’s a sketch of this style of probe after this list). It’s always present no matter the server SKU or the service’s build number, it’s not dependent on external sources, and it’s the cheapest endpoint computationally.
  • If it’s unused or internal-only, then it’s a worse measure of the sort of uptime the user actually cares about.
  • It’s more work. It’s an extra thing to create, discuss, and maintain. Yes, it’s a small cost, but it’s still a cost. We have a word for cost that has no reward: waste!
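Here’s the sketch promised above of that style of probe, in Python. The /versions path, port, and threshold are placeholders, not our actual API:

```python
import urllib.request

def probe_versions_endpoint(base_url: str = "http://localhost:8080",
                            timeout_seconds: float = 2.0) -> bool:
    """Watchdog-style check: query an endpoint the server always exposes
    (here, a hypothetical list-supported-API-versions route) and treat any
    non-2xx response, timeout, or connection failure as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/versions", timeout=timeout_seconds) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

if __name__ == "__main__":
    print("healthy" if probe_versions_endpoint() else "unhealthy")
```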

It’s true that these aren’t major issues, but it’s still important to hold the quality bar here in order to:

  • Create a culture of excellence and review scrutiny.
  • Build skill in the engineers. Even if people’s mistakes are small and relatively immaterial, the mistaken thinking that leads to them will someday be replicated in an arena where it truly matters.