
Flying Blind is Better than Crashing

Flying slow is better than flying blind, but flying blind is better than crashing

How Do You Raise the Alarm When the Alarm is Broken?

I recently reviewed a PR adding instrumentation to a new project. One of our engineers was following the best practice of not allowing silent errors. They were also taking advantage of Safe Deployment Practices. As I noted when discussing whether recoverable errors should be recoverable, with SDP it’s often better to respond to even low severity recoverable errors by crashing, because doing so generates a failure signal that will halt the deployment of the broken changes.

This engineer also subscribed to this way of thinking, so they added an .unwrap() to the telemetry initialization logic. In Rust, this means that if the result is an error type, the program will immediately crash. That wasn’t the correct choice, so let’s discuss why.
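To make that concrete, here’s a minimal sketch of the pattern in question. The types and function names are placeholders I’ve invented for illustration, not the actual code from the PR.

    // Hypothetical types standing in for whatever telemetry crate the project uses.
    struct TelemetryHandle;

    #[derive(Debug)]
    struct TelemetryError(String);

    // Stand-in for the real initialization: connect to a collector, register
    // exporters, and so on. Here it simply fails so the unwrap below is visible.
    fn init_telemetry() -> Result<TelemetryHandle, TelemetryError> {
        Err(TelemetryError("collector unreachable".into()))
    }

    fn run_service() {
        // The actual product logic would live here.
    }

    fn main() {
        // The pattern under review: if initialization returns Err, unwrap()
        // panics and the process crashes before serving any traffic.
        let _telemetry = init_telemetry().unwrap();

        run_service();
    }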

Expanding on Recoverable vs Non-Recoverable

This gets at the opposite of the question I discussed previously: should non-recoverable errors be non-recoverable? Just like with the inverse question, the answer is that it depends.

A non-recoverable error should result in a crash when the failed component or operation is integral to the running of the product as a whole, and as a result there’s no degraded state to run in that makes sense.

Non-recoverable means in the moment, not permanently. Part of why we want to crash is that a restart may clear the fault.

Examples:

  1. A shared library fails to load or the process runs out of memory. There isn’t a sensible next step to entertain here. We’ve violated the invariants of our program so badly that anything further would be undefined behavior. We have to crash to avoid corruption or memory exploits.
  2. Imagine you have an API gateway in front of two microservices and one of them goes down permanently from a bricked update. Should we crash in the gateway? No. We’ve lost 50% of our functionality, sure, but the other 50% is isolated and still works fine. By allowing the failure to spread we’re just making the problem worse (see the sketch after this list).
  3. Your billing service goes down and customers are getting free compute units. Nope. First of all, the brand damage from screwing your customers there would be immense. Secondly, you don’t get to bill them when the product goes down anyway; your contract will reimburse them. Either way, you’re out the money, so take the L in stride and keep the system running.
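Here’s a rough sketch of the routing decision from example 2, with made-up service names and no real networking. The point is that the gateway degrades the broken route to an error response instead of crashing and taking the working route down with it.

    // Illustrative only: the enum and paths are invented, not a real gateway.
    enum Backend {
        Healthy,
        Down,
    }

    fn route(path: &str) -> Backend {
        // Assume /orders is the bricked microservice and everything else still works.
        if path.starts_with("/orders") {
            Backend::Down
        } else {
            Backend::Healthy
        }
    }

    fn handle(path: &str) -> (u16, &'static str) {
        match route(path) {
            // The healthy half of the product keeps serving traffic.
            Backend::Healthy => (200, "ok"),
            // The broken half degrades into an error response instead of
            // crashing the gateway and the working service along with it.
            Backend::Down => (502, "dependency unavailable"),
        }
    }

    fn main() {
        assert_eq!(handle("/search?q=widgets").0, 200);
        assert_eq!(handle("/orders/123").0, 502);
    }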

So where does telemetry fit in? First, let’s touch grass and consider an instructive real world scenario.

Toilets and Fire Alarms

My first post-college job was at Microsoft, but in Fargo, ND. The Fargo campus is bigger than you’d probably imagine, and at the time consisted of 4 fairly large buildings. One of the buildings had a strange curse: during after-work functions, the fire alarm would sometimes go off for no discernible reason.

There wasn’t a fire, smoke, or lightning. There wasn’t anything that would cause a false alarm. In fact, the system was working by design! Yet time after time partygoers would find themselves standing in the parking lot, drink in hand, watching the fire department confirm that the very much not on fire building was indeed not on fire.

So what gives?

Well, what happens when you condense hundreds of people into a smaller area than normal and pump them full of refreshments? They urinate. A lot. Dare I say profusely.

In large buildings, commercial and residential, you’ll notice the sprinkler systems. What you might not realize is that they are continuously full of high-pressure water. This is because they’re activated by pressure: the water extends the sprinkler and that causes it to spray. Plus, it’s a lot of piping. If there’s a fire, we need to spray it now. We don’t want to wait for a building’s worth of plumbing to fill up.

Commercial buildings also have more centralized mechanisms for generating that high-pressure commercial flushing action we’re all so used to. So what happens when pretty much every toilet is flushed at the same time? The water pressure drops. It drops so far that it falls below what is necessary to activate the sprinklers in case of emergency.

The building probably isn’t on fire. It probably will continue to be safe. But what if it isn’t? The reason the sprinklers are there is that they’re necessary to make such a large structure safe. By definition then, if the sprinklers aren’t working, the building isn’t safe. To account for this, the fire suppression system is equipped with sensors, and if the pressure drops too low it will automatically activate the fire alarm.

Application to Telemetry

The lesson here is twofold: if your monitoring fails, you must assume the worst, and you must fail gracefully into as good a state as you can. Telemetry isn’t integral to the operation of your product. It’s very important, but for the user it’s a transparent feature. The app or service will work fine without it.

In the case of the fire sprinklers, the alarm goes off, but nothing wild happens either. You don’t try to turn on parts of the system that are still working, you don’t cut the power, you don’t do anything dramatic. The system is marked for human investigation and warnings are issued. This process must be externally driven by a second layer of monitoring: the smoke detectors monitor for fire, and the water pressure sensor monitors the monitoring system.

Both factors apply to telemetry. Telemetry should never* induce the service to crash. In languages with exceptions, this also means it should never throw.

*In my opinion, it is ok to allow memory allocation failures to cause crashes in telemetry. If allocation is failing in your telemetry, it’s about to fail everywhere else soon anyway, so I don’t think handling it is a feature worth the effort. That said, it is possible to avoid this. You can have well-defined behavior when an allocation fails by aborting the log operation and dropping the data. That’s no different from crashing, where in-flight telemetry data is lost anyway.
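For the first half of that lesson, here’s a minimal sketch of the non-crashing alternative to the unwrap() above. The names are again invented for illustration; the idea is that a failed initialization degrades to running without telemetry rather than taking the service down.

    // Same hypothetical types as before.
    struct TelemetryHandle;

    #[derive(Debug)]
    struct TelemetryError(String);

    fn try_init_telemetry() -> Result<TelemetryHandle, TelemetryError> {
        Err(TelemetryError("collector unreachable".into()))
    }

    // On failure, fall back to running without telemetry instead of unwrapping.
    // The service keeps serving traffic while flying blind; the outage is caught
    // by the second layer of monitoring described below, not by crashing here.
    fn init_telemetry_or_none() -> Option<TelemetryHandle> {
        match try_init_telemetry() {
            Ok(handle) => Some(handle),
            Err(e) => {
                // Best-effort breadcrumb to stderr.
                eprintln!("telemetry init failed, continuing without it: {e:?}");
                None
            }
        }
    }

    fn main() {
        let _telemetry = init_telemetry_or_none();
        // run_service() would go here.
    }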

However, we need to assume the worst. If you’re not getting telemetry, you have to assume there’s an emergency outage ongoing, because if there were one, you’d be missing it. Most of the time nothing bad is happening, but we can’t take the chance. Generally speaking, this is accomplished either by monitoring that the number of reporting nodes meets your expectation, or by measuring drops in overall log volume. The first one is better, but not possible in all workloads.
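A back-of-the-envelope sketch of the first approach, assuming you can query how many nodes reported telemetry recently and how many you expect from your deployment inventory. The functions, numbers, and threshold are all illustrative.

    // Illustrative placeholders: in practice these would query the telemetry
    // backend and the fleet/deployment inventory respectively.
    fn reporting_nodes_last_5m() -> usize {
        42
    }

    fn expected_nodes() -> usize {
        64
    }

    fn main() {
        let reporting = reporting_nodes_last_5m();
        let expected = expected_nodes();

        // Treat missing telemetry as a potential outage: if too few nodes are
        // reporting, raise the alarm even though nothing "bad" was logged.
        if (reporting as f64) < 0.9 * (expected as f64) {
            eprintln!(
                "ALERT: only {reporting}/{expected} nodes reporting telemetry; \
                 assume an outage until a human confirms otherwise"
            );
        }
    }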

All rights reserved by the author.