How Quickly Should You Patch?

It’s always fun to watch outsiders try to piece together the internal chaos you’re fully aware of.

At time of writing, I’ve worked at Microsoft for about 6 years. In that time I did a short stint in Dynamics before spending the majority of my time deep in the bowels of Azure’s basement. I’ve primarily worked on Azure IMDS, which I’ve nicknamed “the octopus” because it has so many integrations that we have our tentacles in pretty much everything.

Before continuing, note that these are my opinions of the situation. These are not statements on behalf of Microsoft and they may or may not align with the organization’s positions.

This unique vantage point has coincidentally exposed me to areas that are sometimes covered in public media. It’s always interesting, and it tends to go through these phases:

  • “Oh wow, you’re doing great, you’ve figured out so much.”
  • “Wait, what? You got so much, how did you miss that!”
  • “Ha, you’re more right than you realize; this goes so much deeper than I’m allowed to share.”

I don’t consider sharing that experience to be bold or outspoken. If you’ve ever worked on legacy code, massive code, critical code, or a mix thereof, you know how chaotic things can get. While I’m sure some places are somewhat better than others, the reality is any time you get large groups of people together certain incentives will prevail and the results will inevitably converge around a few themes.

Recently, this video from Kevin Fang came out: How Bad Leap Day Math Took Down Microsoft

A middle manager shared it in a big group chat because it covered one of our products, and some of the older people were even around at the time and worked on this very incident! It’s well done, humorous, and worth a watch.

Watch at least the intro before reading on for best effect!

Kevin Fang Dove Deep

The opening to this video is such a banger. After a funny joke, the script goes:

… faced a major outage when its VMs’ GAs failed to generate transfer certificates, causing HAs to report the servers as faulty to the FCs, which would trigger automatic service healing which would inadvertently exacerbate the issue and eventually take down the entire cluster.

This is a clever hook for many reasons. He’s set the context, done it engagingly, and overwhelmed the viewer with a ton of terms. You could even say it thematically captures how quickly things can spiral out of control in an outage and how complicated it can get to unravel. He closes the sale by stepping back to acknowledge that the rapid-fire explanation probably makes no sense yet, but will by the end. I love the way this intro was structured.

That said, I had a very different experience than he probably intended.

As someone who actually knows what all these things are but wasn’t aware of this story (this outage happened when I was still in high school), I was thrown off by how accurate, succinct, and aware of internal systems it all was. I immediately understood what was happening and why that would cascade into an outage.

Moreover, I was blown away that he mentioned the transfer certificates, because they’re a highly obscure, undocumented aspect of the system that my team and I own today. It turns out a lot of this comes from a blog post published after the outage, but I’m still impressed by Kevin’s ability to weave it into a succinct and compelling story.

His next line truly breaks me:

… and think “Huh, it all makes sense now. Perhaps I learned something on YouTube today, even if it was just a bunch of domain specific terms I’ll never see again for the rest of my life.”

Kevin. My friend. My spiritual connection. I have never felt more seen than in this moment. I have accumulated vast swathes of this type of knowledge at Microsoft, and you’ve captured so well the feeling one gets after fixing an issue caused by obscure problems. Even when you’re the owner of the obscure behavior and have gone through this process, that “I’ll never see this again” factor often still plays in.

By 6:38 PM the devs discovered and had a good laugh at the trivial leap day logic, and the fact they weren’t going to sleep for the next 24 hours.

This guy gets it. A wise man once said “if I didn’t laugh, I’d cry”, and there’s no truer ethos when you’re the on-call engineer. In fact, just last week I was on a bridge until 4am for an even more trivial issue: someone changed the system temp directory in a way that breaks my service, but only on highly specialized SKUs and only where the change had partially rolled out 🥲

TL;DR: The Leap Day Bug

Just watch the video, dummy.
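
Okay, fine. As the video (and the postmortem blog post it draws from) explains, the GA computed a transfer certificate’s valid-to date by taking the current date and bumping the year by one, which produces a date that doesn’t exist when “today” is February 29. Here’s a minimal Python sketch of that class of bug, not the actual GA code:

```python
from datetime import date

def naive_valid_to(issued: date) -> date:
    # "One year from now" computed by bumping the year and keeping the
    # month and day. Works 365 days of the year; on Feb 29 the result
    # isn't a real date and this raises ValueError.
    return issued.replace(year=issued.year + 1)

def safer_valid_to(issued: date) -> date:
    # One of several reasonable fixes: fall back to Feb 28 when the
    # anniversary doesn't exist.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)

print(naive_valid_to(date(2012, 2, 28)))   # 2013-02-28
print(safer_valid_to(date(2012, 2, 29)))   # 2013-02-28
print(naive_valid_to(date(2012, 2, 29)))   # ValueError: day is out of range for month
```

In the real system that failure meant no transfer certificate, which meant the HA flagged the server as faulty, which is where the cascade from the intro quote begins.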

“Only Those Involved Will Know Why”

I do want to discuss one other part in more detail though:

By 6:38 PM the devs discovered and had a good laugh at the trivial leap day logic… By 6:55 PM the engineers disabled customer management of VMs… By 10 PM, they had a plan. By 11:20 PM they had the GA fix ready. Only those involved would know why preparing the fix to properly increment the date by a year took 5 hours. By 1:50 AM the next day, they finished testing.

Wait a second… At long last! My time to shine. I am those.

Why did it take 5 hours to get a patch?

I know it’s just funny ribbing, but it’s funny because it’s true. People often wonder, why does it take so much time and effort to make trivial changes to large systems in big companies? There are 2 reasons:

  • Too many humans
  • Quality controls / systems bureaucracy

I don’t know how many people worked on Azure at the time, but today there are thousands of engineers. Even the tightly coupled, narrowly focused interactions discussed here involve about 20 teams for the end-to-end functionality. That might sound overstaffed, but the scale of Azure is profound. I’ve seen teams where there’s 1 dev per ~375K instances of their service (as in they’re responsible for maintaining 99.99% uptime on each of that many servers), and that’s actually a massive responsibility reduction from years past! Apart from just finding what’s wrong, the number one source of dwell time on service incidents is just pulling in the next team’s on-call engineer.
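
To put that ratio in perspective, here’s some quick back-of-the-envelope math (using the ~375K figure above; the rest is just arithmetic, not any official SLA accounting):

```python
# What "99.99% uptime on each of ~375K servers" roughly implies.
MINUTES_PER_YEAR = 365.25 * 24 * 60

budget_per_server = (1 - 0.9999) * MINUTES_PER_YEAR   # ~52.6 minutes/year/server
servers = 375_000

# If every server used its full budget, spread evenly, you'd expect about
# this many servers to be down at any given instant:
expected_down_at_once = servers * (1 - 0.9999)         # ~37.5 servers

print(f"{budget_per_server:.1f} minutes of downtime budget per server per year")
print(f"~{expected_down_at_once:.0f} servers' worth of allowable downtime at any moment")
```

One engineer cannot babysit that by hand, which is exactly why automation like the service healing in this story exists in the first place.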

This is an underrated aspect of why good monitoring is so important. It isn’t just that you notice the fault quickly; it’s that monitoring good enough to do so also lets you localize the fault source quickly. In the middle of the night, each hop up the causal chain adds another 10-15 minutes to get the next engineer on the bridge and brought up to speed.
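
To make that concrete, here’s a toy sketch (invented names, not our actual tooling) of the difference between monitoring that detects a fault and monitoring that localizes it: a health report that names the failing dependency and who owns it means the first page can go to the right on-call instead of starting a walk up the causal chain.

```python
# Toy example: a health probe that localizes the fault, not just detects it.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Dependency:
    name: str
    owning_team: str           # who to page if this check fails
    check: Callable[[], bool]  # returns True when healthy

def health_report(deps: list[Dependency]) -> dict:
    failures = [d for d in deps if not d.check()]
    return {
        "healthy": not failures,
        # The payload that saves you 10-15 minutes per hop at 3 AM:
        "failing_dependencies": [
            {"name": d.name, "page": d.owning_team} for d in failures
        ],
    }

# Usage with dummy checks and made-up team names:
deps = [
    Dependency("cert-issuance", "pki-oncall", lambda: False),
    Dependency("fabric-heartbeat", "fabric-oncall", lambda: True),
]
print(health_report(deps))
# {'healthy': False, 'failing_dependencies': [{'name': 'cert-issuance', 'page': 'pki-oncall'}]}
```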

Part of the delay can be as simple as this: discovering “oh, this cert is malformed” doesn’t mean you know who made the cert. It doesn’t mean you know exactly what line of code is broken. Even if you do, it still takes time to get a fix in. Knowing the internal structure, I can promise that the folks responsible for making the fix weren’t the ones who initially discovered the bug.
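
As a small illustration of that gap, the “what’s wrong” step is often quick. A sketch using Python’s cryptography package (not whatever tooling was actually on the bridge) can dump a suspicious cert’s validity window in seconds, but nothing in its output tells you which component minted the cert or which team owns that code:

```python
# Sketch of the "what's wrong" step vs. the "who owns this" step.
# Assumes the third-party `cryptography` package is installed.
from cryptography import x509

def describe(pem_bytes: bytes) -> None:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    print("subject:   ", cert.subject.rfc4514_string())
    print("issuer:    ", cert.issuer.rfc4514_string())
    print("not before:", cert.not_valid_before)
    print("not after: ", cert.not_valid_after)

# ...but nothing in that output maps to "the team whose code produced this
# cert" or "the line that computed not_valid_after". That mapping lives in
# people's heads and internal docs, which is where the hours go.
```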

The need for constraints

Every change risks creating an outage. Fixes made during an incident are at even higher risk: you’re doing them quickly, with limited supervision or review, under pressure, without proper rest. The irony is that, as humorous as it is that it took 5 hours to get this change ready, the constraints in place still weren’t good enough and the change was still made too brashly! Remember, the change caused yet more failures with networking.

In much of the HA area, PR validation builds can easily take an hour or more. Then there’s initial production testing before it’s released to customers, then the release itself, and then that release rolls out slowly in stages in case it has issues of its own. Microsoft chose to “blast” the nodes, meaning this last part was largely skipped. Trying to go faster actually made things go slower on those specific clusters, because more faults were introduced by rapidly shipping out that broken patch (though at the fleet level it was still faster overall, which is why this feature exists).
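
For a feel of what “slowly in stages” versus “blast” means, here’s a toy rollout loop with invented stage sizes and thresholds (not Azure’s actual deployment system):

```python
# Toy staged-rollout loop (invented numbers; not the real deployment system).
# A normal rollout expands in waves and halts when the new build misbehaves;
# "blasting" is effectively a single 100% stage, trading that safety net for speed.
import random

def observed_error_rate() -> float:
    # Stand-in for real telemetry; pretend the patch is slightly broken.
    return 0.02 + random.uniform(0.0, 0.01)

def rollout(stages=(0.01, 0.05, 0.25, 1.0), max_error_rate=0.01) -> None:
    for fraction in stages:
        rate = observed_error_rate()
        print(f"deployed to {fraction:>4.0%}: error rate {rate:.3f}")
        if rate > max_error_rate:
            print(f"halting: roughly {fraction:.0%} of the fleet has the bad patch")
            return
    print("rollout complete")

rollout()               # the bad patch is caught while it covers ~1% of the fleet
rollout(stages=(1.0,))  # "blast": the same bad patch covers everything before anyone reacts
```

The staged version catches the bad patch while it covers a sliver of the fleet; blasting trades that safety net for speed, which is the trade-off that bit them here.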

Another consideration is that mitigation (preventing the spread) was the first focus, and the fix was going to involve updates to other software that is harder to change. This gave the GA side some breathing room. If it’s not going to be shippable right away anyway, it’s better not to rush.

Now to be clear, the GA is a very different component. The fix there was a very simple change, and the GA has fewer validation stages that all run much faster; its PR builds, for instance, don’t take over an hour. But when we design systems and policies, they can’t be predicated on “but this change is simple”. Mistakenly believing so is how you introduce errors, and thus the system must treat every change as a potential disaster, slowing things down.

Over the years, improvements in the on-call process and engineering systems mean this would all be faster today. But interestingly, the overall time to fix probably wouldn’t be much shorter. With the efficiency gains, yet more quality gates have been added. As the botched patch demonstrated, the gates weren’t good enough here. So if you’ve gotten the checks running faster, you’re probably better off using that freed time to run additional checks.

My One Correction

Kevin does get one part of the video wrong, in a way that’s understandable and common but reflects critical errors people frequently make when assessing the security properties of a system. I think that example is compelling enough to warrant a discussion, which you can read in next week’s post.

All rights reserved by the author.