The Azure HA Transport Cert
Getting the mechanics right can goad us into thinking we got the wider context and properties right
Recently, this video from Kevin Fang came out: How Bad Leap Day Math Took Down Microsoft. You can see my original review of it as well as my answer to a question he posed as a joke about why it took so long to get a basic code fix ready in last week’s post.
Before continuing, note that these are my opinions of the situation. These are not statements on behalf of Microsoft and they may or may not align with the organization’s positions. While I will make some factual statements on different security primitives here, what Azure considers secure and why should only be determined by reading the official documentation on the relevant topics.
My One Correction
Kevin does get one part of the video wrong. When explaining the transport certificate, he describes its function properly but its purpose incorrectly:
The HA is also responsible for delivering application secrets, such as api keys or database credentials to the VM. These secrets are encrypted in case the goblins in the middle have infiltrated the network. In transit encryption is done through the usual public key cryptography. On initialization, the GA generates a transfer certificate which contains the public key and sends it to the HA. This allows the HA to encrypt application secrets with this public key and send them to the GA which can then decrypt it using its private key.
This is not why that certificate is there and not what it accomplishes. Consider this: IMDS delivers MSI tokens in Azure in plain text, as does the IMDS variant on all other cloud providers that offer an equivalent bearer token feature. Both Wireserver (where the transport certificate is sent, more on this later) and IMDS deliver secrets to the guest over unauthenticated HTTP. This is the tell that something else is at play.
What is Wireserver?
Before continuing, let’s set some baseline context. In Kevin Fang’s video, he mentions that the transport certificate is sent to the HA (host agent). I don’t think it was feasible for an outsider to get this part more granular, nor to get the purpose of the transport certificate correct unless they’re a security expert. We can get a little more specific than that with the benefit of some insider knowledge.
Nothing I’m disclosing here is insider information. This is all determinable by packet inspection as well as by reviewing the code of open source guest software like cloud-init and the Linux Guest Agent. As an insider, I merely have convenient pre-existing knowledge of these random topics in addition to my specialization in security and the security aspects of operating and improving these services.
For starters, HA isn’t a single thing. It’s a term for numerous services on the host node. Some of these services communicate with the guest VM over HTTP, the most well known being the Instance Metadata Service (IMDS). IMDS is common to all cloud providers, but Azure predates when AWS created the IMDS concept that took over as an industry pseudo-standard, so it has an additional similar service: Wireserver.
To say Wireserver is an obscure Azurism is putting it mildly: this tiny docs page is the entirety of the information we provide on it. This is going to change in the future because of a tented project I’m working on, but you’ll have to wait to learn more about that. For the purposes of this conversation, it’s fine to view Wireserver as an “additional IMDS” with slightly more restricted access that only handles endpoints that are useful for Azure to operate the platform, whereas endpoints useful for both 1st and 3rd parties go into IMDS. A security researcher once described Wireserver as “[seemingly] the backend portion of the guest agent”, which is a pretty accurate framing.
Why challenge that the certificate prevents man in the middle attacks?
Being correct is virtuous in its own right, but there’s more to this correction than a mere um, actually for accuracy’s sake alone. The mistake gets to the core of a common lapse people make when working with security topics. The assumption goes: if I’m using a common primitive with a valid, secure implementation, then I’m getting the benefits I associate with that primitive. But this isn’t necessarily the case.
I encounter this exact mistake on a monthly basis in my role reviewing security proposals. If you don’t have a strong and recent background in cybersecurity, it’s easy to forget where the security guarantees of a particular technology are actually coming from.
Security? No, magic numbers!
How does your computer know who you are? How does it know if it’s talking to the computer it thinks it is? It may be tempting to mention certificates, private keys, and the like. But that’s begging the question. At the end of the day, any time a computer is evaluating an identity it’s going to hit a branch. If true, continue. If false, reject the communication.
There’s no special instruction in your processor for “confirm their identity”! What even is an identity? How do I confirm one? When we establish an identity, or any other kind of trust in a computer, what we’re really doing is confirming that the other party knows some hard-to-guess magic number.
No, seriously! I defy you to find a single exception. No matter how sophisticated a scheme you employ, the final step is going to be to confirm if they hold access to some secret magic number.
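To make the "it all ends at a branch" point concrete, here is a minimal sketch in Python. The secret, the function names, and the return strings are all made up for illustration; the only real API used is the standard library's `hmac.compare_digest`, which compares two values without leaking where they differ.

```python
import hmac
import secrets

# Hypothetical secret agreed on ahead of time. 32 random bytes is,
# at the end of the day, just one very large number.
EXPECTED_SECRET = secrets.token_bytes(32)


def knows_magic_number(presented: bytes) -> bool:
    # Constant-time comparison, so the check itself does not hint at
    # how close a guess was.
    return hmac.compare_digest(presented, EXPECTED_SECRET)


def handle_request(presented: bytes) -> str:
    # This is the branch: if true, continue; if false, reject.
    if knows_magic_number(presented):
        return "continue"
    return "reject"
```

However elaborate the scheme layered on top, some equivalent of `handle_request` is where the decision finally gets made.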
Security Mechanism | Magic Number Evaluation |
---|---|
Passwords; Passphrases | Text maps to character codes, which in a string just form a really long number. |
PIN | Literally checks if you know the number. |
Encryption Key | A random number used to map data from encrypted to plain text or vice versa. The mapping (encryption) algorithm is constant and known. What determines access is merely if the right number is known for input to the algorithm. |
SSH Key | An SSH key consists of a public key and a private key. The public key is used to encrypt messages to the private key holder and to establish trust (i.e. adding a public key to the known hosts list). The private key is used to prove identity and to decrypt private messages sent to that identity. |
Certificate | A certificate is essentially just a more detailed SSH key. It can have public or public + private keys, but also contains additional metadata around purpose, identity, expiration. Certificates can also be “chained”, meaning a certificate can digitally sign a second certificate to endorse its validity. This is very useful for managing trust at scale as we’ll see later on. |
Smart card; YubiKey; etc | These are just hardware devices for safely storing certificates. |
Key Fob; Toll Tag (RFID; NFC; etc) | Older ones just broadcast a unique number. Newer ones have a private key for a signature like certificates to prevent replay attacks. |
Facial Recognition | The camera samples your face and generates a numeric representation. This is similar to a password; the mapping is just much more complicated and allows for “fuzzy” (partial) matching. This is necessary because your picture won’t always match 100%, and is generally safe because your photo data is so much longer than a password that it’s still sufficiently hard to guess the right number (meaning generate a signal input that yields an accepted mapping). |
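The password row is easy to see for yourself. This is a small illustrative sketch (the function name is mine): it reads the bytes of a string as one big-endian integer, which is all "text maps to character codes" really means.

```python
def text_to_number(text: str) -> int:
    # Encode the text to bytes, then read those bytes as a single
    # big-endian integer.
    return int.from_bytes(text.encode("utf-8"), "big")


print(text_to_number("A"))   # 65
print(text_to_number("AB"))  # 16706, i.e. 65 * 256 + 66
```

A longer passphrase simply yields a longer number, which is why length does so much for guess resistance.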
The transport certificate
This takes us back to the transport certificate. The GA generates a new, self-signed certificate. Self-signed is a critical distinction that doesn’t come up in the video, either by mistake or because Kevin was understandably going off the incident blog post rather than the actual code, where this element would be clear.
Recall in the table that we said certificates can be chained? When we say a certificate is “self signed”, we mean that we didn’t request some third party, a Certificate Authority (CA), to generate and/or endorse the certificate for us. There is no chain, this certificate just exists on its own.
What does an asymmetric key actually prove?
If you have a public key, you can confirm that a message was digitally signed by the corresponding private key. We tend to say that this in turn verifies the identity of the message sender, but that’s reliant on two further assumptions:
- The private key is actually secret and only known by the identity holder
- The public key is actually associated with that identity
The magic numbers here just confirm that the message was endorsed by a private key, and that the private key is indeed associated with the public key. You may be tempted to say, “Yes, but with SSH keys I entered the public key myself, so I know it’s associated with me.” But what about with certificates?
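Here is a toy sketch of what signature verification does and does not tell you. It uses textbook RSA with no padding and two pairs of well-known Mersenne primes, purely so the example runs on the standard library; none of this is secure, and the names (`alice`, `mallory`, the helper functions) are hypothetical.

```python
import hashlib


def make_toy_keypair(p: int, q: int, e: int = 65537):
    # Textbook RSA key generation. Illustration only, NOT secure.
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse (Python 3.8+)
    return (n, e), (n, d)  # (public key, private key)


def digest(message: bytes, n: int) -> int:
    # Reduce the hash into the range the toy key can handle.
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes, private_key) -> int:
    n, d = private_key
    return pow(digest(message, n), d, n)


def verify(message: bytes, signature: int, public_key) -> bool:
    n, e = public_key
    return pow(signature, e, n) == digest(message, n)


alice_pub, alice_priv = make_toy_keypair(2**61 - 1, 2**89 - 1)
mallory_pub, mallory_priv = make_toy_keypair(2**107 - 1, 2**127 - 1)
message = b"please send me the VM's secrets"

# The math passes for whoever holds a matching private key...
print(verify(message, sign(message, alice_priv), alice_pub))      # True
print(verify(message, sign(message, mallory_priv), mallory_pub))  # True
# ...and fails only when the keys don't correspond.
print(verify(message, sign(message, mallory_priv), alice_pub))    # False
```

Nothing in `verify` knows who Alice or Mallory are. Which public key you check against, and why you believe it belongs to a given identity, is decided somewhere outside the math.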
SSH keys work well because you’re just establishing trust between two entities that you own, and you have some other secure channel by which to configure them. You can copy by hand, send a text to yourself, or enter a key into the git host you’re trying to authenticate with while already logged in through the browser.
But what if there is no channel? When you need to log in to your bank, how do you know your bank’s public key? The cert chain, of course. We don’t use SSH keys for HTTPS; we use certificates. The certificate is endorsed by a CA, problem solved.
Except… The CA endorses the cert by signing it. Meaning the endorsement was done using another certificate and that endorsing certificate’s public key is included in the endorsement. This cert is endorsed by another cert, and another and another…
How do trust chains ever verify identities then?
If this seems infinitely recursive, it’s because it is. Remember how we said certificates can be self signed, meaning arbitrary? Well, we can use one of these as our “root of trust”, our base case to break the infinite recursion. When a self signed cert is used as the base of a trust chain, we give it a special name: Root Certificate. Your operating system and key programs like web browsers have a set of hard coded public key only copies of root certificates that they trust.
Everything else that your computer accepts while communicating securely over the internet is accepted because if you walk the certificate chain back far enough, it will end in one of these roots. Mainstream trust stores typically hold on the order of a hundred or more of them, and they are changed very infrequently via software updates.
As you can imagine, the world would grind to a halt if any of the corresponding private keys were stolen. This is why we don’t use these keys directly for much, and they’re stored in specialized hardware. These certificates endorse certificates that endorse certificates. The farther down the chain you get, the less critical it is to keep the secret, well, secret, and the more easily it can be revoked and replaced.
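The recursion and its base case can be sketched in a few lines. This is a deliberately stripped-down model: certificates are just subject and issuer names, signatures are not checked at all, and every name below is invented. Real path validation does far more (signature checks, expiry, revocation, constraints).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str  # a cert is self signed when issuer == subject


def chain_is_trusted(leaf: Cert, certs_by_subject: dict,
                     trusted_roots: set, max_depth: int = 10) -> bool:
    current = leaf
    for _ in range(max_depth):
        if current.issuer == current.subject:
            # Base case: a self signed cert only counts if it is
            # one of the roots we shipped with.
            return current.subject in trusted_roots
        parent = certs_by_subject.get(current.issuer)
        if parent is None:
            return False  # the chain is broken, nothing to walk to
        current = parent
    return False  # too deep, treat as untrusted


root = Cert("Example Root", "Example Root")
intermediate = Cert("Example Intermediate", "Example Root")
bank = Cert("bank.example", "Example Intermediate")
lonely = Cert("transport-cert", "transport-cert")  # self signed, no chain

known = {c.subject: c for c in (root, intermediate)}
roots = {"Example Root"}

print(chain_is_trusted(bank, known, roots))    # True
print(chain_is_trusted(lonely, known, roots))  # False
```

The second result is the interesting one for this post: a standalone self signed certificate hits the base case immediately, and nothing vouches for it.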
What the transport certificate actually does
I can’t say why it’s there and why it’s inconsistent with the other secrets delivered by the same means, as this code was written well before my time. Even if I knew, that information very easily could fall under NDA. From a purely analytical perspective though we can still determine what it accomplishes:
By transporting the secrets on the “network” in cipher text, it ensures that any unintentional capture of the network data by packet capture, logging, etc doesn’t cause copies of the secrets to float around insecurely.
That’s it. It doesn’t prove identity, because it’s an arbitrary self signed certificate. Wireserver / HA has no way of knowing, from the certificate, if that request actually came from the VM in question and not some malicious actor trying to steal the VM’s secrets. A man in the middle attacker could just generate their own certificate and Wireserver would dutifully accept it.
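As a sketch of that last point, suppose a host-side function encrypts a secret to whatever public key it is handed. This is not Wireserver's actual protocol or code; it is toy textbook RSA with hypothetical names, just to show that encryption to an unverified key protects the bytes in transit without saying anything about who receives them.

```python
def make_toy_keypair(p: int, q: int, e: int = 65537):
    # Textbook RSA, illustration only, NOT secure.
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    return (n, e), (n, d)  # (public key, private key)


def host_encrypt_secret(secret: int, presented_public_key) -> int:
    # Hypothetical host behavior: no identity check on the key,
    # it simply encrypts to whatever it was given.
    n, e = presented_public_key
    return pow(secret, e, n)


def decrypt(ciphertext: int, private_key) -> int:
    n, d = private_key
    return pow(ciphertext, d, n)


secret = int.from_bytes(b"db-password", "big")

# An attacker generates their own key pair and presents the public half.
attacker_pub, attacker_priv = make_toy_keypair(2**61 - 1, 2**89 - 1)
ciphertext = host_encrypt_secret(secret, attacker_pub)

recovered = decrypt(ciphertext, attacker_priv).to_bytes(11, "big")
print(recovered)  # b'db-password'
```

The ciphertext is unreadable to anyone logging the traffic, which is the benefit described above. Whether the key belonged to the right party has to be established by something else.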
Why is it safe then?
The certificate isn’t ensuring the request came from the right VM, and not all secrets from the host are sent as ciphertext. From an HTTP security perspective, these are unauthenticated / anonymous APIs. As covered in the docs, traffic to these services never leaves the host. When your VM sends traffic to these special, unroutable IP addresses, the virtualization platform intercepts the traffic and reroutes it to the proper HA services.
In other words, the host is ironically doing a man in the middle attack! You send a packet to the magic IP address. The networking stack sees this and tampers with your packets, changing the destination IP + port to the webserver hosting that service. You’re none the wiser, because this is just plain old HTTP.
In this sense, the request never really hits the network in the traditional sense. It’s a point to point connection between your VM and its host. If there are no “network” nodes for the messages to pass through, then there’s no opportunity for a malicious man in the middle attack. This is also why you don’t have to prove your identity. By virtue of using the VM’s network interface, the host knows which VM the request came from because the host’s hypervisor is managing the virtual network interfaces. This is how it can figure out which secrets to respond with and how it ensures secrets are only visible to the corresponding VM.
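A rough model of the two ideas above, rewriting the destination and inferring the caller from the virtual NIC, might look like this. The two magic IPs are the publicly documented Wireserver and IMDS addresses; everything else (the backend address, NIC names, VM names, secrets, and the functions themselves) is invented for illustration and is not how the real host networking stack is implemented.

```python
from dataclasses import dataclass, replace

MAGIC_IPS = {"168.63.129.16", "169.254.169.254"}  # Wireserver, IMDS
BACKEND_IP, BACKEND_PORT = "10.0.0.1", 8080       # hypothetical host service

# Hypothetical bookkeeping: the hypervisor created these virtual NICs,
# so it already knows which VM sits behind each one.
VM_BY_NIC = {"vnic-1": "vm-a", "vnic-2": "vm-b"}
SECRETS_BY_VM = {"vm-a": "secret-for-a", "vm-b": "secret-for-b"}


@dataclass(frozen=True)
class Packet:
    ingress_nic: str  # recorded by the host, not supplied by the guest
    dst_ip: str
    dst_port: int


def reroute(packet: Packet) -> Packet:
    # The "benign man in the middle": rewrite the destination so the
    # packet lands on the host service instead of the network.
    if packet.dst_ip in MAGIC_IPS:
        return replace(packet, dst_ip=BACKEND_IP, dst_port=BACKEND_PORT)
    return packet


def secrets_for(packet: Packet) -> str:
    # No credentials in the request; identity comes from the interface.
    return SECRETS_BY_VM[VM_BY_NIC[packet.ingress_nic]]


request = Packet(ingress_nic="vnic-2", dst_ip="168.63.129.16", dst_port=80)
print(reroute(request).dst_ip)  # 10.0.0.1
print(secrets_for(request))     # secret-for-b
```

The guest never supplies `ingress_nic`, so there is nothing for it to forge. That is the implicit identity the VM / host relationship provides.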
This is also why IMDS doesn’t use HTTPS. Encrypting the traffic over the network isn’t accomplishing anything because it’s not going over the network. HTTPS also requires certificate chains, which means we need root + intermediate certs and/or for clients to opt-in to accepting self signed certs from the host. This is additional complexity for users that can be quite challenging on VMs that have locked down networking and thus can’t fetch certificate chains on the fly.
The transport certificate mechanism is a middle ground. We aren’t trying to prove identity, because the identities here are implicit in the VM / Host relationship. So self-signed will do. It’s not a burden, because the only consumer is other Microsoft code. While there’s no in transit confidentiality to achieve, we are still getting defense in depth against unintentional capture of secrets in logs and other recordings, so it makes sense to do.
Even then, it’s not quite free. After all, even the “simple” approach of using self signed certificates still managed to cause one of the worst outages in Azure’s history one fateful leap day 😉