Rust Never Finds Safety Issues

The Rust compiler is proving a negative

Rust’s claim to fame is that it allows low level access while still guaranteeing memory safety. In other words, the compiler is proving the non-existence of specific classes of bugs in your program.

This is a monumental departure from how you’re used to interacting with your dev tooling. In other languages your compiler and linter are making affirmative claims about your code. “You tried to assign an int to a string”. “You used this deprecated function”. These tools aren’t making exhaustive claims about your code, they have a set of rules that they check against. When the build passes, the code isn’t free of issues, it’s free of the issues in the list the tools evaluated against.

When learning Rust, it’s tempting to carry this baggage over. When the compiler says something is unsafe, it’s commonly interpreted as “the compiler found a safety problem in the code”. This isn’t accurate. Being able to verify the absence of a class of problems does not mean a tool is capable of detecting those same problems when they’re present. In fact, the Rust compiler is one such example.

Proving a Negative

It’s a common objection in debates that someone is trying to get you to “prove a negative”. Rather than making affirmative claims of their own, your interlocutor is trying to shift the burden of proof onto you. Instead of justifying why we should think aliens have visited Earth, they goad you into trying to prove that aliens haven’t visited Earth. This is a bad approach to most subjects, because most claims are far too complicated to prove outright.

We have reasons to believe things, maybe even overwhelmingly strong reasons to believe them, but when the bar is that we must prove that something is the case, well, most problems are just far too complicated to clear it. There’s also an obvious double standard here. If you can’t prove something, that doesn’t lend any credence to the opposing idea, particularly when there’s even less evidence in favor of it! This is why the burden of proof is on the person raising the argument, and why claims need to be affirmative to have any value. As the maxim goes:

That which can be asserted without evidence can be rejected without evidence.

That isn’t to say it’s impossible to prove a negative. In fact, it’s very possible. To do so we need to be able to describe our claim very precisely, and only in terms of other claims that have been separately proven or are assumed to be true. Without getting too into the weeds on philosophy, everything works this way. When we say that “2 + 2 on my calculator will always be 4”, it’s predicated on the assumptions that the circuitry was manufactured with sufficient precision that our model of its behavior applies, and that the environment won’t disrupt it in some way (say, I bring a giant magnet over).

To do this with complicated tasks and arbitrary inputs, we need to add strong constraints to limit the problem scope. In memory-managed languages, safety is achieved by assuming the implementation of the garbage collector doesn’t have issues and by constraining the programmer: they are never allowed to work with memory directly. With these factors, we can affirm the antecedent to prove the negative:

  1. If every direct manipulation of memory is done safely, there are no memory safety issues
  2. The garbage collector is the only thing that directly manipulates memory, and it does so safely
  3. Therefore, there are no memory safety issues

In Rust, we assume the compiler is implemented correctly, but what constraints allow us to keep direct memory access without losing safety?

The Borrow Checker and Mutability

The compiler has a few basic rules for memory safety that it enforces:

  1. You can hold at most one mutable reference to data at a time (and no shared references while that mutable reference is live)
  2. Data can only be sent across threads if it is always safe to do so
  3. References can only be held across threads if it is always safe to do so
  4. Any code whose safety hasn’t been verified by the compiler must be marked unsafe

The borrow checker is the part of the compiler that validates these rules at compile time. If the borrow checker confirms you’ve followed these rules, it blesses your code as “memory safe”. If you violate any of them in any way for any reason, it will issue a compilation failure listing which rule(s) you’ve violated.

When we think about what it means to write safe code, we mean that it’s free of memory exploits, memory bugs, and race conditions¹. Don’t take the nomenclature too literally. The borrow checker isn’t actually determining whether your code is safe or unsafe, it’s only determining whether these rules were followed.

In fact, these rules are pretty obviously far more restrictive than needed to write safe code using our definition. Why is this?

Intractable Problems

Writing a compiler that ensures there are no memory safety issues while giving the programmer unfettered access to the underlying memory is an intractable problem. We need to verify all possible inputs, on all supported architectures, for all possible runtime orderings, for all possible programs. Without additional constraints, we’d be forced to run the program in its entirety, and if it’s multi-threaded or otherwise interacts with the external world, we’re even more screwed.

Setting aside the nondeterminism of the latter scenario, even the simpler scenario is itself intractable. If the problem sounds familiar, it’s because it’s related to “The Halting Problem”. Nearly a century ago it was conclusively proven that no general solution exists: we can’t write a program on a Turing machine (which your computer is equivalent to for this purpose) that can determine whether an arbitrary program will halt in a finite period of time.

The rules Rust picks seem reasonable, maybe even obvious at first glance. These are the things you need to do to write safe code. If multiple threads can mutate data, how is it going to be safe? In actuality these constraints are somewhat arbitrary. There’s no “right” set of rules. What we’re seeking to balance is the degree of constraint placed on the developer against the complexity of the task the borrow checker has to perform.

So why does Rust use these rules? Because we need to constrain the problem, and these rules allow the borrow checker to run in a reasonable amount of time over any possible Rust program while also being (relatively) straightforward to explain to the programmer.

Detecting Issues vs Certifying Issue Free

The borrow checker is incapable of detecting if your code has memory safety bugs. That’s not its goal. The rules it verifies aren’t entirely arbitrary though, they have an important property. It is impossible for code that follows these rules to ever have a memory safety bug. So, if we can confirm your code complies with these rules we have indirectly proved your code is free of memory safety bugs. This can be proved mathematically, but let’s take it for granted here.

But what if we figure out that your code doesn’t follow these rules? Does that mean we’ve found a memory safety bug? No, nothing of the sort. We’ve merely confirmed the rules weren’t followed. Maybe the code is safe, and maybe it isn’t. We can’t tell, so to be safe, a compilation error is raised.

This is extremely critical to wrap your head around, lest the compiler break your mind when you write code that really seems to be safe, and may even actually be safe, especially when working with generics. In practice, much of what the compiler flags will be a genuine safety issue, but it also frequently isn’t one. Failing to recognize that makes these errors very difficult to resolve, because you can tunnel vision on figuring out “what the unsafe / bugged code is” when what you actually need to do is figure out what missing information prevented the compiler from verifying that your code is safe. When faced with one of these errors, your first step is to figure out which of these two cases you’re in.

The Borrow Checker & Thread Safety

Let’s get a common, fundamental misunderstanding out of the way about what Rust does and doesn’t prevent. The borrow checker does nothing to verify your code is free of race conditions. Yeah, seriously.

It’s correct that the borrow checker doesn’t allow you to hold multiple mutable references to the same data, but that analysis has nothing to do with threading or race conditions. Even with just a single thread, the borrow checker isn’t going to let me do this:

let mut x = 5;
let x_ref = &mut x;
let second_x_ref = &mut x;

*x_ref = 6;
error[E0499]: cannot borrow `x` as mutable more than once at a time
 --> src/main.rs:4:24
  |
3 |     let x_ref = &mut x;
  |                 ------ first mutable borrow occurs here
4 |     let second_x_ref = &mut x;
  |                        ^^^^^^ second mutable borrow occurs here
5 |
6 |     *x_ref = 6;
  |     ---------- first borrow later used here

All the borrow checker is doing here is ensuring the guaranteed integrity of memory by requiring adherence to its rules. When Send + Sync come into the picture, the same applies! The borrow checker isn’t considering thread ordering and race conditions, it’s considering memory safety. Later on you will see an example that has no race conditions, yet the compiler will still be unhappy with it for this very reason.

If you’re still skeptical, look no further than the official documentation, which states that safe Rust guarantees the absence of data races (unsynchronized, conflicting access to the same memory) but cannot prevent general race conditions, which remain possible as a class of issue. This is in stark contrast to memory safety issues, which it eliminates. I won’t bother going into examples because the docs already cover that well.

The borrow checker ensures that memory is handled safely between threads, but it doesn’t prevent race conditions.

Send + Sync

If all you want is a description of Send and Sync, well the book already covers that. The borrow checker ensures that memory is safely handled within threads and between them. It does this with, you guessed it, more restrictions on what the programmer is allowed to do that ensure the borrow checker can validate safety in a reasonable amount of time with reasonable language ergonomics.

Rust has two marker traits² for ensuring safety across threads:

  • Send indicates the type is safe to send (move; transfer ownership) to another thread.
  • Sync indicates the type is safe to share between threads, meaning multiple threads can safely hold simultaneous references to it. In more formal terms, a given type T is Sync if and only if &T is Send.

Note we’ve said nothing about race conditions here. There’s no notion of ordering or access rules. That’s because the borrow checker isn’t concerned with that topic. It’s merely ensuring safe stewardship of memory ownership.

Putting It Into Practice

When you read the docs, this can seem pretty straightforward. In my experience, the first real collision with a complete understanding of the topic is commonly this aspect of Arc:

Unlike Rc, Arc<T> uses atomic operations for its reference counting. This means that it is thread-safe.

This is, in my opinion, really unfortunate wording. It’s common wording in language documentation though; you’ll see the same thing in the docs for C++’s std::shared_ptr. Precise language matters, and while this is technically a reading comprehension skill issue, the misreading is eminently common. The “it” in the last sentence of the excerpt refers to the Arc, to the smart pointer. It’s not referring to the data being pointed to.

In addition to being technically correct💺, the Rust docs do clarify this distinction, as long as you keep reading:

Arc<T> will implement Send and Sync as long as the T implements Send and Sync. Why can’t you put a non-thread-safe type T in an Arc<T> to make it thread-safe? This may be a bit counter-intuitive at first: after all, isn’t the point of Arc<T> thread safety? The key is this: Arc<T> makes it thread safe to have multiple ownership of the same data, but it doesn’t add thread safety to its data.

Which goes to show why you should, you know, read through to the end. Though I will defend your eyes glazing over, because the example there is unnecessarily complicated. It’s not that it shouldn’t be there, but there should be a simpler one first. A new Rust programmer is going to encounter Arc, Mutex, Send, and Sync well before they ever consider leveraging a RefCell.

Tying It All Together

As a reminder, thread safety here means “safe to use across threads”, not “free of race conditions”!

Arc itself is always thread safe, because it’s just a pointer and an atomic counter. The data it points to may or may not be thread safe, and Arc applies that property transitively: if the underlying type is Send and Sync, then so is the Arc. Remember, the borrow checker doesn’t flag actual memory safety issues, it flags violations of its rules.

Let’s consider two different kinds of shared smart pointers in Rust:

  • Rc - A reference counted pointer. It consists of a pointer and an integer used for reference counting. It is not thread safe, even if the underlying data is, because while the pointer is safe (it’s held constant between increases and decreases to the reference count), the reference counter isn’t. Two threads might give up a reference at the same time and could update the integer in conflicting ways.
  • Arc - A thread-safe reference counted pointer. It consists of a pointer and an atomic integer used for reference counting. This means that if multiple threads need to change the reference count, it will happen in a controlled and thus safe manner.

Rc is marked as neither Send nor Sync, and as discussed before, Arc is Send and/or Sync if and only if the underlying type is. What’s interesting here is that an Rc can sometimes be safe to send, but to keep its semantics simple it’s explicitly marked !Send (meaning that even if the compiler could prove a particular use is safe, it will act as if it isn’t). This can lead to situations where code is perfectly safe yet still rejected.

let safe_value = Rc::new(5);
let job = (safe_value.clone(), safe_value);

std::thread::spawn(move || {
    let (pointless_copy, value) = job;
    std::mem::drop(pointless_copy);
    println!("Value: {}", value)
});

There is nothing unsafe about this code. We have two copies of a reference counted pointer. When the copy is created, it’s on the same thread, so Rc can safely increment the counter. Both copies get moved to our new thread. In that thread, one is dropped. But since all references are still on the same thread, the counter can be safely decremented without any additional synchronization. Despite that, the compiler complains that it’s unsafe to send the smart pointer across threads:

error[E0277]: `Rc<i32>` cannot be sent between threads safely
   --> src/main.rs:7:24
    |
7   |       std::thread::spawn(move || {
    |       ------------------ ^------
    |       |                  |
    |  _____|__________________within this `{closure@src/main.rs:7:24: 7:31}`
    | |     |
    | |     required by a bound introduced by this call
8   | |         let (pointless_copy, value) = job;
9   | |         std::mem::drop(pointless_copy);
10  | |         println!("Value: {}", value)
11  | |     });
    | |_____^ `Rc<i32>` cannot be sent between threads safely
    |
    = help: within `{closure@src/main.rs:7:24: 7:31}`, the trait `Send` is not implemented for `Rc<i32>`, which is required by `{closure@src/main.rs:7:24: 7:31}: Send`

This is obviously a contrived example for the sake of brevity. However, it’s perfectly straightforward to imagine a scenario where you have multiple owners of the memory, all within the same thread, that are themselves encapsulated in some single entity: for instance, a work job in MapReduce. Using Arc there would be superfluous, because the reference count only ever changes within the “job” context, and the job is only ever on a single thread at a time. We could safely move the job over to some other thread for continued processing.

It’s easy to see how quickly this becomes intractable for a compiler to evaluate though. Instead we take a shortcut: if you use Rc, we just assume it’s unsafe to send. Verifying otherwise is too complex for the compiler. It’s also really hard for a human to reason about, and it risks bugs for essentially zero gain. In fact, the gain is so minimal that it’s pretty rare to even see Rc used. Even when it would suffice, the headache of later needing to send it across threads usually isn’t worth the hassle, and you’re better off taking the version that will always work at the expense of truly minor overhead.

In the off chance it really is worthwhile though, Rust offers an escape hatch: the unsafe system. The beauty of this system is that Rust picks a simplified regime that meets most needs, but you’re never forced to comply with it. So long as you understand the risks, the compiler will let you suppress this violation.

A More Complicated Example

The following exchange is what motivated this blog post. We won’t discuss anything new here, but let’s consider how what we’ve learned plays out when generics are used, as this is a common problem to get stuck on. A colleague was having trouble with this code snippet:

#[derive(Clone)]
pub struct InnerInstanceManagerServer {
    active_inventory: Arc<Mutex<Box<dyn Inventory<Uuid, Container>>>>,
    instance_specs: Arc<Mutex<HashMapInventory<Uuid, InstanceSpec>>>,
}

Which when consumed elsewhere produced:

error[E0277]: `(dyn Inventory<Uuid, Container> + 'static)` cannot be sent between threads safely
   --> src\instance_manager\server.rs:41:26
    |
41  | impl InstanceManager for InnerInstanceManagerServer {
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^ `(dyn Inventory<Uuid, Container> + 'static)` cannot be sent between threads safely
    |

Which led to the exchange in a group chat:

P1: I know that inventory is not thread safe, thats why by put it behind a Arc<Mut<, why is this complaining about this inner dyn type? I don’t have this problem if I use the concrete type, but when I hide it behind a box or generic I run into this nonsense.

… P2: Arc is only Send + Sync if T is Send + Sync and Mutex is only Send + Sync if T is Send ...

P1: The reason those bounds get added is to say that the type inside the box must be send and sync in order to send the box across. But I’m not trying to send the box directly across, I have it hidden behind the Arc<Mutex<Box dyn>> so it should pick up that the arc and mutex satisfy the guarantee that there's only one accessor. It's like it's unpacking all the way down to the dyn type. I guess it's turtles all the way down.

The end is well put: it absolutely is turtles all the way down. The problem they ran into isn’t really a Send / Sync issue though, it’s an issue with generics. When something works with a concrete type but doesn’t as a generic or dyn, that’s an indication your trait bounds are incomplete. Remember, the Rust compiler does a lot of type inference. It will do its best to determine types, including any constraints a type satisfies.

In the case of the Inventory trait, it just so happened that every implementor was also Send + Sync. During type inference then, a type ConcreteInventory was really ConcreteInventory + Inventory + Send + Sync. When they went to use it, the necessary traits were satisfied. However, once they switched to depending on the trait, they only depended on Inventory. Even though every current Inventory implementation is Send + Sync, the compiler can’t depend on those concrete details. It can only work with what the trait offers, because nothing stops someone from implementing Inventory in a way that isn’t Send + Sync.

This type of issue comes up a lot in Rust, in a way that is pretty distinct from other languages. Every language has the concept of “I can only do what’s provably true”, but Rust takes it further since there are more rules related to safety. As a result you will often encounter situations where the compiler is just being very literal with you. It isn’t saying your code is unsafe, it isn’t saying the concrete types are unsafe; it’s saying the dependency you told it to take can’t be guaranteed to always be safe. The logic for determining when things are Send or Sync and the logic for generics are the same, this example just happens to invoke both simultaneously.

There wasn’t anything wrong with their code as compiled, but their abstractions were wrong. Allowing that to go unchecked would be misleading and would set a trap for future work to fall into. As a result, it gets flagged and rejected. In this case, the fix is to be more explicit. Either:

  1. All usages go from dyn Inventory to dyn Inventory + Send + Sync
  2. If the type should always offer these guarantees, they can be mandated with supertraits: trait Inventory { ... } becomes trait Inventory: Send + Sync { ... }
  1. By race condition, we mean non-deterministic behavior that can result in incorrect output or behavior. This is distinct from purely deterministic behavior. For instance, if we want to download 5 files asynchronously, the order they finish in is non-deterministic. This isn’t a problem, because we just wanted to download the files and that’s what we’ve done. Compare this to an integration test that starts a server and then sends a request. That is a race condition: without coordination we can’t ensure the server has actually finished starting up before we send the request. We would end up with a flaky test that may or may not pass each time it is executed.

  2. “Marker traits” are traits used to mark that a type has a specific property. These can exist as annotations alone, or they can be special annotations baked into the compiler that allow it to make certain assumptions about the type. In practice these are often “auto traits”, meaning types automatically implement them when appropriate (and detectable). As with all other usages of unsafe, the programmer can make an unsafe declaration to mark a type as having the properties the marker indicates. This is used to let the compiler know about properties of the type that it isn’t capable of verifying on its own. See std::marker for a complete list of the currently supported traits.

All rights reserved by the author.