StATiC DiSpATcH
Developer time is worth far more than computer time.
If you’re a typical Rust fan, you may feel very personally attacked today.
When you join a Rust code base for the first time at work, you are likely joining a group of people who are also new but have been working with the language slightly longer. If you do, you’ll invariably hit this PR comment the first time you use a Box<dyn T>:
A primer on dispatch methods
In our context dispatch is the process of invoking a function. Specifically, it’s how the compiler generates the code for moving control over to the function. Let’s consider the steps to invoke a function:
- Push any arguments onto the stack
- Determine where to jump
- Jump (meaning, set the instruction pointer to the start of the function). Normally, the CPU just executes the next instruction, so this value needs to be overridden. You can think of it as a goto.
Step 2 is what we mean by dispatch. When a function gets compiled, you can think of the output as an array of CPU instructions. The array is included in the executable at some well known location that never changes.
How does the computer know where the function is? It can know at compile-time or run-time.
Dynamic Dispatch
A best practice when writing code is to depend on abstractions, not concretions. In most languages, this is accomplished using an interface. This is an abstract principle that appears by many names: contract, interface, signature, API, spec. It appears across industries too. When you create an electrical device, you design it against an interface. In North America, it’s the standard 3 prong 120V AC outlet. By designing against the shape of what you interact with instead of the specific thing, your assembly is modular. Your laptop charger works just as well off wall power as it does off a portable battery or a gas generator.
But software is different. Software is trivial to change. The reason we use abstractions is human factors: it’s easier for humans to coordinate and to design things properly this way. This isn’t always a concern for a compiler though. So even though every function in your code has an interface, the compiler may choose to strip these away.
Sometimes though, the compiler doesn’t know what needs to be called until run-time. Consider a dictionary of animals, Dict<String, Animal>, where Animal is an interface. When a specific animal is fetched, the function to invoke depends on what the corresponding concrete implementation of the interface is. This means that to allow for polymorphism, the program needs to inspect the concrete “type” and then jump accordingly.
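To make that concrete, here’s a minimal Rust sketch of the same idea (the Animal trait and the concrete types here are illustrative, not code from later in this post):

use std::collections::HashMap;

trait Animal {
    fn speak(&self) -> String;
}

struct Dog;
struct Cat;

impl Animal for Dog {
    fn speak(&self) -> String {
        "woof".to_string()
    }
}

impl Animal for Cat {
    fn speak(&self) -> String {
        "meow".to_string()
    }
}

fn main() {
    // The map stores trait objects; which concrete type sits behind each entry
    // is only known at run-time, so each call to speak() must be dispatched then.
    let mut animals: HashMap<String, Box<dyn Animal>> = HashMap::new();
    animals.insert("rex".to_string(), Box::new(Dog));
    animals.insert("whiskers".to_string(), Box::new(Cat));

    if let Some(animal) = animals.get("rex") {
        println!("{}", animal.speak());
    }
}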
There are various strategies to accomplish this, but the one Rust uses is “wide pointers”. At a high level, a second pointer is introduced along with a vtable. The vtable is an array of metadata about each implementation of the trait and its function locations. With dynamic dispatch, the computer first jumps to the vtable, finds the information it needs, then jumps again to the actual function.
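You can see the wide pointer directly. This little check (my own illustration, not part of the benchmark code below) shows that a reference to a trait object is two pointers wide: the data pointer plus the vtable pointer.

trait Speak {
    fn speak(&self) -> &'static str;
}

struct Dog;

impl Speak for Dog {
    fn speak(&self) -> &'static str {
        "woof"
    }
}

fn main() {
    // A reference to a concrete type is a single pointer.
    assert_eq!(std::mem::size_of::<&Dog>(), std::mem::size_of::<usize>());
    // A reference to a trait object carries a second, hidden vtable pointer.
    assert_eq!(
        std::mem::size_of::<&dyn Speak>(),
        2 * std::mem::size_of::<usize>()
    );

    // This call goes through the vtable at run-time.
    let animal: &dyn Speak = &Dog;
    println!("{}", animal.speak());
}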
The upside is that this creates runtime flexibility and enables powerful concepts like polymorphism and type erasure. The downside is that it introduces overhead and can preclude the compiler from doing additional optimization.
It’s worth noting that dynamic dispatch is the default in most languages most of the time when you work with interfaces, and the more dynamic / managed / scripted the language, the more likely other things are to be dynamically dispatched as well.
Static Dispatch
If the compiler can know precisely what function is being called, then it doesn’t need any kind of index check or lookup to find the function; the decision is made at compile time. This means there is no step 2: the compiler just hard-codes the memory address into the jump instruction.
This is, all else equal, faster. We get the fastest case every time, and the fastest case has at most one jump instead of the minimum of two in dynamic regimes. As we’ll see later though, the difference in this respect is almost always irrelevant. That aside, the key element is that because everything is known at compile time, the compiler is able to make an optimization. It can make other, larger optimizations too.
Function inlining is the practice where the compiler will take the body of a function and “paste” it directly into the body of the caller. This can result in substantial improvements, because it avoids generating stack frames and the jumps into + out of the function. Thus we save 2 jumps, and any copying of arguments and return values1.
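As a rough sketch of what that means (my own illustration, not the benchmark code later in the post):

fn add_one(x: i64) -> i64 {
    x + 1
}

fn caller(x: i64) -> i64 {
    // Because the compiler knows exactly which function this is, it is free to
    // inline it: this call can compile down to the same code as writing `x + 1`
    // here directly, with no jump, no new stack frame, and no argument copies.
    add_one(x)
}

fn main() {
    println!("{}", caller(41));
}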
The downsides to this strategy are:
- Requiring the type to be known at compile time precludes type erasure and run-time flexibility.
- It forces more code to be generated, which can make the program slower. When you have static dispatch, you have templates. You can think of templates as code with placeholders. Any time the template / generic is used with a new type, the compiler makes a full copy of all the code with the given type substituted in for the placeholder (a sketch follows this list). CPU performance is largely driven by cache hit rate. Having more code decreases what can fit in the cache and decreases the probability of a hit. At a minimum, the executable size, and thus memory utilization, increases.
- Most languages don’t have good support (or any at all) for using strong abstractions like interfaces this way (Rust being an exception).
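Here’s a minimal sketch of the monomorphization point from the second bullet (the function and the types fed to it are mine, purely illustrative):

use std::fmt::Debug;

// Each distinct type argument gets its own full copy of this function's
// machine code in the binary (before whatever the optimizer decides to do later).
fn describe<T: Debug>(value: T) {
    println!("{:?}", value);
}

fn main() {
    describe(42i64);         // instantiates describe::<i64>
    describe("hello");       // instantiates describe::<&str>
    describe(vec![1, 2, 3]); // instantiates describe::<Vec<i32>>
}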
Premature optimization is the root of all evil
I have a pet theory, which is that for many people Rust is their first exposure to this concept. It’s certainly likely to be the first positive experience they’ve ever had with “low level” programming. When you become aware of this new topic, it becomes a fun, neat goal to pursue. This shininess, combined with unawareness of the bigger picture, results in a fixation on what is almost always an irrelevant consideration.
The allure of zero cost abstractions
Part of why Rust is so fast despite having so much safety and ergonomics is that it offers many zero-cost abstractions. This means that some high-level concept available to the programmer happens for free at run-time. Better abstractions in programming languages often cause slightly slower code to be generated, but in Rust the abstraction many times results in the same generated code as if you had done it the hard way. We can eat our cake and have it too, so it’s no wonder this becomes a topic of focus in Rust development.
But… It’s faster… So why don’t you just do it?
Well, as we already covered above, it isn’t always faster, and we only covered one aspect. And that gets to the central point: should you accept that something is faster just because someone else said it is? Of course not. Once you require evidence, you’ll often find it doesn’t exist, and that it is so laborious to gather that it makes the proposed “optimization” moot either way.
I once had to waste 30 minutes in a PR comment debate thread with another senior engineer, who had 30 years of experience, who insisted that if (foo) is slower than if (foo != FALSE), and that it was important that we “fix this” throughout the code base since it would “add up”. While that probably seems suspect to most readers, the thinking behind it2 is very similar to this topic and even relies on knowledge of how the CPU works at a hardware level (though its conclusions are false, whereas dispatch claims are at least often true, even if unimportant). You might not find this comparison compelling, but the principle is the same: an unsubstantiated claim is a dubious claim that must not be accepted.
Here’s a piece of advice: when someone says X complicated thing is always faster, they’re wrong, and they don’t even realize just how wrong they are. Modern CPUs are engineering marvels of tremendous complexity. Do you know the fundamentals of assembly and pipelining? If not, then you don’t know what you’re talking about. Have you only done introductory digital logic / computer architecture study? You still don’t know what you’re talking about.
To be clear, I’m the furthest thing from an expert here. For starters, at the ISA level I only ever studied ARM and MIPS. I’m pretty clueless on x86. But in grad school one of my 2 specializations / research focuses was computer architecture. This gave me an introduction to just how complicated things get. As a group we created a fully working processor, and as an individual I developed arithmetic circuitry for hardware accelerated machine learning. In other words, I know just enough to know what I don’t know.
Modern processors can be superscalar, pipelined, dynamically scheduled with out-of-order execution (meaning the CPU rewrites your program in real time as it reads it!), speculatively executing, and more.
If you’ve ever wondered why you can’t just look up how many clocks an instruction takes to execute, this is it. The answer is a wholly unsatisfying “it depends”. You’ll have to forgive me for establishing a bit of a tautology here:
If you don’t know what these things are, you don’t know what’s faster. If you do know what these are, then you know that the topic is too variable to give universal answers for.
This means we have to measure and/or have heuristics for when it’s probably faster. That all said, we do know that static dispatch is usually faster. My point is to make sure people tune their level of conviction appropriately. So, for the sake of discussion, let’s just assume we’re dealing exclusively with cases where it is indeed faster.
CPUs are stupid fast
I’m writing this on a computer with Apple’s M2 Max processor. This means it has:
- 8 P-cores (performance) up to 3.7GHz
- 4 E-cores (efficiency) up to 3.4GHz
- (also some hardware accelerated specialized circuits that are orders of magnitude faster that we’ll ignore today)
The G in GHz is “Giga”, meaning billion. So the faster cores can perform 3.7 billion clocks per second each. That means each clock cycle is on the order of a quarter of a nanosecond! And remember, modern CPUs can do multiple instructions per clock per core, which is why3 you can’t just compare clock speeds: each core type has a different level of optimization in it.
Humans can’t accurately perceive big numbers, let alone 10s of billions per second. Let’s clear this up with a contrived example:
trait MathOp {
    fn op(&self, v1: i64, v2: i64) -> i64;
}

#[derive(Default)]
struct Add {}

impl MathOp for Add {
    fn op(&self, v1: i64, v2: i64) -> i64 {
        v1 + v2
    }
}
With our math abstraction, let’s build a reduce function (in the map reduce sense) using static dispatch:
#[derive(Default)]
struct Reduce<Op: MathOp> {
    operation: Op,
}

impl<Op: MathOp> Reduce<Op> {
    pub fn reduce(&self, values: &Vec<i64>) -> i64 {
        let mut reduced_term = 0;
        for value in values {
            reduced_term = self.operation.op(reduced_term, *value);
        }
        reduced_term
    }
}
A dynamic dispatch variant would look nearly identical, just change:
struct BoxedReduce {
    operation: Box<dyn MathOp>,
}

impl BoxedReduce {
    pub fn new(operation: Box<dyn MathOp>) -> Self {
        Self { operation }
    }

    pub fn reduce(&self, values: &Vec<i64>) -> i64 { ... }
}
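For completeness, here’s how the two variants might be used side by side (a small sketch relying on the snippets above; it isn’t part of the benchmark itself):

fn main() {
    let values = vec![1, 2, 3, 4];

    // Static dispatch: Reduce<Add> is monomorphized, so op() can be inlined.
    let static_reduce = Reduce::<Add>::default();
    assert_eq!(static_reduce.reduce(&values), 10);

    // Dynamic dispatch: the concrete MathOp is resolved at run-time through the vtable.
    let dynamic_reduce = BoxedReduce::new(Box::new(Add::default()));
    assert_eq!(dynamic_reduce.reduce(&values), 10);
}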
What I’ve constructed here is just about the best case scenario for static dispatch over dynamic dispatch, because the overhead of a function call / vtable lookup is a substantial portion of the overall execution time. As a result, simply switching the dispatch method improved performance as measured by cargo bench by 203%4!
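For reference, here’s roughly how such a benchmark might be wired up with the nightly test crate and the types above (a sketch only; the full code linked in the footnotes is the real harness and may differ in its details):

#![feature(test)] // nightly-only benchmark harness
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn static_bench(b: &mut Bencher) {
    let reducer = Reduce::<Add>::default();
    let values = vec![1i64];
    // black_box keeps the optimizer from deleting the measured work outright.
    b.iter(|| reducer.reduce(black_box(&values)));
}

#[bench]
fn dynamic_bench(b: &mut Bencher) {
    let reducer = BoxedReduce::new(Box::new(Add::default()));
    let values = vec![1i64];
    b.iter(|| reducer.reduce(black_box(&values)));
}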
Let’s try running the same functions as real executables, this time with 4,000,000 values each instead of 1:
> time static_test
> static_test 0.00s
> time dynamic_test
> dynamic_test 0.00s
Hey, what gives?
Absolute vs relative improvements
Our 203% improvement is really exciting, until we notice the units from the single number bench run:
test dynamic_bench ... bench: 0.87 ns/iter (+/- 0.04)
test static_bench ... bench: 0.29 ns/iter (+/- 0.00)
Our impressive percentage is a relative improvement. While substantial, it’s an improvement on what is already an imperceptible amount of time. In absolute terms we’ve decreased execution time from a fraction of a nanosecond, to a fraction of a nanosecond. Yay?
To put that into perspective, between now and when you’ve finished reading this sentence, the amount of time saved per operation will have elapsed another 17 billion times5. If you read at an average speed and skipped the code snippets, it has elapsed 1.6 trillion times since you started reading this post6.
At what cost?
There’s no such thing as a free lunch, and there sure as shit isn’t one when it comes to hiring software engineers. In a GitHub blog post they noted that, using Stack Overflow survey data, you could estimate the average software engineer at $75/hr in total compensation. This is still an underestimate, by the way, because there are a lot of additional per-employee costs beyond pay and benefits.
Let’s say you and one other engineer waste just 5 minutes debating and then changing this code. That places the up-front engineering cost at $12.50. How many times does the function need to run to break even? If we consider a basic cloud server, which costs $0.009/hr, it comes out to 8,620 trillion invocations7.
Is this function really going to execute that many times? Don’t get me wrong, I’m not a fan of “it’s a drop in the bucket” type thinking. But you definitely have bigger fish to fry first, and if the payoff is too small then it’s not worth doing at all.
I’m not saying it’s never useful. We’ll discuss at the end when you should use static dispatch. But what I’m saying is more often than not, if you’re having this debate it’s over these kinds of numbers. This only matters in specific scenarios where you’ve profiled the code and can see this is causing appreciable issues.
Where Rust falls short
Ok, sure, it doesn’t always matter, and it’s almost never worth wasting breath over in a code review, but if we’re writing new code, why not make people write it the faster way next time? Then we don’t have to guess, and it’s not like it’s any harder, right?
No. While I have never felt compelled to say someone should switch from static to dynamic on the basis of complexity so far, it’s certainly not just as easy.
“You can opt into dynamic, but can never opt out”
A common piece of advice you will see is that, in general, library code should use static dispatch and application code dynamic dispatch. The thinking goes that the consumer of the library can choose to use dynamic dispatch, but if your library is written with dynamic dispatch, you’ve forced this choice on them. By choose, they mean you can force dynamic dispatch to occur by only using an instance over Box<dyn YourTrait>, which will force a single, dynamically dispatched implementation to be generated.
This isn’t true. It’s good advice overall, but it’s making too strong a claim. You can’t always use dynamic dispatch. See Object Safety. For some implementations, once you’ve picked static dispatch you’ve forced it on your consumers.
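Here’s a quick illustration of the kind of trait that shuts the dyn door (my own example, not from any particular library): a trait with a generic method isn’t object safe, so consumers can never box it.

use std::str::FromStr;

trait Codec {
    // A generic method makes this trait object-unsafe: the vtable would need an
    // entry for every possible T, so `dyn Codec` is not allowed.
    fn decode<T: FromStr>(&self, raw: &str) -> Option<T>;
}

struct SimpleCodec;

impl Codec for SimpleCodec {
    fn decode<T: FromStr>(&self, raw: &str) -> Option<T> {
        raw.trim().parse().ok()
    }
}

fn main() {
    // Static dispatch works fine:
    let n: Option<i64> = SimpleCodec.decode("42");
    println!("{:?}", n);

    // But a consumer can never opt into dynamic dispatch; this does not compile:
    // let boxed: Box<dyn Codec> = Box::new(SimpleCodec);
}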
Substituting dyn for impl
The introduction of impl was controversial to some, but I think it’s the pinnacle of what Rust seeks to offer: zero-cost abstractions that don’t sacrifice ergonomics. What’s cool about traits is that they’re, mostly, the same regardless of how they’re dispatched. That is to say, they’re used for any instance of “depend on an abstraction”. In most languages, generics are expressed with implicit duck typing at compile time, with no notion of “bounds” the implementor can provide.
impl is the next step in this process. While trait bounds allow us to use the same traits, they don’t allow the same terse syntax. Consider:
fn foo<B: Bar>(bar: B);
fn foo(bar: impl Bar);
fn foo(bar: &dyn Bar);
The impl example is so much more readable, even with this simple bound, and the more bounds and/or generic terms we add, the bigger the difference gets. Moreover, it reads just like the dynamic example, which makes it easier to read and easier to switch between the two. It means you can have the performance advantage, however small, without sacrificing legibility. Often when you work with generics, you don’t actually care what the function is generic over; that generic bit is an internal detail. In this case more than any other, the superiority of impl rings clear.
But, this syntax is illegal:
Arc<impl Bar>
As a result, using static dispatch poisons the code: I must use extra boilerplate, and every layer above must also be generic, even if it doesn’t care about this implementation detail. Whereas with dyn, it’s a hidden detail.
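Here’s a small sketch of that poisoning effect (the Bar trait and the struct names are mine, purely illustrative):

use std::sync::Arc;

trait Bar {
    fn bar(&self) -> u32;
}

struct RealBar;

impl Bar for RealBar {
    fn bar(&self) -> u32 {
        7
    }
}

// With dyn, the trait stays an internal detail of the struct:
struct AppDyn {
    widget: Arc<dyn Bar>,
}

// With static dispatch, the generic parameter leaks into this struct's signature,
// and into every layer that wraps an AppStatic, all the way up the stack.
struct AppStatic<B: Bar> {
    widget: Arc<B>,
}

fn main() {
    let dynamic = AppDyn { widget: Arc::new(RealBar) };
    let statik = AppStatic { widget: Arc::new(RealBar) };
    println!("{} {}", dynamic.widget.bar(), statik.widget.bar());
}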
Obtuse boilerplate + testing annoyances
Rust doesn’t allow you to define an alias for complex generic type bounds:
pub async fn wrap_service<S>(
    address: SocketAddr,
    service: S,
    exe_version: &'static str,
) -> Self
where
    S: Service<Request, Response = Response, Error = Infallible> + Send + Clone + 'static,
    <S as Service<Request>>::Future: Send,
{
    // Code adding more Tower layers to the service then starting it on a Tokio task
    // and wrapping in a management object.
}
This is one of our simpler examples in a web server, and look how obtuse it is. The generic bound is complicated and relies on very specific use declarations above. Any other function passing around the same service needs all of the same boilerplate, because the bound can’t be aliased and reused.
Here’s another fun block of syntax for a middleware:
impl<S, B> Service<Request<B>> for AddResponseHeader<S>
where
    S: Service<Request<B>> + Clone + Send + 'static,
    B: Send + 'static,
    S::Response: IntoResponse + Send,
    S::Error: Into<Infallible> + Send,
    S::Future: Send,
{
    type Response = Response;
    type Error = Infallible;
    type Future = MiddlewareFuture;

    // ...
}
The annoyance is even more pronounced in test code, where you both need this boilerplate and need to satisfy all of these requirements with your fakes too, and there’s no benefit, because test code isn’t even being optimized.
Extract function? Ha, what is this, Java?
Related to not being able to type alias with impl, remember that Rust requires explicit return types on all functions. Let’s say you have this code:
// Modified for brevity / example, this is an axum router.
let config = Arc::new(GovernorConfigBuilder::default()
    .per_millisecond(1000 / 20)
    .burst_size(20)
    .finish()
    .unwrap());
let limiter = Arc::new(config.limiter.clone());

interruptible_polling::fire_and_forget_polling_task(Duration::from_secs(300), move || {
    limiter.retain_recent();
});

router = router
    .nest(
        "/merged-date",
        build_merged_data_router(
            Arc::new(FooDataFetcher::new("127.0.0.1:8889".to_string())),
            Arc::new(BarDataFetcher::new("127.0.0.1:6666".to_string())),
        ),
    )
    .route(
        "/thing",
        get({
            let client = thing_client.clone();
            move |request| thing_reverse_proxy(client, request)
        }),
    )
    .nest("/nested-data", nested_data_router(State { client: db_client, config: config }))
    .layer(
        GovernorLayer {
            config
        }
    );
Later your requirements change, and you actually need to run two different web servers on two different ports, but the throttling logic should be the same for both. Easy enough, just extract that portion into a new function, right? Except… a big part of why the statically dispatched generics aren’t wholly an issue is that the compiler infers all the types for you when you consume builder patterns.
Now that you have to extract it, what on Earth is the type? One trick is to intentionally annotate the wrong type so the compiler spits out the real one, but that output isn’t usable either way: any nested impl abbreviations it prints are illegal in your own code, and besides, you want a generic bound, not the full concrete type. And if you do spell out the full stack of types, the signature becomes wrong the moment a new middleware is added.
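The trick itself looks something like this (a sketch with a stand-in value; the real Router<...> error output is far longer):

fn main() {
    // Stand-in for a builder-produced value whose type you can't easily name
    // (think an axum Router or a long iterator chain).
    let mystery = vec![1i64, 2, 3].into_iter().map(|x| x * 2).filter(|x| *x > 2);

    // The trick: deliberately annotate an impossible type so rustc prints the real one.
    // Uncommenting the next line yields roughly:
    //   error[E0308]: mismatched types
    //   expected `()`, found `Filter<Map<IntoIter<i64>, {closure}>, {closure}>`
    // let _: () = mystery;

    println!("{}", mystery.count());
}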
Thus, we need to screw around figuring out exactly which set of generic bounds the compiler inferred and now requires here. I pick on this example because it demonstrates how much more work this syntax is. Extract function isn’t a new auto-refactor concept, but I’m not aware of any tool that automatically handles this case in Rust; they can only do the simpler ones.
Net result? Toil. Particularly for anyone new to the topic. Is it impossible? No. Super hard? Also no. But it’s a phenomenal waste of time. Instead of working on the actual task at hand you’re playing 20 questions with the compiler about the trait bounds.
Fortunately, this got way better in axum 0.7 with simplified generics (which is what you saw above). In older versions, it was such a mess with so little working type inference that you were better off just duplicating your code.
When to use static dispatch
- When you don’t want type erasure, where having the full type information improves code quality
- When it’s natural
- When the overhead of a function call is a substantial portion of the overall cost
Even for that last case, it’s unlikely to matter. It’s mostly about what’s natural / good habit.
In closing, pick which method makes the most sense ergonomically. Unless we can reason from the operation alone that overhead is a significant factor, don’t waste time quibbling over imperceptible optimizations. Your time, and that of your colleagues, is worth far more than the computer’s time.
It’s worth noting, even without inlining you can still have Return Value Optimization (RVO), where the compiler notices that the returned temporary is only ever copied into its destination, never used on its own, so it constructs the value in place and eliminates the temporary object. This is an extremely pervasive optimization and is even mandatory in later C++ standards. ↩︎
To briefly explain, there was an element of truth to this like 40 years ago, however insignificant it was even then. The first giveaway that this is outdated thinking is that the snippet was != FALSE. The all caps indicates this person is using the Windows typedef, since older versions of C (and C++?) didn’t support bool as a type and thus there were no true and false. This code is doing an integer comparison, and FALSE was defined as 0.
Computers sometimes have special-case circuitry for special values. This could be because a shorter circuit exists for the special case, or because the nature of the operation means you can check the result sooner than you normally could (in digital logic, the purpose of the clock signal is to ensure enough time has passed for state changes to ripple through the full circuit).
You also have different types of conditional jumps. If the value you compare against is 0, there’s often a special variant that runs faster because it doesn’t need to load the number to compare against. If you have to compare full numbers, there’s too much data, and the processor will first do a compare, then the conditional jump based on the output. As always, we’re speaking in generalities here; it depends on which ISA is in use.
Their claim relied on two ideas: that the conditional jump on != 0 existed, and that the compiler wasn’t smart enough to notice because ints were used, not bools. This got walked back because A) the example in question was using bools and B) the compiler is more than smart enough to notice this. He eventually retreated to “well, just in case it doesn’t, we should do it”. This is stupid because it’s a waste of time, it’s not even a real concern, and how do you know the compiler won’t do the opposite? The compiler has no obligation to respect the way you’ve defined the comparison! ↩︎
Thermal throttling is a concern too. For instance, my laptop won’t hit 3.7GHz on a full load for very long. It will get too hot and start lowering the clock speed to compensate. ↩︎
You can find the full code here, but I wanted to cover some caveats. Ironically, it can be hard to quickly demonstrate the point that you shouldn’t care much about this stuff since computers are fast and compilers are so good at optimization… because computers are fast and compilers are so good at optimization. There are some peculiar things to notice.
Originally, I was going to compare benches of single vs millions of operations. Remember how we discussed that allowing the compiler to inline unlocks a bunch of other optimizations? Well, it unlocked some unintended ones. The full result was:
test dynamic_bench      ... bench: 0.87 ns/iter (+/- 0.04)
test full_dynamic_bench ... bench: 6,991,175.00 ns/iter (+/- 152,534.94)
test full_static_bench  ... bench: 0.29 ns/iter (+/- 0.00)
test static_bench       ... bench: 0.29 ns/iter (+/- 0.00)
The bench says that running the static operation 4 million times takes exactly as long as running it 1 time. I didn’t have time to investigate why exactly this is, but I suspect that because the bench uses fixed values the compiler decided to pre-compute the result of the test and removed all the code, not just the function calls.
But wait, there’s more. I went to remove those benches plus the Reduce and Subtract structs, since it wasn’t a valid test, and got this result:
test dynamic_bench ... bench: 0.29 ns/iter (+/- 0.03)
test static_bench  ... bench: 0.29 ns/iter (+/- 0.01)
Again, I didn’t bother looking but I suspect that this simplified everything so much that the compiler saw there was only one possible function and ignored my instruction to use dynamic dispatch. While this is a particularly outlandish example, this general idea is why it’s wise to do functional programming (iterators, heavily layered lambdas like with Tokio services, etc) with static dispatch. In those cases the abstraction is a larger percentage of execution time, but moreover having such chopped up code decreases the ability of the compiler to do creative optimizations like it obviously did here.
If this kind of topic interests you, I strongly encourage playing around with the code sample. Add and remove implementations, change the number of loop iterations, look at the generated assembly. You can learn a lot. I wanted to include this for the sake of honesty, but also to truly instill in you just how complicated of a topic micro-optimization of code is.
But most of all, I want to leave you with this. Even in the case where I accidentally made it ~10 million times slower than I meant to, it still executed 4 million times in just 7 milliseconds. Computers are really, really fast. Until you have a performance issue and have profiled the code, this isn’t a topic you should be wasting time on in a code review. The reason so much of this post focuses on why static dispatch is bad or pointless is because it doesn’t need any championing. The community is already quite fond of it. ↩︎
An average reading speed of 130 words per minute over the 21-word sentence yields \((21/130 \times 60)\,\text{Gns} / (0.87 - 0.29)\,\text{ns} = 16.71\text{G}\). ↩︎
~2100 words up until this point with average reader speed yields \((2100/130 \times 60)\,\text{Gns} / (0.87 - 0.29)\,\text{ns} = 1600.71\text{G}\). ↩︎
A little more math this time, but still straightforward:
cost per ns: \(\$0.009/\text{hr} \div (60 \times 60 \times 10^9\,\text{ns/hr}) = \$2.5 \times 10^{-15}\)
saved per invocation: \(\$2.5 \times 10^{-15} \times (0.87 - 0.29) = \$1.45 \times 10^{-15}\)
break even at: \(\$12.50 / (\$1.45 \times 10^{-15}) \approx 8{,}620{,}689{,}655{,}172{,}414\) invocations ↩︎