In this post I’d like to convince you that you should be running
Mypyc over your code —
especially if your code is a library you upload to PyPI — for both your own
benefit and that of the Python ecosystem at large.
But first, let me give you some background.
Python is Slow, And That’s Fine, Because It’s Fast Enough
A common narrative about Python’s value proposition, from the very earliest
days of the language, often recited in response to a teammate saying
“shouldn’t we just write this in $HIGHER_PERFORMANCE_LANGUAGE
instead?” goes
something like this:
Sure, Python is slow.
But that’s okay, because it saves you so much time over implementing your
code in $HIGHER_PERFORMANCE_LANGUAGE
that you’ll have so much more time for
optimizing those hot-spots where performance is really critical.
And if the language’s primitives are too slow to micro-optimize those
hot-spots enough, that’s okay too, because you can always re-write just those
small portions of the program as a C extension module.
Python’s got you covered!
There is some truth to this narrative, and I’ve quoted from it myself on many
occasions. When I did so, I was not quoting it as some facile, abstract
hypothetical, either. I had a few projects, particularly very early in my
Python career, where I replaced performance-critical C++ code with one tenth
the number of lines of Python, and improved performance by orders of magnitude
in the process.
When you have algorithmically interesting, performance-sensitive code that can
benefit from a high-level expressive language, and the resources to invest in
making it fast, this process can be counterintuitively more efficient than
other, “faster” tools. If you’re working on massively multiplayer online
games or something equally technically challenging, Python can be a
surprisingly good idea.
But… Is It Fine, Though?
This little nugget of folk wisdom does sound a bit defensive, doesn’t it? If
Python were just fast, you could just use it; you wouldn’t need this litany
of rationalizations. Surely if we believed that performance is
important in our
own Python code, we wouldn’t try to wave away the performance of Python itself.
Most projects are not massively multiplayer online games. On many
straightforward business automation projects, this sort of staged approach to
performance is impractical.
Not all performance problems are hot
spots.
Some programs have to be fast all the way through. This is true of some
complex problems, like compilers and type checkers, but is also often the case
in many kinds of batch processing; there are just a lot of numbers, and you
have to add them all up.
More saliently for the vast majority of average software projects, optimization
just isn’t in the budget. You do your best on your first try and hope that none
of those hot spots get too hot, because as long as the system works within a
painfully generous time budget, the business doesn’t care if it’s slow.
The progression from “idiomatic Python” to “optimized Python” to “C” is a
one-way process that gradually loses the advantages that brought us to Python
in the first place.
The difficult-to-reverse nature of each step means that once you have
prototyped out a reasonably optimized data structure or algorithm, you need to
quasi-permanently commit to it in order to squeeze out more straight-line
performance of the implementation.
Plus, the process of optimizing Python often destroys its readability, for a
few reasons:
- Optimized Python relies on knowledge of unusual tricks. Things like “use
the array module instead of lists”, and “use % instead of .format”.
- Optimized Python requires you to avoid the things that make Python code
nicely organized:
  - method lookups are slow so you should use functions.
  - object attribute accesses are slow so you should use tuples with
    hard-coded numeric offsets.
  - function calls are slow so you should copy/paste and inline your logic.
- Optimized Python requires very specific knowledge of where it’s going to be
running, so you lose the flexibility of how to run it: making your code fast
on CPython might make it much slower on PyPy, for example. Native extension
modules can make your code faster, but might also make it fail to run inside
a browser, or add a ton of work to get it set up on a new operating system.
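To make the readability cost concrete, here is a small illustrative sketch of the same computation written both ways. The Particle class and both function names are hypothetical, invented for this example; whether the second version is actually faster depends on the interpreter and the workload, and the point is only how much harder it is to read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Particle:
    x: float
    velocity: float

    def step(self, dt: float) -> None:
        self.x += self.velocity * dt


def simulate_idiomatic(particles: List[Particle], dt: float) -> float:
    """Readable version: named attributes and one method call per particle."""
    for particle in particles:
        particle.step(dt)
    return sum(particle.x for particle in particles)


def simulate_hand_tuned(rows: List[Tuple[float, float]], dt: float) -> float:
    """'Fast' dialect: tuples with numeric offsets, logic inlined by hand."""
    total = 0.0
    for row in rows:
        # row[0] is x and row[1] is velocity; nothing in the code says so.
        total += row[0] + row[1] * dt
    return total
```

Both functions compute the same total, but only the first one tells a future maintainer what the numbers mean.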
Maintaining good performance is part of your software’s development lifecycle,
not just a thing you do once and stop. So by moving into this increasingly
arcane dialect of “fast” Python, and then into another programming language
entirely with a C rewrite, you end up having to maintain C code anyway. Not to
mention the fact that rewriting large amounts of code in C is both ludicrously
difficult (particularly if your team primarily knows Python) and also
catastrophically dangerous. In recent
years, safer tools such as PyO3 have become
available, but they still involve switching
programming languages and rewriting all your code as soon as you care about
speed.
So, for Python to be a truly general-purpose language, we need some way to
just write Python, and have it be fast.
It would benefit every user of Python for there to be an easy, widely-used
way to make idiomatic, simple Python that just does stuff like adding numbers,
calling methods, and formatting strings in a straight line go really fast —
exactly the sorts of things that are the slowest in Python, but are also the
most common, particularly before you’ve had an opportunity to cleverly
optimize.
We’ve Been Able To At Least Make Do
There are also a number of tools that have long been in use for addressing this
problem: PyPy, Pyrex, Cython, Numba, and Numpy to name a few. Their
maintainers all deserve tremendous amounts of credit, and I want to be very
clear that this post is not intended to be critical of anyone’s work here.
These tools have drawbacks, but many of those drawbacks make them much better
suited to specialized uses beyond the more general 80% case I’m talking about
in this post, for which Mypyc would not be suitable.
Each one of these tools imposes limitations on either the way that you write
code or where you can deploy it.
Cython and Numba aren’t really “Python” any more, because they require
special-purpose performance-oriented annotations. Cython has long supported
pure-Python type annotations, but you won’t get any benefit from telling it
that your variable is an int, only a cython.int. It can’t optimize a
@dataclass, only a @cython.cclass. And so on.
PyPy gets the closest — it’s definitely regular Python — but its strategy has
important limitations. Primarily, despite the phenomenal and heroic effort
that went into cpyext, it seems like there’s always just one PyPy-incompatible
library
in every large, existing project’s dependency list which makes it impossible to
just drop in PyPy without doing a bunch of arcane debugging first.
PyPy might make your program magically much faster, but if it doesn’t work,
you have to read the tea leaves on the JIT’s behavior in a profiler which
practically requires an online component that doesn’t even work any
more. So mostly you just simplify your code to use more
straightforward data structures and remove CPython-specific tricks that might
trip up the JIT, and hope for the best.
PyPy also introduces platform limitations. It’s always — understandably, since
they have to catch up after the fact — lagging a bit behind the most recently
released version of CPython, so there’s always some nifty language feature that
you have to refrain from using for at least one more release cycle.
It also has architectural limitations. For example, it performs quite poorly
on an M1 Mac since it still runs under x86_64 emulation on that platform. And
due to iOS forbidding 3rd-party JITs, it won’t ever be able to provide better
performance in one of the more constrained environments that needs it more than
other places. So you might need to rely on CPython on those platforms anyway…
and you just removed all your CPython-specific hacks to try to please the JIT
on the other platforms you support.
So while I would encourage everyone to at least try their code on PyPy — if
you’re running a web-based backend, it might save you half your hardware
budget — it’s not going to solve “python is slow” in the general case.
It’ll Eventually Be All Right
This all sounds pretty negative, so I would be remiss if I did not also point
out that the core team is well aware that Python’s default performance needs
to be better, and Guido van Rossum literally came out of retirement for one
last job to fix it, and we’ve already seen a bunch of benefits from that effort.
But there are some fundamental limitations on the long-term strategy for these
optimizations; one of the big upcoming improvements is a JIT, which suffers
from some (but not all) of the same limitations as
PyPy,
and the late-bound, freewheeling nature of Python inherently comes with some
performance tradeoffs.
So it would still behoove us to have a strategy for production-ized code that
gives good, portable, ahead-of-time performance.
But What About Right Now?
Mypyc takes the annotations meant
for Mypy and generates C with them, potentially turning your code into a much
more efficient extension module. As part of Mypy itself, it does this with
your existing Python type-hints, the kind you’d already use Mypy with to check
for correctness, so it doesn’t entail much in the way of additional work.
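As a minimal sketch of what “existing Python type-hints” means here, the following is ordinary annotated Python. The function running_total is a hypothetical example, not something from Mypyc or Mypy; it uses only standard annotations, so the same file can run under the interpreter, be checked by Mypy, and should be the kind of straight-line code Mypyc is designed to compile.

```python
from typing import List


def running_total(values: List[int]) -> List[int]:
    """Return cumulative sums; plain annotated Python, nothing compiler-specific."""
    totals: List[int] = []
    current: int = 0
    for value in values:
        current += value
        totals.append(current)
    return totals
```

Nothing in this file imports or mentions the compiler, which is the design choice that distinguishes this approach from annotation dialects that only a specific tool understands.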
I’d been curious about this since it was initially
released, but
I still haven’t had a hard real-world performance problem to really put it
through its paces.
So when I learned about the High Throughput Fizzbuzz Challenge via its
impressive assembler implementation that achieves 56 GiB/s, and I saw even
heavily-optimized Python implementations sitting well below the performance of
a totally naïve C reference implementation, I thought this would be an
interesting miniature experiment to use to at least approximate practical usage.
In Which I Design A Completely Unfair Fight Which I Will Then Handily Win
The dizzying heights of cycle-counting hand-tuned assembler implementations of
this benchmark are squarely out of our reach, but I wanted to see if I could
beat the performance of this very naïve C implementation with Python that was
optimized, but still at least somewhat idiomatic and readable.
I am about to compare a totally naïve C implementation with a fairly optimized
hand-tuned Python one, which might seem like an unfair fight. But what I’m
trying to approximate here is a micro-instance of the real-world
development-team choice that looks like this:
Since Python is more productive, but slower, the effort to deliver each of
the following is similar:
- a basic, straightforward implementation of our solution in C
- a moderately optimized Python implementation of our solution
and we need to choose between them.
This is why I’ll just be showing naïve C and not unrolling any loops; I’ll use
-O3 because any team moderately concerned with performance would at least
turn on the most basic options, but nothing further.
Furthermore, our hypothetical team also has this constraint, which really every
reasonable team should:
We can trade off some readability for efficiency, but it’s important that
our team be able to maintain this code going forward.
This is why I’m doing a bit of optimizing in Python but not going all out by
calling mmap or pulling in numpy or attempting to use something super
esoteric like a SIMD library to emulate what the assembler implementations do.
The goal is that this is normal Python code with a reasonable level of
systems-level understanding (i.e. accounting for the fact that pipes have
buffers in the kernel and approximately matching their size maximizes
throughput).
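The pipe-buffer idea can be sketched in isolation as follows. This is an illustration rather than the approach used in the listings in this post: the function write_in_chunks and the constant ASSUMED_PIPE_BUFFER_SIZE are hypothetical names, and the 64 KiB figure is an assumption based on the default pipe capacity on Linux; other platforms use different sizes.

```python
import io
from typing import BinaryIO, Iterable

# Assumption: 64 KiB is the default pipe capacity on Linux; other platforms differ.
ASSUMED_PIPE_BUFFER_SIZE = 64 * 1024


def write_in_chunks(
    lines: Iterable[bytes],
    sink: BinaryIO,
    chunk_size: int = ASSUMED_PIPE_BUFFER_SIZE,
) -> int:
    """Batch small lines into writes of roughly chunk_size bytes.

    Returns the number of write() calls made on the sink.
    """
    pending = bytearray()
    writes = 0
    for line in lines:
        pending += line
        if len(pending) >= chunk_size:
            sink.write(bytes(pending))
            writes += 1
            pending.clear()
    if pending:
        # Flush whatever is left over after the last full chunk.
        sink.write(bytes(pending))
        writes += 1
    return writes
```

In real use the sink would be something like sys.stdout.buffer; an io.BytesIO works for trying it out, since the function only needs an object with a write() method that accepts bytes.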
If you want to see FizzBuzz pushed to its limit, you can go check out the
challenge
itself.
Although I think I do coincidentally beat the performance of the Python
versions they currently have on there, that’s not what I’m setting out to do.
So with that elaborate framing of this slightly odd experiment out of the way,
here’s our naïve C version:
    #include <stdio.h>

    int main() {
        for (int i = 1; i < 1000000000; i++) {
            if ((i % 3 == 0) && (i % 5 == 0)) {
                printf("FizzBuzz\n");
            } else if (i % 3 == 0) {
                printf("Fizz\n");
            } else if (i % 5 == 0) {
                printf("Buzz\n");
            } else {
                printf("%d\n", i);
            }
        }
    }
First, let’s do a quick head-to-head comparison with a naïve Python
implementation of the algorithm:
    def fizzbuzz() -> None:
        for counter in range(1, 1000000000):
            fizz = counter % 3 == 0
            buzz = counter % 5 == 0
            if fizz:
                print("Fizz", end="")
            if buzz:
                print("Buzz", end="")
            if not (fizz or buzz):
                print(counter, end="")
            print()


    if __name__ == "__main__":
        fizzbuzz()
Running both of these on my M1 Max MacBook, the naïve C implementation yields
127 MiB/s of Fizzbuzz output. But, as I said, although we’re not going to have
time for testing a more complex optimized C version, we would want to at
least build it with the performance benefits we get for free with the -O3
compiler option. It turns out that yields us a 27 MiB/s speedup. So 154 MiB/s
is the number we have to beat.
The naïve Python version achieves a dismal 24.3 MiB/s, due to a few
issues. First of all, although it’s idiomatic, print() is doing a lot of
unnecessary work here. Among other things, we are encoding Unicode, which the
C version isn’t. Still, our equivalent of adding the -O3 option for C is
running mypyc without changing anything, and that yields us a 6.8 MiB/s
speedup immediately. We still aren’t achieving comparable performance, but a
roughly 25% performance improvement for no work at all is a promising start!
In keeping with the “some optimizations, but not so much that it’s illegible”
constraint described above, the specific optimizations I’ve chosen to pursue
here are:
- switch to using bytes objects and sys.stdout.buffer to avoid encoding overhead
- take advantage of the repeating nature of the pattern in FizzBuzz output and
pre-generate a template rather than computing each line independently
- fill out the buffer with the relevant integers from a sequence as we go
- tune the repetition of that template to a size that roughly fills a pipe
buffer on my platform of choice
Hopefully, with that explanation, this isn’t too bad:
    from sys import stdout
    from typing import Tuple, Iterable


    def precompute_template() -> Iterable[bytes]:
        for counter in range(1, 16):
            fizz = counter % 3 == 0
            buzz = counter % 5 == 0
            if fizz:
                yield b"Fizz"
            if buzz:
                yield b"Buzz"
            if not (fizz or buzz):
                yield b"%d"
            yield b"\n"


    chunk_copies = 4
    precomputed_template_chunks = list(precompute_template())
    format_string = b"".join(precomputed_template_chunks)
    number_indexes = [
        number_index
        for number_index, line_content in enumerate(format_string.split(b"\n"))
        if line_content == b"%d"
    ]
    format_string *= chunk_copies


    def fizzbuzz() -> None:
        num: int = 1
        output = stdout.buffer.write
        for num in range(1, 1000000001, 15 * chunk_copies):
            t: Tuple[int, ...] = tuple(
                (
                    x + number_index
                    for x in range(num, num + (15 * chunk_copies), 15)
                    for number_index in number_indexes
                )
            )
            output(format_string % t)


    if __name__ == "__main__":
        fizzbuzz()
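Since the template trick is less obvious than the naïve loop, a quick equivalence check is reassuring. The following is a simplified sketch, not the benchmarked code: naive_lines and templated_lines are hypothetical helpers that build lists of lines in memory (one 15-line cycle at a time, with no chunk_copies tuning) so the two strategies can be compared directly.

```python
from typing import List


def naive_lines(limit: int) -> List[bytes]:
    """Reference output: one FizzBuzz line per integer in 1..limit."""
    lines: List[bytes] = []
    for counter in range(1, limit + 1):
        if counter % 15 == 0:
            lines.append(b"FizzBuzz")
        elif counter % 3 == 0:
            lines.append(b"Fizz")
        elif counter % 5 == 0:
            lines.append(b"Buzz")
        else:
            lines.append(b"%d" % counter)
    return lines


def templated_lines(limit: int) -> List[bytes]:
    """Same output, built by filling in a precomputed 15-line template."""
    # One cycle of the pattern, with b"%d" placeholders for plain numbers.
    template_lines = [
        b"FizzBuzz" if n % 15 == 0
        else b"Fizz" if n % 3 == 0
        else b"Buzz" if n % 5 == 0
        else b"%d"
        for n in range(1, 16)
    ]
    template = b"\n".join(template_lines)
    # Zero-based positions within the cycle that hold a number.
    offsets = [i for i, line in enumerate(template_lines) if line == b"%d"]
    output: List[bytes] = []
    for start in range(1, limit + 1, 15):
        filled = template % tuple(start + offset for offset in offsets)
        output.extend(filled.split(b"\n"))
    # The last cycle may run past the limit, so trim the excess lines.
    return output[:limit]
```

Comparing naive_lines(n) with templated_lines(n) for a modest n is a cheap way to confirm the placeholder offsets line up before running the full benchmark.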
Running this optimized version actually gets us within the ballpark of the
naïve C version, even beating it by a hair; my measurement was 159 MiB/s, a
small improvement even over -O3. So, per the “litany against C” from the
beginning of this post, algorithmic optimization of Python really does help a
lot; it’s not just a rationalization. This is a much bigger boost than our
original no-effort Mypyc run: throughput is roughly 6.5 times that of the
naïve version, which works out to about an 85% reduction in time per byte of
output; definitely bigger than 25%.
But clearly we’re still being slowed down by Python’s function call overhead,
object allocations for small integers, and so on, so Mypyc should help us out
here: and indeed it does. On my machine, it nets a whopping 233 MiB/s. Now
that we are accounting for performance and optimizing a bit, Mypyc’s relative
advantage has doubled to a roughly 50% improvement in performance over both the
optimized-but-interpreted Python and naïve C versions.
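The percentages in this section are easier to compare when they are all computed the same way. The following is a small arithmetic sketch that uses only the throughput figures quoted above; the helper names percent_faster and percent_less_time are made up for illustration.

```python
# Throughput figures quoted in this post, in MiB/s.
NAIVE_C = 127.0
C_O3 = 154.0
NAIVE_PYTHON = 24.3
NAIVE_PYTHON_MYPYC = 24.3 + 6.8
OPTIMIZED_PYTHON = 159.0
OPTIMIZED_PYTHON_MYPYC = 233.0


def percent_faster(new: float, old: float) -> float:
    """Throughput increase of new over old, as a percentage."""
    return (new / old - 1.0) * 100.0


def percent_less_time(new: float, old: float) -> float:
    """Reduction in time per byte when moving from old to new, as a percentage."""
    return (1.0 - old / new) * 100.0
```

By these measures the no-effort Mypyc run is about 28% faster than naïve Python, the optimized version runs at about 6.5 times the naïve throughput (roughly an 85% reduction in time per byte), and the compiled optimized version is about 47% faster than its interpreted counterpart and about 51% faster than the -O3 C build.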
It’s worth noting that the technique I used to produce the extension modules to
test was literally pip install mypy; mypyc .../module.py, then
python -c "import module". I did already have a C compiler installed, but other
than that, there was no setup.
I just wrote Python, and it just worked.
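One practical question with this workflow is whether the import actually picked up the compiled extension rather than the original .py file. A minimal way to check, assuming the compiled file ends up somewhere on the import path, is to look at the suffix of the module’s __file__. The helper is_compiled below is a hypothetical name; it relies only on the standard library.

```python
from importlib.machinery import EXTENSION_SUFFIXES
from types import ModuleType


def is_compiled(module: ModuleType) -> bool:
    """True if the module was loaded from a native extension file."""
    # Built-in modules have no __file__ at all, so treat that as "not compiled".
    path = getattr(module, "__file__", None) or ""
    return any(path.endswith(suffix) for suffix in EXTENSION_SUFFIXES)
```

Calling is_compiled on an imported module before and after a build is a quick sanity check that a benchmark is measuring the extension module and not the interpreted source.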
The Call To Adventure
Here’s what I want you to take away from all this:
- Python can be fast.
- More importantly, your Python can be fast.
- For a fairly small investment of effort, your Python code can be made
meaningfully faster.
Unfortunately, due to the limitations and caveats of existing powerful
performance tools like Cython and PyPy, over the last few years in the Python
community a passive consensus has emerged. For most projects, in most cases,
it’s just not worth it to bother to focus on performance. Everyone just uses
the standard interpreter, and only fixes the worst performance regressions.
We should, of course, be glad that the standard interpreter is reliably
getting faster all the time
now,
but we shouldn’t be basing our individual libraries’ and applications’
performance strategies on that alone.
The projects that care the most about performance have made the effort to use
some of these tools, and they have often invested huge amounts of effort to
good effect, but often they care about performance too much. They make the
problem look even harder for everyone else, by essentially stipulating that
step 1 is to do something extreme like give up and use Fortran for all the
interesting stuff.
My goal with this post is to challenge that status quo, spark interest in
revisiting the package ecosystem’s baseline performance expectations, and to
get more projects — particularly libraries on PyPI — to pick up Mypyc and
start giving Python a deserved reputation for being surprisingly fast.
The Last Piece of the Puzzle
One immediate objection you might be thinking of is the fact that, under the
hood, Mypyc is emitting some C code and building it, and so this might create
a problem for deployment: if you’ve got a Linux machine but 30% of your users
are on Windows, moving from pure-Python to this hybrid workflow might create
installation difficulties for them, or at least they won’t see the benefits.
Luckily a separate tool should make that a non-issue: cibuildwheel. “CI Build
Wheel”, as its name suggests, lets you build your wheels in your continuous
integration system, and upload those builds automatically upon tagging a
release.
Often, the bulk of the work in using it is dealing with the additional
complexities involved in setting up your build environment in CI to make sure
you’re appropriately bundling in any native libraries you depend upon, and
linking to them in the correct way. Mypyc’s limitation relative to Cython is a
huge advantage here: it doesn’t let you link to other native libraries, so
you can always skip the worst step here.
So, for maintainers, you don’t need to maintain a pile of janky VMs on your
personal development machine in order to serve your users. For users, nobody
needs to deal with the nightmare of setting up the right C compiler on their
Windows machine, because the wheels are prebuilt. Even users without a
compiler who want to contribute new code or debug it can run it with the
interpreter locally, and let the cloud handle the complicated compilation steps
later. Once again, the fact that you can’t require additional, external C
libraries here is a big advantage; it prevents you from making the user’s
experience inadvertently worse.
cibuildwheel supports all major operating systems and architectures, and
supported versions of Python, and even lets you build wheels for PyPy while
you’re at it.
Putting It All Together
Using Mypyc and cibuildwheel, we, as PyPI package maintainers, can
potentially produce an ecosystem of much faster out-of-the-box experiences via
prebuilt extension modules, written entirely in Python, which would make the
average big Python application with plenty of dependencies feel snappier than
expected. This doesn’t have to come with the pain that we have unfortunately
come to expect from C extensions, either as maintainers or users.
Another nice thing is that this is not an all-or-nothing proposition. If you
try PyPy and it blows up in some obscure way on your code, you have to give up
on it unless you want to fully investigate what’s happening. But if you trip
over a bug in Mypyc, you can report the bug, drop the module where you’re
having the problem from the list of things you’re trying to compile, and move
on. You don’t even have to start out by trying to jam your whole project
through it; just pick a few key modules to get started, and gradually expand
that list over time, as it makes sense for your project.
In a future post, I’ll try to put all of this together myself, and hopefully
it’s not going to be embarrassingly difficult and make me eat my words.
Despite not having done that yet, I wanted to put this suggestion out now, to
get other folks thinking about getting started with it. For older
projects, retrofitting all the existing infrastructure to put Mypyc in
place might be a bit of a challenge. But for new projects starting today,
putting this in place when there’s very little code might be as simple as
adding a couple of lines to pyproject.toml and copy-pasting some YAML into
a GitHub workflow.
If you’re thinking about making some new open source Python, give Mypyc a try,
and see if you can delight some users with lightning speed right out of the
box. If you do, let me know how it turns
out.
Acknowledgments
Thanks to Donald Stufft, Moshe Zadka, Nelson Elhage, Itamar Turner-Trauring,
and David Reid for extensive feedback on this post. As always, any errors or
inaccuracies remain my own.