This Isn't How PyPy Works, But it Might as Well Be

It seems like a lot of the Python programmers I speak with are deeply confused by PyPy and can't understand how it works. The stereotypical interlocutor will often say things like: A Python VM in Python? That's just crazy! How can that be fast? Isn't Python slower than C? Aren't all compilers written in C? How does it make an executable?

I am not going to describe to you how PyPy actually works. Lucky for you, I'm not smart enough to do that. But I would like to help you all understand how PyPy could work, and hopefully demystify the whole idea.

The people who are smart enough to explain how PyPy actually works will do it over at the PyPy blog. At some level it's really quite straightforward, but this impression of straightforwardness is not conveyed well by posts with titles like "Optimizing Traces of the Flow Graph Language". In addition to being a Python interpreter in Python, PyPy is a mind-blowingly advanced exploration of the cutting-est cutting-edge compiler and runtime technology, which can make it seem complex. In fact, the fact that it's in Python is what lets it be so cutting-edge.

Most people with a formal computer science background are already familiar with the fairly generic nature of compilers, as well as the concept of a self-hosting compiler. If you do have that background, then that's all PyPy is: a self-hosting compiler. The same way GCC is written in C, PyPy is written in Python. When you strip away the advanced techniques, that's all that's there.

A lot of folks who are confused by PyPy's existence, though, I suspect don't have that background; many working programmers these days don't. Or if they do, they've forgotten it, because the practical implications of the CSS box model are so complex that they squeeze simpler ideas, like turing completeness and the halting problem, out of the average human brain. So here's the easier explanation.

A compiler is a program that turns a string (source code: your program text written in Python, C, Ruby, Java, or whatever) into some kind of executable code (bytecode or runtime interpreter operations or a platform-native executable).

Let's examine that last one, since it seems to be a sticking point for most folks. A platform-native executable is simply a bunch of bytes in a file. There's nothing magic about it. It's not even a particularly complex type of file. It's a packed binary file, not a text file, but so are PNGs and JPEGs, and few programmers find it difficult to believe that such files might be created by Python. The formats are standard and very long-lived and there are tons of tools to work with them. If you're curious, even Wikipedia has a good reference for the formats used by each popular platform.

As to Python being slower than C: once a program has been transformed into executable code, it doesn't matter how slow the process for translating it was: the running program is now just executable instructions for your CPU, so it doesn't matter that Python is slower than C, because it was just the compiler that was in Python, and by the time your program is running, the original Python has effectively vanished and all you're left with is your program executing.

(Actually, Python is faster than C anyway, especially at producing strings.)

In reality, PyPy takes a hybrid approach, where it is a program which produces a program and then does some stuff to it and creates some C code which it compiles with the compiler of your choice and then creates some code which then creates other code and then puts it into memory, not a file, and then executes it directly, but all of that is ancillary tricks and techniques to make your code run faster, not a fundamental property of the kind of thing that PyPy is. Plus, as I said, this article isn't actually about how PyPy works anyway, it's just about how you should pretend it works. So you should ignore this whole paragraph.

For the sake of argument, assume that you know all the ins and outs of binary executable formats for different operating systems, and the machine code for various CPU architectures. The question you should really ask yourself is: if you have to write a program (a compiler) which translates one kind of string (source code) into another kind of string (a compiled program): would you rather write it in C or Python? What if the strings in question were a template document and an HTML page?

It shouldn't be surprising that PyPy is written in Python. For the same reasons that you might use Django templates and not snprintf for generating your HTML, it's easier to use Python than C to generate compiled code. This is why PyPy is at the forefront of so many advanced techniques that are too sophisticated to cover in a quick article like this. Since the compiler is written in a higher-level language, it can do more advanced things, since lower-level concerns can be abstracted away, just as they are in your own applications.