It seems like a lot of the Python programmers I speak with are deeply
confused by PyPy and can't understand how it works. The stereotypical
interlocutor will often say things like: "A Python VM in Python? That's
just crazy! How can that be fast? Isn't Python slower than C? Aren't all
compilers written in C? How does it make an executable?"
I am not going to describe to you how PyPy actually works. Lucky for
you, I'm not smart enough to do that. But I would like to help you all
understand how PyPy could work, and hopefully demystify the whole idea.
The people who are smart enough to explain how PyPy actually works will do
it over at the PyPy blog. At some level it's really quite straightforward,
but this impression of straightforwardness is not conveyed well by posts
with titles like "Optimizing Traces of the Flow Graph Language".
In addition to
being a Python interpreter in Python, PyPy is a mind-blowingly advanced
exploration of the cutting-est cutting-edge compiler and runtime technology,
which can make it seem complex. In fact, being written in Python is what
lets it be so cutting-edge.
Most people with a formal computer science background are already familiar
with the fairly generic nature of compilers, as well as the concept of a
self-hosting compiler. If you do have that background, then that's all
PyPy is: a self-hosting compiler. The same way GCC is written in C,
PyPy is written in Python. When you strip away the advanced
techniques, that's all that's there.
I suspect, though, that a lot of folks who are confused by PyPy's existence
don't have that background; many working programmers these days don't. Or if
they do, they've forgotten it, because the practical implications of the CSS
box model are so complex that they squeeze simpler ideas, like Turing
completeness and the halting problem, out of the average human brain. So
here's the easier explanation.
A compiler is a program that turns a string (source code: your program text
written in Python, C, Ruby, Java, or whatever) into some kind of executable
code (bytecode or runtime interpreter operations or a platform-native
executable).
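To make "turns a string into executable code" concrete, here is a minimal
sketch of a toy compiler. It is not how PyPy does anything; the instruction
names (PUSH, ADD, SUB, MUL) and the functions compile_expr and run are made up
for illustration, and the standard library's ast module does the parsing.

```python
import ast

# Hypothetical instruction names for a made-up stack machine.
_BINOPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def compile_expr(source):
    """Translate an arithmetic expression (a string) into a list of instructions."""
    tree = ast.parse(source, mode="eval")
    program = []

    def emit(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            program.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            # Operands first, then the operator: postfix order suits a stack.
            emit(node.left)
            emit(node.right)
            program.append((_BINOPS[type(node.op)], None))
        else:
            raise SyntaxError(f"unsupported syntax: {ast.dump(node)}")

    emit(tree.body)
    return program


def run(program):
    """Execute the instruction list; this part knows nothing about the source text."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
            continue
        right = stack.pop()
        left = stack.pop()
        if op == "ADD":
            stack.append(left + right)
        elif op == "SUB":
            stack.append(left - right)
        elif op == "MUL":
            stack.append(left * right)
    return stack.pop()
```

Calling compile_expr("1 + 2 * 3") produces a list of five instructions, and
run() on that list gives 7. The compiler and the thing that executes its
output are two separate pieces.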
Let's examine that last one, since it seems to be a sticking point for most
folks. A platform-native executable is simply a bunch of bytes in a
file. There's nothing magic about it. It's not even a particularly
complex type of file. It's a packed binary file, not a text file, but
so are PNGs and JPEGs, and few programmers find it difficult to believe that
such files might be created by Python. The formats are standard and
very long-lived and there are tons of tools to work with them. If you're
curious, even Wikipedia has a good reference for the formats used by each
popular platform.
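As a rough illustration of "just a bunch of bytes", the sketch below builds a
tiny executable with nothing but the struct module. It assumes one specific
platform, 64-bit Linux on x86-64 (the ELF format), and the names build_elf and
write_executable are invented for this example. The program it describes does
only one thing: exit with a chosen status code.

```python
import os
import struct

LOAD_ADDRESS = 0x400000  # a conventional base address on x86-64 Linux
HEADER_SIZE = 64 + 56    # ELF header plus one program header


def build_elf(exit_code):
    """Return the bytes of a minimal ELF executable that exits with exit_code."""
    # Machine code: mov edi, exit_code; mov eax, 60 (exit syscall); syscall
    code = b"\xbf" + struct.pack("<I", exit_code) + b"\xb8\x3c\x00\x00\x00\x0f\x05"
    total_size = HEADER_SIZE + len(code)

    elf_header = struct.pack(
        "<16sHHIQQQIHHHHHH",
        b"\x7fELF\x02\x01\x01",      # magic, 64-bit, little-endian, version 1
        2,                           # e_type: executable
        0x3E,                        # e_machine: x86-64
        1,                           # e_version
        LOAD_ADDRESS + HEADER_SIZE,  # e_entry: code starts right after the headers
        64,                          # e_phoff: program header follows the ELF header
        0,                           # e_shoff: no section headers
        0,                           # e_flags
        64,                          # e_ehsize
        56,                          # e_phentsize
        1,                           # e_phnum
        0, 0, 0,                     # section header size, count, string index
    )
    program_header = struct.pack(
        "<IIQQQQQQ",
        1,             # p_type: PT_LOAD
        5,             # p_flags: readable + executable
        0,             # p_offset: map the file from its first byte
        LOAD_ADDRESS,  # p_vaddr
        LOAD_ADDRESS,  # p_paddr
        total_size,    # p_filesz
        total_size,    # p_memsz
        0x1000,        # p_align
    )
    return elf_header + program_header + code


def write_executable(path, exit_code=42):
    """Write the bytes to disk and mark the file executable."""
    with open(path, "wb") as f:
        f.write(build_elf(exit_code))
    os.chmod(path, 0o755)
```

On a machine matching those assumptions, write_executable("tiny") followed by
running ./tiny should exit with status 42. The whole file is 132 bytes, all of
them produced by Python.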
As to Python being slower than C: once a program has been transformed into
executable code, it doesn't matter how slow the process for translating it
was. The running program is now just executable instructions for your CPU.
Only the compiler was written in Python, and by the time your program is
running, the original Python has effectively vanished and all you're left
with is your program executing.
In reality, PyPy takes a hybrid approach, where it is a program which
produces a program and then does some stuff to it and creates some C code
which it compiles with the compiler of your choice and then creates some
code which then creates other code and then puts it into memory, not a file,
and then executes it directly, but all of that is ancillary tricks and
techniques to make your code run faster, not a fundamental property of the
kind of thing that PyPy is. Plus, as I said,
this article isn't actually about how PyPy works anyway, it's just about how
you should pretend it works. So you should ignore this whole
paragraph.
For the sake of argument, assume that you know all the ins and outs of
binary executable formats for different operating systems, and the machine
code for various CPU architectures. The question you should really ask
yourself is: if you have to write a program (a compiler) which translates
one kind of string (source code) into another kind of string (a compiled
program), would you rather write it in C or Python? What if the
strings in question were a template document and an HTML page?
It shouldn't be surprising that PyPy is written in Python. For the
same reasons that you might use Django templates and not snprintf for
generating your HTML, it's easier to use Python than C to generate compiled
code. This is why PyPy is at the forefront of so many advanced
techniques that are too sophisticated to cover in a quick article like this.
Because the compiler is written in a higher-level language, it can do more
advanced things: lower-level concerns can be abstracted away, just as they
are in your own applications.
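To push the template analogy one step further, here is a small sketch of
generating C source from Python with the standard library's string.Template,
much as you might render an HTML page. This is an illustration only, not
PyPy's actual code generator; the template C_FUNCTION and the function
polynomial_to_c are made up for the example.

```python
from string import Template

# A "template document" whose output happens to be C instead of HTML.
C_FUNCTION = Template("""\
long $name(long x) {
    return $body;
}
""")


def polynomial_to_c(name, coefficients):
    """Emit C source for a polynomial in x, highest-order coefficient first."""
    # Horner's rule: 2x^2 + 3x + 4 becomes ((2) * x + 3) * x + 4.
    body = str(coefficients[0])
    for coefficient in coefficients[1:]:
        body = f"({body}) * x + {coefficient}"
    return C_FUNCTION.substitute(name=name, body=body)
```

Printing polynomial_to_c("poly", [2, 3, 4]) gives a three-line C function you
could hand to any C compiler. Doing the same string assembly in C itself means
managing buffers and lengths by hand, which is exactly the snprintf problem.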