Simple Made Variadic

Last night I made a snarky tweet about how Clojure is doomed.  Out of context, it didn't really make a lot of sense, and Ivan Krstić replied asking what the heck I was talking about.  I tried to fit the following into a tweet but it kinda broke the tweeterizer I was using right in half, and so I had to put it here.

I love a good snark as much as the next person - some might say more - but it really bothers me when people make snide comments denigrating others' free work without at least offering a cogent criticism to go with it, and I don't want to be that guy.  So, hopefully before the whole Clojure community finds said tweet and writes me off as an arrogant Python bigot, I would like to explain what I meant in a bit more detail.

Right off the bat I should say that this was a bit tongue-in-cheek.  I actually rather like Clojure and I think Rich Hickey has some very compelling ideas about programming in general.  I watch his talk "Simple Made Easy" once every month or two, contemplating its deeper meaning, and I still usually come away with an insight or two.

I should also make it clear that I was linking to the recur special form specifically, and not just the special forms documentation in general.  Obviously having reference docs isn't a bad thing.

Ivan, (or should I say, "@radian"?) you may be right; that documentation you linked to may indeed one day spell Python's doom.  If Python does eventually start to suck, it will be because it collapsed under the weight of its own weird edge cases like the slightly-special behavior of operator dispatch as compared to method dispatch, all the funkiness of descriptors, context managers, decorators, metaclasses, et cetera.

A portion of my point was serious, though.  The documentation for recur does highlight some problems with Clojure, to which the Python docs provide an interesting counterpoint.

The presence of the recur form at all is an indication of the unhealthy level of obsession that all LISPs have with recursion.  We get it: functions are cool. You can call them. They can call themselves. Every other genre of language manages to use this with a reasonable degree of moderation, without adding extra syntactic features to its core just so you can recurse forever without worrying about stack resources. Reading this particular snippet of documentation, I can almost hear Rich Hickey cackling as he wrote it, having just crowned himself God-Emperor of the smug LISP weenies, as he gleefully points out that Scheme got it wrong and the CL specifications got it wrong with respect to tail call elimination, and that it should be supported by the language but also be explicit and compiler-verified.

The sad thing is, my hypothetical caricature of Clojure's inventor is actually right! This is a uniquely clever solution to a particularly thorny conceptual problem with the recursive expression of algorithms. The Scheme folks and the Common Lisp folks did both kinda get it wrong.  But the fact that this has to be so front-and-center in the language is a problem. Most algorithms shouldn't be expressed recursively; it is actually quite tricky to communicate about recursive code, and most systems that really benefit from it have to be mutually, dynamically reentrant anyway, and won't be helped by tail call elimination.  (My favorite example of this is still the unholy shenanigans that Deferreds have to get up to to make sure you don't have to care about callback chain length or Deferred return nesting depth.)
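To put the constraint recur is designed around in Python terms: CPython does no tail call elimination either, so even a tail-shaped recursive function eats a stack frame per call, while the equivalent loop does not.  A minimal sketch (function names are mine, for illustration):

```python
import sys

def sum_recursive(n, acc=0):
    # Tail-shaped in form only: CPython still allocates a new
    # stack frame for every call, so deep inputs blow the stack.
    if n == 0:
        return acc
    return sum_recursive(n - 1, acc + n)

def sum_iterative(n):
    # The loop-shaped rewrite uses constant stack space.
    acc = 0
    while n > 0:
        acc += n
        n -= 1
    return acc

# The iterative version handles inputs the recursive one cannot.
big = sys.getrecursionlimit() * 2
try:
    sum_recursive(big)
    blew_up = False
except RecursionError:
    blew_up = True
```

recur's trick is to make that loop-shaped rewrite a compiler-verified property of the recursive form itself, rather than something the programmer does by hand.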

Also, if you want to be all automatically parallelizable and web scale and "cloud"-y, recursion and iteration are both the wrong way to do it; they're both just ways of tediously making your way down a list of things one element at a time.  What you want to do is to declaratively apply a computation to your data in such a way as to avoid saying anything about the order things have to happen in.  To put it more LISPily, (map) is a better conceptual foundation for the future than (loop) or (apply).  Of course you can do the naive implementation of (map) with (recur), but smarter implementations need application code to be written some other way.
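In Python terms, the same contrast looks like this; the map version says nothing about the order elements are visited in, which is exactly what leaves room for a smarter runtime to parallelize it:

```python
data = [1, 2, 3, 4]

# Iterative: the order of evaluation is baked into the code.
squares_loop = []
for x in data:
    squares_loop.append(x * x)

# Declarative: map only says "apply this to each element",
# leaving the schedule up to the implementation.
squares_map = list(map(lambda x: x * x, data))
```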

The manual's stylistic choices in this case are also telling. The Python docs that Ivan linked to go into excruciating detail, rephrasing and explaining the same concept in a few different ways, linking to other required concepts in depth so the reader can easily familiarize themselves with any prerequisites, while still essentially explaining a nerdy part of the language that you can ignore while still using it productively. Every Python programmer ignores descriptors while they're learning to write classes and methods, despite using them all the time; this ability to be understood at different levels of complexity is a strength of every good language, and Python does particularly well in that regard. Of course, one could also make the case that this is just because Python has so many dusty corners hidden behind double underscores, and that a better language would simply have less obscure junk in it rather than make understanding the obscure junk optional, but I digress.
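For the curious, here is a minimal sketch of what those ignorable descriptors actually are; the Celsius and Thermostat names are mine, purely for illustration:

```python
class Celsius:
    """A minimal data descriptor: an object whose __get__/__set__
    methods intercept attribute access on instances of its owner."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        # Normalize every assignment to a float.
        setattr(obj, self.name, float(value))

class Thermostat:
    temperature = Celsius()

t = Thermostat()
t.temperature = 21   # routed through Celsius.__set__
```

Plain methods work the same way under the hood, which is why every Python programmer is "using descriptors" long before reading about them.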

The description of recur, by contrast, is deeply flawed.  It is terse to a fault. It introduces the concept of "recursion points" without linking to any kind of general reference.  It uses abbreviations all over the place ("exprs", "params", "args", "seq") without even using typesetting to indicate whether each is a general term or a specific variable name.

But, by far, the worst sin of this document is the use of the words "variadic" and "arity". There is really no excuse for the use of these words, ever. Take it from me, I am exactly the kind of pedantic jerk who will drop "arity" into a conversation about, for example, music theory, just to demonstrate that I can, and as that kind of jerk I can tell you with certainty: I have no excuse.

It should say: "a function that takes a variable number of arguments". Or possibly: "it must take exactly the same number of arguments as the recursion point".
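In Python, either phrasing maps onto something every beginner already writes; "variadic" just means this:

```python
def total(*args):
    # A "variadic" function: it takes a variable number of arguments.
    return sum(args)
```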

This was a particularly disappointing example to me because Clojure strikes me as a particularly... for lack of a better word, "optimistic" lisp, one that looks forward rather than back, one that is more interested in helping people find comprehensible and composable ways to express programs than in revisiting obscure debates over reader macros or lisp-1/lisp-2 holy wars.  But the tone of the documentation for recur aims it straight at the smug lisp weenie crowd.

As I hope is obvious, if not initially, then at least by now, I don't think that Clojure will fail (or succeed) on the merits of one crummy piece of documentation. It's a much younger language than Python, so it may have a ways to go in its documentation practices. It also comes from an illustrious heritage, which I can't expect to be entirely absent from the way it talks about itself, no matter how unfortunate certain details of that heritage are.  Heck, at the parallel point in Python's lifetime, it didn't even have descriptors yet, let alone documentation for them!

Still, I don't think that this issue is entirely trivial, and I hope that the maintainers for the documentation for Clojure, at least the documentation for the parts of the language you have to see every day, take care to improve its accessibility to the less arcane among us.

We'll Always Have Cambridge

Halfway through 2012, I will be leaving the east coast.

There are a great many things I despair of leaving behind: family, friends, the most excellent Boston Python Meetup, and participating in the sometimes incendiary, sometimes hilarious Cambridge, Massachusetts / Cambridge, England rap war.

However, I'm not writing today in order to wax lyrical about the area, or to extoll the virtues of my new home, but hopefully, to prevent a missed opportunity.  I know there are at least a few really cool people in Massachusetts who read this blog, and who read my tweets, that I either haven't seen in quite a while or have never actually met in person.

So if grabbing a coffee with me is an interesting idea to you, please drop me a line within the next month. I would love to hear your story about how PHP ruined your summer, or how Twisted changed your life, or how you once pwned a vending machine with nothing but a malformed JPEG.

I'm sure I'll visit the area from time to time, so this isn't quite your last chance, but it just won't be the same, you know?

If I follow you, you can DM me on Twitter of course, but my email address isn't hard to figure out either.  If you glance up towards the top of your browser window right now, you're practically looking at it.

This Isn't How PyPy Works, But it Might as Well Be

It seems like a lot of the Python programmers I speak with are deeply confused by PyPy and can't understand how it works.  The stereotypical interlocutor will often say things like: A Python VM in Python?  That's just crazy!  How can that be fast?  Isn't Python slower than C?  Aren't all compilers written in C?  How does it make an executable?

I am not going to describe to you how PyPy actually works.  Lucky for you, I'm not smart enough to do that.  But I would like to help you all understand how PyPy could work, and hopefully demystify the whole idea.

The people who are smart enough to explain how PyPy actually works will do it over at the PyPy blog.  At some level it's really quite straightforward, but this impression of straightforwardness is not conveyed well by posts with titles like "Optimizing Traces of the Flow Graph Language".  In addition to being a Python interpreter in Python, PyPy is a mind-blowingly advanced exploration of the cutting-est cutting-edge compiler and runtime technology, which can make it seem complex. Indeed, the fact that it's in Python is what lets it be so cutting-edge.

Most people with a formal computer science background are already familiar with the fairly generic nature of compilers, as well as the concept of a self-hosting compiler.  If you do have that background, then that's all PyPy is: a self-hosting compiler.  The same way GCC is written in C, PyPy is written in Python.  When you strip away the advanced techniques, that's all that's there.

A lot of folks who are confused by PyPy's existence, though, I suspect don't have that background; many working programmers these days don't.  Or, if they do, they've forgotten it, because the practical implications of the CSS box model are so complex that they squeeze simpler ideas, like Turing completeness and the halting problem, out of the average human brain.  So here's the easier explanation.

A compiler is a program that turns a string (source code: your program text written in Python, C, Ruby, Java, or whatever) into some kind of executable code (bytecode or runtime interpreter operations or a platform-native executable).
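Python even exposes that pipeline directly: the built-in compile() turns a source string into a bytecode object, and exec() runs it.  A minimal demonstration:

```python
source = "result = 6 * 7"                    # a program, as a string
code = compile(source, "<example>", "exec")  # string -> bytecode object
namespace = {}
exec(code, namespace)                        # run the compiled code
```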

Let's examine that last one, since it seems to be a sticking point for most folks.  A platform-native executable is simply a bunch of bytes in a file. There's nothing magic about it.  It's not even a particularly complex type of file.  It's a packed binary file, not a text file, but so are PNGs and JPEGs, and few programmers find it difficult to believe that such files might be created by Python.  The formats are standard and very long-lived and there are tons of tools to work with them.  If you're curious, even Wikipedia has a good reference for the formats used by each popular platform.
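To make the "just a bunch of bytes" point concrete, here is a sketch that writes and reads a packed binary file with Python's struct module, using the well-known eight-byte PNG signature as the example magic number (the field after it is made up):

```python
import os
import struct
import tempfile

# Every PNG file begins with this fixed 8-byte signature; packed
# binary formats are just bytes laid out to a published spec.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

path = os.path.join(tempfile.mkdtemp(), "header.bin")
with open(path, "wb") as f:
    f.write(PNG_SIGNATURE)
    f.write(struct.pack("<I", 1234))  # a little-endian 32-bit field

with open(path, "rb") as f:
    data = f.read()
```

Executable formats like ELF or Mach-O are the same kind of thing, just with more fields.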

As to Python being slower than C: once a program has been transformed into executable code, it doesn't matter how slow the process of translating it was.  The running program is now just executable instructions for your CPU.  Only the compiler was in Python, and by the time your program is running, the original Python has effectively vanished; all you're left with is your program, executing.

(Actually, Python is faster than C anyway, especially at producing strings.)

In reality, PyPy takes a hybrid approach, where it is a program which produces a program and then does some stuff to it and creates some C code which it compiles with the compiler of your choice and then creates some code which then creates other code and then puts it into memory, not a file, and then executes it directly, but all of that is ancillary tricks and techniques to make your code run faster, not a fundamental property of the kind of thing that PyPy is.  Plus, as I said, this article isn't actually about how PyPy works anyway, it's just about how you should pretend it works.  So you should ignore this whole paragraph.

For the sake of argument, assume that you know all the ins and outs of binary executable formats for different operating systems, and the machine code for various CPU architectures.  The question you should really ask yourself is: if you have to write a program (a compiler) which translates one kind of string (source code) into another kind of string (a compiled program): would you rather write it in C or Python?  What if the strings in question were a template document and an HTML page?

It shouldn't be surprising that PyPy is written in Python.  For the same reasons that you might use Django templates and not snprintf for generating your HTML, it's easier to use Python than C to generate compiled code.  This is why PyPy is at the forefront of so many advanced techniques that are too sophisticated to cover in a quick article like this.  Since the compiler is written in a higher-level language, it can do more advanced things, since lower-level concerns can be abstracted away, just as they are in your own applications.
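A toy sketch of that analogy, with made-up names: emitting C source from Python is the same kind of string-building as emitting HTML from a template.

```python
def make_c_adder(name, increment):
    # Hypothetical toy code generator: produces C source as a string,
    # exactly the way a template engine produces HTML.
    return (
        "int %s(int x) {\n"
        "    return x + %d;\n"
        "}\n" % (name, increment)
    )

c_source = make_c_adder("add_five", 5)
```

A real compiler's output is structured far more carefully than this, of course, but the higher-level language makes every layer of that structure easier to build.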

The Concurrency Spectrum: from Callbacks to Coroutines to Craziness


Concurrent programming idioms are on a spectrum of complexity.

Obviously, writing code that isn't concurrent in any way is the easiest.  If you never introduce any concurrent tasks, you never have to debug any problems with things running in an unexpected order.  But, in today's connected world, concurrency of some sort is usually a requirement.  Each additional point where concurrency can happen introduces a bit of cognitive overhead, another place you need to think about what might happen, so as a codebase adds more of them it becomes more difficult to understand them all, and it becomes more challenging to understand subtle nuances of parallel execution.

So, at the simplest end of the spectrum, you have callback-based concurrency.  Every time you have to proceed to the next step of a concurrent operation, you have to create a new function and new scope, and pass it to the operation so that the appropriate function will be called when the operation completes.  This is very explicit and reasonably straightforward to debug and test, but it can be tedious and overly verbose, especially in Python where you have to think up a new function name and argument list for every step.  The extra lines for the function definition and return statement can be an impediment to quickly understanding the code's intentions, so what facilitates understanding of the concurrency model can inhibit understanding of the code's actual logical purpose, depending on how much concurrent stuff it has to do.  Twisted's Deferreds make this a bit easier than raw callback-passing without fundamentally changing the execution dynamic, so they're at this same level.
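A minimal sketch of the callback style, with hypothetical names (this is not Twisted's API, just the shape of the idiom):

```python
def fetch_user(user_id, callback):
    # Stand-in for an asynchronous operation: the result is delivered
    # by calling `callback`, not by returning a value.
    callback({"id": user_id, "name": "alice"})

results = []

def on_user(user):
    # Every step of the flow needs its own named function and scope.
    results.append(user["name"])

fetch_user(42, on_user)
```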

Then you have explicit concurrency, where every possible switch-point has to be labeled somehow.  This is yield-based coroutines, or inlineCallbacks, in Twisted.  This is more compact than using callbacks, but also more limiting.  For example, you can only resume a generator once, whereas you can run a callback multiple times.  However, for a logical flow of sequential concurrent steps, it reads very naturally, and is shorter, as it collapses out the 'def' and 'return' lines, and you have to think of at least two fewer names per step.

However, that very ease can be misleading.  You might gloss over a 'result = yield ...' more easily than a 'def whatever(result): return result; something(whatever)'.  Nevertheless, if you have 'yield's everywhere you might swap your stack, then when you have a concurrency bug, you can look at any given arbitrary chunk of code and know that you don't need any locks in it, as long as you can't see any yield statements.  Where you do see yield statements, you know that you have some code that needs to be inspected.
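Under the hood, a yield-based style needs a driver that resumes the generator at each switch point.  Here is a toy trampoline in the spirit of inlineCallbacks (not its actual implementation; the names are mine):

```python
def drive(gen):
    # Each `yield` hands an "operation" back to this driver, which
    # performs it and resumes the generator with the result.
    try:
        operation = next(gen)
        while True:
            result = operation()            # perform the yielded operation
            operation = gen.send(result)    # resume at the yield point
    except StopIteration as stop:
        return stop.value                   # the generator's return value

def workflow():
    a = yield (lambda: 20)   # explicit switch point
    b = yield (lambda: 22)   # another explicit switch point
    return a + b
```

Every place the stack might be swapped out is marked by a yield, which is exactly the auditability property described above.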

To continue down that spectrum, a cooperatively multithreading program with implicit context switches makes every line with any function call on it (or any line which might be a function call, like any operator which can be overridden by a special method) a possible, but not likely culprit.  Now when you have a concurrency bug you have to audit absolutely every line of code you've got, although you still have a few clues which will help you narrow it down and rule out certain areas of the code.  For example, you can guess that it would be pathological for 'x = []; ...; x.append(y)' to context switch. (Although, given arbitrary introspection craziness, it is still possible, depending on what "..." is.)  This is way more lines than you have to consider with yield, although with some discipline it can be kept manageable.  However, experience has taught me that "with some discipline" is a code phrase for "almost never, on real-life programming projects".

All the way at the end of the spectrum of course you have preemptive multithreading, where every line of code is a mind-destroying death-trap hiding every possible concurrency peril you could imagine, and anything could happen at any time.  When you encounter a concurrency bug you have to give up and just try to drink your sorrows away.  Or just change random stuff in your 'settings.py' until it starts working, or something.  I never really did get comfortable in that style.  With some discipline, you can manage this problem by never manipulating shared state, and only transferring data via safe queueing mechanisms, but... there's that phrase again.
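A sketch of the queue discipline that keeps preemptive threads survivable, using only the standard library (the worker/sentinel structure is mine, for illustration):

```python
import queue
import threading

# The workable discipline: share no mutable state between threads;
# transfer data through a thread-safe queue instead.
q = queue.Queue()
results = []

def worker():
    total = 0
    while True:
        item = q.get()
        if item is None:   # sentinel value: shut down cleanly
            break
        total += item      # state local to this thread only
    results.append(total)

t = threading.Thread(target=worker)
t.start()
for i in range(1, 11):
    q.put(i)
q.put(None)
t.join()
```

The only touch point between the threads is the queue itself, so the "audit every line" problem collapses back down to auditing the handful of put/get calls.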

Some programming languages, like Erlang, support efficient preemptive processes with state isolation and built-in super-cheap super-fast queues to transfer immutable values.  (Some other languages call these "threads" anyway, even though I would agree with Erlang's classification as "processes".)  That's a different programming model entirely though, with its own advantages and challenges, which doesn't land neatly on this spectrum; if I'm talking about left and right here, Erlang and friends are somewhere above or below.  I'm just describing Python and its ilk, where threads give you a big pile of shared, mutable state, and you are constantly tempted to splash said state all over your program.

Personally I like Twisted's style best; the thing that you yield is itself an object whose state can be inspected, and you can write callback-based or yield-based code as each specific context merits.  My opinion on this has shifted over time, but currently I find that it's best to have a core which is written in the super-explicit callback-based approach with no coroutines at all, and then high-level application logic which wraps that core using yield-based coroutines (@inlineCallbacks, for Twisted fans).

I hope that in a future post, I may explain why, but that would take more words than I've got in me tonight.

I'm Sorry It's Come To This

If you want to be a great leader,
you must learn to follow the Tao.
Stop trying to control.
Let go of fixed plans and concepts,
and the world will govern itself.


I usually try not to get too political in my public persona – on blogs, Twitter, IRC, mailing lists, et cetera – and that's a conscious choice.

I work on open source software.  I have for the last ten years.  I am lucky enough to have founded a project of my own, but in open source, leaders are more beholden to their followers than vice versa.  I depend on people showing up to effectively work for me, for free, on a regular basis.  So, I try to avoid politics not because I don't have strong convictions (anyone who knows me personally can tell you that I certainly do) but because I don't want someone to avoid showing up and helping do some good in the world in one area, just because we might disagree in another.

This is a benefit of living in a free and democratic society: we have ways to dispute issues that we have strong feelings about, so we can cooperate on some things without having to agree on everything.  It's rarely perfect but we can usually get some good stuff done, with rough consensus and running code.

Today, though, there's a political issue which I can't ignore.  The purpose of Twisted (the open source project which I founded) is to facilitate the transfer of information across the Internet.  A new bill, SOPA, threatens to radically alter the legal infrastructure of the Internet in the United States, granting sweeping new powers to copyright cartels and fundamentally restricting the legal right to transfer any information, and to build tools that transfer it.  Twisted is designed to make it easy to implement new protocols, to easily experiment with improvements to systems like the Domain Name System.  SOPA might well make those potential improvements, and with only a little paranoid fantasizing, Twisted itself, illegal.

It's my view that this bill is a blatantly unconstitutional restriction on free speech.  It will kill job creation, at a time when our nation can scarcely afford another blow to its economy.  It will create the infrastructure to suppress political dissent, similar to the infrastructure in China and Syria, at a time when our corrupt political system needs dissent more than ever.  It is the wrong thing at the wrong time.

This bill is being discussed in the House today.  If you're in the US, call your representative right now.

(As always, I don't speak for anyone but myself; no one else has reviewed or endorsed these remarks.)