Some Common Onomatological Errors

The open-source event-driven networking engine that I work on is called "Twisted".  If you're uncomfortable using something that sounds like an adjective in a place where a noun should go, the following noun phrases are equivalent:
  1. the Twisted project
  2. the Twisted engine
  3. the Twisted networking engine
  4. the Twisted framework
The unofficial group (of which I am a member) which works on that software is known as "Twisted Matrix Laboratories", sometimes shortened to "Twisted Matrix Labs" or "TMLabs".

I can understand that there is some confusion around this stuff, since these words often appear in close proximity, but to my knowledge there is nothing called "Python Twisted", "Twisted Python", or "Twisted Matrix".  There's "python-twisted", which is the package name that some operating systems use to package Twisted.  There is also "twisted.python", which is a Python package within Twisted itself.  Finally there is "twisted-python@twistedmatrix.com", which is the mailing list for discussing Twisted stuff in the Python programming language.  (This discussion list is so named to distinguish it from the possibility of not-quite-hypothetical discussion of Twisted implemented in other languages, although no other implementations are currently actively maintained.)

I just thought you'd all like to know that.  That is all.  (For now, anyway.)

Learn Twisted

Jean-Paul Calderone continues his excellent "Twisted Web In 60 Seconds" tutorial series.  If you haven't checked it out yet, you should!

Do you want WiFi to work at your conference?

I've been pretty busy for the last couple of weeks, so I've just had an opportunity to catch up with blog posts that have been piling up.  In particular I noticed this one: The “WiFi At Conferences” Problem, by Joel Spolsky.

Joel has a lot of what look like good recommendations.  However, I can provide a much-abridged list.

Some years, WiFi access at PyCon US has been provided by the venue, or by a contractor whose name I mercifully do not know.  Those years, it has not worked.  Some years, it has been provided, or at least managed, by tummy.com.  Those years, it has worked.  They are probably much more critical of their own efforts than I am, as you can see in this thorough write-up that they did of PyCon's 2008 WiFi situation.

My two-step plan for you, if you want working WiFi access at your conference, is:
  1. e-mail somebody at tummy.com, telling them that you want a working wireless network, and
  2. give them whatever they ask for.
If you do these things, then when people open their laptops at your conference, their networks will work.

Hobgoblin History

I like Terry Jones; I think FluidDB has a lot of potential.  But, sometimes when he's talking about it, he gets a little carried away and forgets that the rest of us don't live in his future yet.  In his latest missive on the official FluidDB blog, "Digital Hobgoblins", he describes some of the problems that FluidDB sets out to solve.

The problem is, I already have solutions for all of these problems, and I don't quite understand why they don't (or shouldn't) work for me.  (Since he organizes the post in terms of problems that existing systems have, I'm going to take the liberty of re-labeling these in terms of the problems that he seems to be describing rather than the lead text he used.  Please post a comment if you think my labeling is wrong.)

In existing systems, Terry says:

"Things must be named, and have one name."  Specifically, Terry calls out file systems.  Except... file systems have lots of ways of introducing multiple names for the same thing.  Symbolic links.  Hard links, if you really want to allow for ambiguity.  If you want to track that ambiguity, Windows "shortcuts" and MacOS "aliases" can do that.  Overlay mounts, loopback mounts and chroot execution allow for semi-arbitrary renaming.  Lots of other systems support this, too.  Database systems have a specific provision for multiple names: the many-to-one relation.  Any programming language with pass-by-reference data structures allows for some level of multiple-naming.  In fact, there's a whole discipline for allowing things to have lots of different names: indexing.  Anywhere you have a full-text index or an object where multiple attributes are indexed in some kind of database, you've got objects with more than one name.

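To make the file-system case concrete, here is a minimal Python sketch that gives one file three names: its original path, a hard link, and a symbolic link.  The file names are invented for illustration, and it assumes a POSIX-ish platform where an unprivileged user can create both kinds of link.

```python
import os
import tempfile


def make_aliases(directory):
    """Create one file plus two extra names for it; return all three paths."""
    original = os.path.join(directory, "report.txt")
    hard = os.path.join(directory, "report-hardlink.txt")
    soft = os.path.join(directory, "report-symlink.txt")
    with open(original, "w") as f:
        f.write("quarterly numbers\n")
    os.link(original, hard)     # a second directory entry for the same file
    os.symlink(original, soft)  # a name that points at another name
    return original, hard, soft


with tempfile.TemporaryDirectory() as tmp:
    original, hard, soft = make_aliases(tmp)
    # samefile() follows symlinks, so all three names resolve to one file.
    same_file = os.path.samefile(original, hard) and os.path.samefile(original, soft)
    # The hard link shows up in the file's link count; the symlink does not.
    link_count = os.stat(original).st_nlink
    print(same_file, link_count)
```

On a typical POSIX file system this should report that all three names refer to the same file, and that the file has two hard links.
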
"You have to be consistent and unambiguous."  As I mentioned on the first point, there are lots of ways to be slightly ambiguous at a human level.  You can refer to the same thing by different names, or, with mutable binding, you can refer to the different things with the same name.  In some circumstances, you must be precise, but that's because fundamentally, algorithmic thinking requries a certain level of precision, not because of any specific problem with computers.  In fact, there is a word for inconsistency and ambiguity in programming languages: polymorphism.  Any time you invoke an interface rather than a concrete implementation (which is to say any time you do anything in a dynamic language like Python) you are being ambiguous and potentially inconsistent in your program's behavior.

"You only get one way to organize stuff."  This is a pretty weak point, though, given that Terry himself immediately turns around and notices that tagging and other multiply-indexed database systems are becoming popular.  So he gives us two examples of exceptions, but no examples of the rule.  I'm not sure what I could add to that.

"Programmers are obsessed with "meaning"."  On this one, I'm going to agree, except I don't think it's a problem.  In the computational world, we are obsessed with the meaning of data, because if you get the meaning of the inputs wrong, then the meaning of the outputs is wrong too.  For example: if you have a number that represents the total liabilities that your company has accumulated, it's pretty important that you don't ever treat that as your total profit.  At a deeper level, if you have a sequence of bits that represents a floating-point number, it's important to know about its intended meaning, and not treat it as a string of characters, unless what you really want is a string.  "@H=N" is not as useful a concept as "3.1287417411804199" if you are trying to add it to something.  For what it's worth, I have my own, similar take on how we should treat computational objects that have multiple meanings: Imaginary. Even systems like Imaginary and FluidDB depend on a very rigid definition of some simpler concepts, like numbers consistently being numbers and words consistently being words.  In my view, even if we treat the book itself as multifaceted, it's important to know what the data representing the "readable object" part of a book is really "about", and make sure it stays distinct from the data representing the "paperweight" part of the book.  To be fair, FluidDB appears to do this itself — and this terminology is my least-favorite part of FluidDB — by having single-purpose, permission-controlled "objects" just like every other system, but calling them "tags", and re-using the word "objects" to refer instead to what others might call a "UUID" or "central index".  In Imaginary, the system is similar; although the centrality of the FluidDB "object" (in Imaginary's case, the "Thing") is less stark; using FluidDB's terminology, in Imaginary, a "tag" can have a "tag" of its own; in fact, there's nothing but tags ("Items") anywhere.

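The "@H=N" example can be reproduced with Python's struct module.  This is only a sketch: the byte order is my assumption (the four bytes are read as a big-endian IEEE 754 single-precision float), and the variable names are made up for illustration.

```python
import struct

raw = b"@H=N"

# One meaning: four ASCII characters.
as_text = raw.decode("ascii")

# Another meaning: a big-endian, single-precision floating-point number.
as_float = struct.unpack(">f", raw)[0]

# Only one of these meanings is useful if you want to do arithmetic.
print(repr(as_text), as_float, as_float + 1.0)
```

The bits are identical in both cases; only the declared meaning differs, and adding 1.0 only makes sense for one of them.
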
"Metadata is separated from the data it describes."  This may be true in some systems, but the web is probably the system with the most data in it anywhere, and in that system, metadata is always available as part of the request and the response.  You can put in any headers you want in the response, and there are lots of pieces of metadata (like content-type) which are almost always found along with the data.  In my opinion, the problem is more that we don't have enough of the previous problem.  Web developers haven't been obsessed enough with meaning: there aren't enough useful conventions around the HTTP request/response metadata, and so it's hard to bundle more metadata in with your response and have it faithfully propagated elsewhere.  We don't know what arbitrary headers might mean, because we don't have any way of expressing a schema for them.

Terry says he's going to write more about these problems, and the solutions that FluidDB provides for them.  I'm looking forward to it.  As part of that, I'd really like to see a clear description of how these problems affect me, or someone I know, either as a programmer or as a user.  What do I, or should I, really want to do with some application right now that these five problems are preventing me from doing?

The reason I felt compelled to write about this is that history — and particularly the history of websites like freshmeat and sourceforge — is littered with the corpses of projects which promised to fundamentally change the way we represent data.  A common problem with these projects is that they have expansive denunciations of current techniques to represent data, or manage persistence, and claim to provide an advance so significant that they will displace all current applications.  What most of the people working on these projects don't realize is that the current techniques for representing data have a history, and there are good reasons for their limitations.  Granted, not all of those reasons are currently relevant, and many are examples of path dependence, but it's still important to understand the reasons in order to escape the problems.

In FluidDB's case, I think that the problem isn't so much that Terry doesn't have the historical perspective, but that he assumes that we all do.  And that we can all make the cognitive leap to see why FluidDB is necessary.  But if I can't do it, I have to assume there are at least a few other programmers who aren't getting the message either.

Diesel: A Case Study In That Thing I Just Said

Thanks to jamwt for the shout-out on the announcement of Diesel.

Since the reaction to my reaction to Tornado was so good (or at least so ... energetic), I figure I should comment on Diesel as well.  Spoiler alert: my reaction is ... largely similar, but since jamwt has been kind of nice to Twisted in the past, and didn't actually say anything mean this time, I'm somewhat reluctant to have that reaction.  Nevertheless, I swore a solemn oath to tell it like it is, keep it real, and so forth.  So I must.

Once again, I'm happy that event-driven programming is getting some love.  This time, I'm pleased that nobody is saying anything especially snarky or FUD-ish about Twisted.  I do feel like it's a little weird not to mention Twisted, or include some comparisons to Nevow or Orbited, both of which provide different, comprehensive approaches to COMET with Twisted.

(Worth noting: Orbited also originally started out using its own event-driven I/O layer, but switched to Twisted later, because Twisted is "crazy delicious".)

Diesel has many more interesting ideas at the level of async I/O than Tornado did.  I think the generator-based approach for implementing protocols is interesting and deserves some more exploration.  I'm not sold on it for every use-case, and I think the implementation might have some flaws, but it definitely has some advantages.

I'd give jamwt a hard time for not reporting issues and communicating with Twisted more before re-writing the core, but for three issues:
  1. jamwt's been around in the Twisted community for a while.  He's written a bunch of fairly deep Twisted code and he clearly knows what the framework is capable of.
  2. I've spoken with him on a number of occasions, and for all I know I might have discussed this with him.  I don't remember it, but it would be pretty embarrassing to write a big rant about how nobody talks to us only to have him paste some chat log where he explained why he was writing Diesel six months ago, and I said "oh, okay" ;-).
  3. Nobody is calling Twisted names or making vague, unsubstantiated accusations.  You're not obligated to examine Twisted, nor Nevow, nor Orbited, I just feel that you owe us some explanation if you publicly say that you tried it and found it wanting.  The tone on the Diesel announcement, in its one brief mention of Twisted, is "we tried it, but we kinda wanted to do our own thing".  So, good for them, they did their own thing, I hope they had fun.
Now, personally, I'd like to leave it at that, but there is a certain inevitable comparison that I think is going to take place.  Diesel has a nicer web page than Twisted.  They have entwittered ... twitified ... uh ... tweetened ... the project, and we haven't; we just have an old-fashioned "blog".  Diesel is smaller than Twisted, so it's easier to explain, and so the people approaching it will have a better idea of its scope.  This might give the immediate impression that it is a simpler, better, more "modern" replacement for Twisted's I/O layer, and this is not the case.  So I still feel it's important that I set the record straight.

Before I launch into my critique, I should say that I don't want to harsh on Diesel too bad. It's a neat little hack and you should go play with it.  And I feel bad pointing out problems with it, since as I mentioned above, nobody's dumping on Twisted.  So, Diesel fans, please take this in the spirit of a frank code-review, not a complaint about your behavior.

The interesting generator-munging bits could be easily adapted to run on top of Twisted's loop, which, arguably, they should have been in the first place; and the toy "hub" that they've written might be good enough for some simple applications where reliability under load is not a serious concern.  In fact, inlineCallbacks might provide a good deal of what is needed to support Diesel's programming style.  Alternately, Diesel might provide some hints as to how things like inlineCallbacks could be made more efficient.
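
To illustrate why generator-based protocol code is independent of the loop underneath it, here is a rough, pure-Python sketch.  It is not Diesel's API, nor inlineCallbacks; LineDriver and shouting_echo are hypothetical names invented for this example.  The generator holds the protocol logic, and a small driver resumes it whenever some event loop delivers bytes.

```python
def shouting_echo():
    """Protocol logic as straight-line code.

    Each `yield` hands back a reply to write (or None) and then waits
    for the next complete line to arrive.
    """
    reply = None
    while True:
        line = yield reply
        if line == b"quit":
            return
        reply = line.upper()


class LineDriver:
    """Resume a generator once per received line (hypothetical helper)."""

    def __init__(self, protocol):
        self._gen = protocol()
        self._buffer = b""
        self.written = []   # stands in for writing to a transport
        self.done = False
        self._resume(None)  # run the generator up to its first yield

    def _resume(self, value):
        try:
            reply = self._gen.send(value)
        except StopIteration:
            self.done = True
            return
        if reply is not None:
            self.written.append(reply)

    def data_received(self, data):
        # Called by whatever event loop owns the socket; data may arrive
        # in arbitrary chunks, so buffer it and split on newlines.
        self._buffer += data
        while not self.done and b"\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\n", 1)
            self._resume(line)


driver = LineDriver(shouting_echo)
driver.data_received(b"hel")
driver.data_received(b"lo\nworld\nqu")
driver.data_received(b"it\nignored\n")
print(driver.written, driver.done)
```

Nothing in the driver cares where the bytes come from.  In a Twisted-based version, data_received would presumably be a Protocol's dataReceived method, and the replies would go to the transport rather than a list.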

That said, Diesel's I/O loop sucks.

It's disappointing to see the same mistakes getting made over and over again.  First and foremost: no tests.  Come on, Python community!  You can do better!  Write your damn tests first!

The #1 benefit that a brand-new I/O loop project could have over Twisted is that Twisted was written in the bad old days before everybody knew that TDD was the right way to write programs, so we don't have 100% test coverage.  But, we strive to get closer every day, while every new project decides that they don't need no stinking quality control.

Predictably, as it has no tests, Diesel's I/O layer is full of dead code, inaccurate documentation, and unhandled errors.  Consider this gem, which I found about 30 seconds into reading the code: KqueueEventHub is documented to be "an epoll-based event hub", and its initializer defines an inner function which is never used.  I'm not going to belabor the point by enumerating all the typo bugs I found, but you may find the output of 'pyflakes diesel' interesting.

Instead of Tornado's inaccurate handling of EINTR, Diesel has no handling of EINTR, as far as I can tell.  It also doesn't handle EPERM, ENOBUFS, EMFILE, or even EAGAIN on accept().  To be fair, it has a catch-all exception handler all the way at the top of the stack, so none of these will cause instant crashes, but they will cause surprising behavior in odd situations (and possibly infinite traceback-spewing loops).
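
For a sense of what handling these errors looks like, here is a minimal sketch of an accept loop for a non-blocking listening socket.  It is neither Diesel's code nor Twisted's; the function name, the limit parameter, and the way I've sorted the error codes into two groups are my own assumptions, and a real implementation would also want logging.

```python
import errno
import select
import socket

# This one connection attempt failed or was interrupted; try the next one.
SKIP_THIS_ONE = {errno.EINTR, errno.EPERM, errno.ECONNABORTED}
# Nothing (more) can be accepted right now; return and wait for the next
# readiness notification.
STOP_FOR_NOW = {errno.EAGAIN, errno.EWOULDBLOCK, errno.EMFILE,
                errno.ENFILE, errno.ENOBUFS, errno.ENOMEM}


def accept_ready(listener, limit=32):
    """Accept up to `limit` pending connections without ever blocking."""
    accepted = []
    for _ in range(limit):
        try:
            conn, _addr = listener.accept()
        except OSError as e:
            if e.errno in SKIP_THIS_ONE:
                continue
            if e.errno in STOP_FOR_NOW:
                break
            raise  # anything else is a real bug or a broken listener
        conn.setblocking(False)
        accepted.append(conn)
    return accepted


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
listener.setblocking(False)

nothing_yet = accept_ready(listener)   # EAGAIN is handled, not raised

client = socket.create_connection(listener.getsockname())
select.select([listener], [], [], 5.0)  # wait until the listener is readable
connections = accept_ready(listener)
accepted_count = len(connections)
print(nothing_yet, accepted_count)

for s in connections + [client, listener]:
    s.close()
```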

More surprisingly - I had to re-read the code about five times to make sure - it doesn't appear that sockets are ever set to be non-blocking, and EAGAIN is not handled from accept(), recv(), or send().  And yes, this can happen even if your multiplexor says your socket is ready for reading and/or writing.  The conditions are somewhat obscure, but nevertheless they do happen.  So, occasionally, Diesel will hiccup and block until some slow network client manages to send or receive some traffic.  In other words: Diesel is not really async.  It just fakes it convincingly, most of the time.
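
The fix is conceptually small.  As a sketch (try_recv is a hypothetical helper, not anything from Diesel or Twisted), a socket set to non-blocking mode reports "nothing available" as an exception that the loop can treat as "try again later", instead of stalling the whole process:

```python
import socket


def try_recv(sock, size=4096):
    """Return received bytes, b"" at end-of-stream, or None if not ready."""
    try:
        return sock.recv(size)
    except (BlockingIOError, InterruptedError):
        # EAGAIN/EWOULDBLOCK or EINTR: no data right now, so go back to
        # the event loop rather than waiting here.
        return None


left, right = socket.socketpair()
left.setblocking(False)
right.setblocking(False)

before = try_recv(left)   # nothing has been sent yet
right.send(b"ping")
after = try_recv(left)
print(before, after)

left.close()
right.close()
```

With a blocking socket, the first try_recv call would simply hang until the peer sent something, which is exactly the hiccup described above.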

Once again, there's no way to asynchronously spawn a process, and no way to asynchronously connect a TCP client.  Sure, this looks like an asynchronous connect call, but it's misleading: it blocks on resolving the hostname, and it potentially blocks on the initial SYN/SYN+ACK/ACK exchange.  There's no asynchronous SSL support.  And no, that is not trivial.  Not to mention handling all the crazy errors that spew out of the Windows TCP stack.  And since the loop is implemented to be incompatible with Twisted, there's no obviously easy way to plug it in compatibly and get those features.
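
For comparison, here is a rough sketch of the two halves of a non-blocking TCP connect using only the standard socket module.  The function names are hypothetical, and it deliberately takes a numeric IP address: resolving a hostname without blocking is a separate problem that this sketch does not attempt to solve.

```python
import errno
import os
import select
import socket


def start_connect(address):
    """Begin a TCP connection to a numeric (ip, port) without blocking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex(address)
    # EINPROGRESS (or EWOULDBLOCK on some platforms) means the handshake
    # has started and will finish in the background.
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def finish_connect(sock):
    """Call once the socket is reported writable; 0 means connected."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


# A local listener so the example is self-contained.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

sock = start_connect(listener.getsockname())
# A real event loop would register the socket for writability and carry
# on with other work; select() stands in for that here.
_, writable, _ = select.select([], [sock], [], 5.0)
connect_error = finish_connect(sock) if writable else None
print(connect_error)

sock.close()
listener.close()
```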

Again, I don't want to dump on Diesel here; for what it is, i.e. an experiment in how to idiomatically structure asynchronous applications, it's all right.  For that matter Twisted has its fair share of bugs too, which would be pretty easy to lay out in a similar post; you wouldn't even need to do the research yourself, just go look at our bug tracker.

But both Diesel and Tornado make the mistake of attempting to replace the years of trial-and-error, years of testing discipline, and years of portability and feature work that Twisted has accumulated with a few oversimplified, untested hacks.

What they could have done is contributed any extensions that they needed to Twisted's loop, or modifications to Twisted's packaging that would allow them to get a smaller sliver of Twisted's core to bootstrap, if that's what they needed.

My goal in pointing out all these flaws is not to illustrate any particular point about Diesel, but to reinforce the point I implicitly made in my Tornado post, which is that if you try to write a new mainloop (especially without tests) you will screw it up.  You will most likely screw it up in ways which will only surface later, under mysterious circumstances, when your servers are under load and you are under the gun for a deadline.

Or if I happen to get wind of it and write a blog post about it, of course.  Then you get to cheat a little.

It's not an indictment of Diesel that it screwed this up; everyone screws it up.  I would probably screw it up, if I didn't have Twisted sitting in front of me as a direct reference.  POSIX by itself is unreasonably subtle and difficult, but POSIX, plus the subtle variations in different platforms which implement it, plus the Windows APIs which are almost-but-not-quite-exactly-nothing-like the POSIX APIs, presents an inhuman challenge.

Hopefully Diesel will grow some tests.  Hopefully it will fix, or better yet shed, its somewhat unfortunate I/O hub.  I am hopeful that someone will follow Dustin's excellent lead (perhaps Dustin himself!) and port Diesel's API and generator system over to Twisted's I/O architecture and eliminate all these silly bugs.  Of course, if someone did that, you could use Dustin's Tornado port with Diesel.

With the silly bugs from the I/O loop out of the way, the Diesel team can write tests for the more interesting pieces, and fix the bugs which aren't entirely silly :-).