Scientific Method(ology)

I'm an empiricist at heart.  It's hard to integrate the principles of scientific discovery into one's daily life; while it makes for a good ontological basis for existence, it has the unfortunate side-effect of being damned expensive to apply consistently.  Still, I try when I can.

Recently it occurred to me that UQDS (Divmod's development methodology) has a striking similarity to the scientific process.  I can't find anything that neatly lays it out the way I remember learning it in school, but Wikipedia has a very nice treatment of the whole process.

As I understand the process of scientific knowledge acquisition, there are four phases:


  • A hypothesis is developed, which is an idea that might possibly be true.

  • An experiment is designed, and performed, to test the hypothesis.  If the experiment fails, a new hypothesis is developed.

  • The experiment is documented in a paper and published, and the publication is subjected to peer review by other scientists; the experiment is repeated.  If the experiment can't be repeated, it must be re-designed to eliminate errors, or an alternative hypothesis developed to take the errors into account.

  • The hypothesis is accepted as a fact, which may later be used to develop a model, theory, or even a law.


I know a few of my scientist friends out there might read this, so I want to be clear that I'm not proposing that this is what scientists do: scientists do a lot of things, and this is the merest subset of the process required for things to be accepted as "scientific fact", and only in the abstract sense.  Different scientific communities have different official standards.  Still, any new scientific discipline would probably have to start with these rules first to be considered "science", and then probably develop additional ones later.

UQDS corresponds rather directly, as if a code repository were its own particular branch of science.


  • hypothesis: a ticket is created, describing a feature which may be possible to implement correctly.

  • experiment: a branch is developed, which includes test cases.  Here, as the test cases fail, the branch is refined and the ticket adjusted, much as the hypothesis must be adjusted.

  • peer review: well, uh... peer review.  Another developer reviews the branch, verifies that the tests actually test the hypothesis, runs the tests (replicates the experiment) to verify that they pass for them as well, possibly in a different environment, and reports their findings.  If the tests fail, the branch must be adjusted further.

  • fact: the changes are then accepted for merging, where they are incorporated into the codebase, which corresponds to the scientific body of knowledge.

The Web One Hundred Point Oh Challenge

I haven't had much time for blogging lately, so rather than post my musings, here are some questions. I credit these to a combination of Iain M. Banks, r0ml, and the Long Now Foundation. Please comment and post a link to your blog if you write about one of these.

What kind of a program would take one hundred years to write? How would one manage such a project? How would you manage planning and estimating at that scale?

What if you had to write a program that would only be run once, but was supposed to run forever - could never, ever crash, notwithstanding hardware failures? That was expected to outlast humankind on earth? What would such a program do? How would you test it?

What if you had to write a program that would be maintained for ten thousand years - how would you factor it so that the lower levels could still be improved without breaking anything? What would the release process for modules look like? What if you inherited a program that was ten thousand years old but had never been designed to be maintained this way?

Finally - how does thinking about these problems of heretofore impossible scale affect the way you write programs today?

Update: I forgot to mention Alan Perlis's Epigram #28.

a scheme program

some jerk is not learning about functional programming

(define (parse callback)
  (lambda (data)
    (if (>= (string-length data) 10)
        ((parse (callback (substring data 0 10))) (substring data 10))
        (lambda (moredata)
          ((parse callback) (string-append data moredata))))))

(define (make-line-writer n)
  (lambda (line)
    (display (format "~A: '~A'\n" n line))
    (make-line-writer (+ n 1))))

((((((parse (make-line-writer 1))
     "hello") " world") " radix") " is") " dumb!!!!!!!!!!!!!!!")

Update: In addition to not being buggy, I've changed it so that this version is actually functional as well.
Update 2: Forgot the formatting change in the last update.
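
For readers who don't speak Scheme, here is a rough Python translation of the same trick (my own illustrative sketch, not part of the original program): parse(callback) returns a function that accepts a string; each complete 10-character chunk is handed to the callback, which returns the callback to use for the next chunk, and any leftover characters are buffered in a closure until more data arrives.

```python
# Illustrative Python translation of the Scheme program above (modern
# Python 3 syntax; this is a sketch, not the original code).
def parse(callback):
    def feed(data):
        if len(data) >= 10:
            # A full chunk: hand it to the callback, then re-parse the rest
            # with whatever callback it returns.
            return parse(callback(data[:10]))(data[10:])
        def more(moredata):
            # Not enough data yet: buffer it and wait for the next call.
            return parse(callback)(data + moredata)
        return more
    return feed

def make_line_writer(n):
    def write(line):
        print("%d: '%s'" % (n, line))
        return make_line_writer(n + 1)
    return write

# Prints 1: 'hello worl' then 2: 'd radix is'; " dumb!!!" stays buffered.
parse(make_line_writer(1))("hello")(" world")(" radix")(" is")(" dumb!!!")
```

Note that, as in the Scheme version, no state is mutated anywhere: each call returns a fresh function closed over the remaining buffer.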

Python mainpoints

In Python, sometimes you want to make a module runnable from the command line, but also usable as a module.  The normal idiom for doing this is:

# myproject/gizmo.py
from myproject import thing1

class Gizmo(thing1.Thingy):
    def doIt(self):
        print "I am", self, "hello!"

def main(argv):
    g = Gizmo()
    g.doIt()

if __name__ == "__main__":
    import sys
    main(sys.argv)

There are several problems with this approach, however.  Immediately, you will notice that packages no longer work.  Assuming that 'myproject' has an __init__.py, here is the output from an interpreter and the shell:

% python
>>> import myproject.gizmo
>>> myproject.gizmo.main([])
I am <myproject.gizmo.Gizmo object at 0xb7d38c0c> hello!
>>>
% python myproject/gizmo.py
Traceback (most recent call last):
  File "myproject/gizmo.py", line 2, in ?
    from myproject import thing1
ImportError: No module named myproject

In the interpreter case, sys.path includes ".", but in the script-as-program case, sys.path includes dirname(__file__).  The normal fix for this is to make sure that your PYTHONPATH is properly set, but that can be quite a drag if you're just introducing a program to a user who wants to run something right away.  Quite often, people either fix or don't experience this problem by changing 'from myproject import thing1' to 'import thing1' to ignore this distinction.  It almost looks like it works:

% python
>>> import myproject.gizmo
>>> myproject.gizmo.main([])
I am <myproject.gizmo.Gizmo object at 0xb7d72c0c> hello!
>>>
% python myproject/gizmo.py
I am <__main__.Gizmo object at 0xb7ccfb6c> hello!

You might think, with the exception fixed, that this program is now ready for action.  Not so fast, though.  There are a couple of subtle problems here.  Notice how Gizmo's class name is different - it thinks it's in the "myproject.gizmo" (correct) module in one case, and the "__main__" (incorrect) module in the other?  That has the nasty side-effect of making all Gizmo objects broken with respect to Pickle, or any other persistence system which uses class names to identify code.  Depending on your path configuration this problem can get almost intractably difficult to solve.  There's another solution which I have seen applied, which is to change the 'if __name__ == "__main__"' block at the end to look something like this:

if __name__ == '__main__':
    import myproject.gizmo
    myproject.gizmo.main([])

That works fine, but it does all the work of setting up the module twice, which is merely wasteful unless Gizmo uses a destructive metaclass - which many libraries, including at least a few that I've written, do use; declaring an axiom.item.Item subclass is one example.

I'd like to suggest an idiom which may help to alleviate this problem, without really introducing any additional overhead.  If you're going to declare a __main__ block in a module that can also be a script, do it at the top of your module, like so:

# myproject/gizmo.py
if __name__ == "__main__":
    import sys
    from os.path import dirname
    sys.path.append(dirname(dirname(__file__)))
    from myproject.gizmo import main
    sys.exit(main(sys.argv))

from myproject import thing1

class Gizmo(thing1.Thingy):
    # ...

With only two extra lines of code, your module is only imported once, path-related problems are dealt with, 'main' can return an error-code just like in C, and your module will function in the same way when it runs itself as when it is imported.  Here's some output:


glyph@alastor:~% python
>>> from myproject.gizmo import main
>>> main([])
I am <myproject.gizmo.Gizmo object at 0xb7d69c2c> hello!
>>>
% python myproject/gizmo.py
I am <myproject.gizmo.Gizmo object at 0xb7cb7cac> hello!

As a little bonus, this even works properly with Python's "-m" switch - although at least in 2.4 you have to use the odd '/' separator rather than '.' like everywhere else:

% PYTHONPATH=. python -m myproject/gizmo
I am <myproject.gizmo.Gizmo object at 0xb7d0af4c> hello!

Unswitch

Ted Leung has a very good list of the things the open-source desktop should be doing.


I started to respond by giving pointers to various half-solutions to these problems; Deskbar or Arnic to replace Quicksilver, for example; but he really has a point. There are some definite weak areas on the Linux desktop that MacOS X addresses well. The lack of some unified scripting approach is particularly embarrassing; the fact that it is easier to script applications in both major proprietary desktops (OS X through Applescript, Windows through COM) is sad. Also, DVI and EDID are (as far as I can tell) far enough along that the hardware side of color profiling would be possible in Linux these days, it's just up to the OS and the desktop to support it.


(I am not mentioning KDE here because the last few times I've tried it, it has crashed constantly, and that mirrors my abysmal experience with programming PyQT. I understand a great many people use it, and more power to you, but it's just not likely to be relevant to me any time in the near future.)


So, rather than pointing out how a user such as Ted might hobble through Ubuntu-land, I'll give an explanation from my point of view: for me, Ubuntu is indispensable, and the Mac is basically unusable.


I'm not prejudiced against Apple. Far from it, in fact. I grew up on the Mac, and I will always have a soft spot in my heart for them. Through the first few years of my career as a professional programmer, I had the absolute first release of MacOS X server, and every beta of MacOS X. My experiences with Ubuntu have made me disappointed with Apple though, and I despair of MacOS ever being my platform of choice again. Following the structure of Ted's rant, here are some things that Apple might do to make me "switch":


Add some keyboard shortcuts. During normal use of my Ubuntu desktop, the only time I have to use the mouse is to quickly select links in a poorly-designed application (usually a web application, where keyboard bindings are really hard to get right). I move windows, maximize, minimize, resize and otherwise shuffle things around all the time - not using any crazy third-party utility, but the built-in keybindings in Metacity.


For that matter, building virtual desktops into the OS wouldn't hurt. I'm aware that there are a few third-party programs, but it would be much nicer to have there be one right way to do it. I've heard it said that this would be "too hard" for new users, but I think that's wrong. The idea of a single virtual monitor you can "move to" makes more sense to every user I've ever talked to than the mac's "invisible sheets with a menu bar and icon stuck to them" mental model you're supposed to adopt with the mac desktop. It took about five tries to explain the difference between an "open" and "closed" application on the mac to my grandmother, because the icon to indicate this difference is literally about 9 pixels, and applications will happily stay open forever with no visible windows.


Give me a panel I can really customize, not just drop applications onto. There are a few things I want to see on the screen at all times: memory usage, mounted disks, the current weather, the current time. The GNOME panel lets me put all of these somewhere omnipresent in a nice, small form factor with no fuss. These are not things that every user wants to see, so it has to be customizable. This is not a case of not having selected the right default or not having "designed" it right. Some people's needs are different enough that you need some freedom to choose.


Work on scaling applications up. I have dozens of gigabytes of free stock photos, artwork, and photos that I've taken, and while I haven't tried Aperture, iPhoto did not handle this well, and it didn't let me use it as a way to organize metadata without making copies of all the files. My mac's hard drive wasn't big enough for all the files I wanted to categorize: I have a terabyte network appliance for this purpose. I have dozens of gigabytes of music; every CD I have ever owned has been encoded (in multiple formats, including FLAC). iTunes chokes if I try to load my entire music library, let alone load it from multiple computers all on the shared drive. Unfortunately the default music player in Ubuntu (Rhythmbox) isn't great, but the fantastic Quod Libet lets me not only load my massive library, but also perform bulk cleanups on the ID3 tags (many of my CDs were ripped before I had easy access to the CDDB, some are in Ogg format, and I didn't adopt a consistent naming convention until recently).


Frankly, iTunes' DRM is offensive. I had a problem with my ITMS account, reinstalled my OS and reformatted my iPod one too many times, and now I can't play a bunch of my legitimately purchased music. (Update: Bob Ippolito corrects me in the comments; this was simply a problem I've personally had with the ITMS, not a matter of policy. It's unlikely that you'd have this problem with music you purchased today.) If iTunes were otherwise a really great music management tool and had worked well with my large repository of non-DRM'd music, I probably would have cut it some slack: but part of the problem is that iTunes corrupts its own database, and my iPod crashes when confronted with the volume of music I'm storing on it and periodically needs reformatting if I want it to work.


Finally, rather than respond to Ted's list of applications, let me talk about Free Software's killer application: APT, the Advanced Packaging Tool.


It would take me forever to list all the applications that MacOS X is missing out of the box which Ubuntu includes in a fresh installation. Just to name a few: Gaim, a messaging client where I can seamlessly integrate IRC, yahoo, AIM, jabber and a half-dozen other protocols. Inkscape, an Illustrator-style vector graphics editor which works natively in SVG. Gimp, a photoshop-style bitmap editor. OpenOffice. There are, of course, equivalents to all of these on the mac, but they're expensive, underfeatured, or clunky and obviously not native, and sometimes two of the three. The real magic here though isn't one particular application (after all, Ubuntu isn't packaged with an equivalent to GarageBand or Aperture) but the fact that they were preinstalled; I don't need to mess around with ten clicks per application just to get them set up: they're all there, immediately.


As a user, this is convenient, but as a developer, it quickly becomes indispensable. To set up a new working environment on a Mac, I would have to spend hours downloading, untarring, building, checking dependencies, installing -- or if I were lucky, clicking on packages, accepting EULAs, etc. For example: "apt-get build-dep python" summons all the packages I need to compile my own version of Python; a collection of software it might take an hour to identify without this facility, let alone install.


Sure, there's fink and darwinports, but those don't manage user-visible, GUI applications in the same way that they do UNIX-style development stuff. Note that the first things I mentioned were actual end-user apps, not arcane requirements for programming tools. In other words, these aren't really a part of the OS on MacOS X, they are a port of features from other OSes, and they feel like it.


Fundamentally, what my user experience comes down to is this: I install ten or twenty programs or libraries per week, but I spend almost no time at all actually doing the installing. If each of those twenty libraries took me five minutes to set up, that's over an hour and a half of wasted time per week, which really adds up fast. Especially in weeks like the last one, setting up a new computer, where I installed several hundred packages: 200 packages would be almost two solid work-days of just installing software!


I feel like this all boils down to an attitude. Succeed or fail, Ubuntu is just trying to provide me with tools to work with my data. It installs software as fast as it can go, it loads as much music and as many pictures as it can handle, and it doesn't bother me with the details if it can help it. The Mac is trying too, but I feel as though each application is suffering from some sort of inferiority complex. Each new application needs to make itself heard. It can't just be ready to use instantly; it must make itself felt, through whatever mechanism. I must commune with each sacred EULA alone; I must wait for the download bar, mount the virtual disk, draaaaag the icon to my Applications folder. I must select the drive that I wish the software to be installed to. I must confirm that yes, I am sure I want to overwrite files. Yes, all of them.


This could probably be fixed at a technical and legal level - after all, a good deal of these applications are free to download already, the delivery mechanism can't make that much of a difference, and Apple already delivers all of its updates through a generic "software updates" facility which could be expanded to offer the features of an APT repository. I have no idea how to deal with the cultural problem, though, where each application author (even if all their application does is launch other applications!) feels they deserve twenty minutes of your attention and five dollars.


Sometimes free software breaks. Sometimes, all software breaks. I haven't really had one of these moments with Ubuntu yet, but I'm sure I will: sometimes it breaks really, really badly, and I have to do those ridiculous things to fix it that Windows users get laughed at for doing in Apple commercials. I have, in a dim and distant past I would rather not recall, had to edit the spiritual equivalent of my AUTOEXEC.BAT. I can see why to many people it just isn't worth it. I'm not suffering under some stern ascetic desktop because I want to "support free software" or anything like that, though. In general, the experience is quite pleasant, and I don't think it would really improve that much if I used a Mac. Freedom does have some practical consequences, though. Those times when I want to do something with Linux that really would be better on a Mac, it's worth suffering through for the knowledge that, if it breaks, it's not going to intentionally stop me from fixing it because it wants to charge me $0.99 for another copy of the song.