Conference FAIL

Wednesday July 09, 2008

Last night at a dinner with Ivan Krstić and Itamar Shtull-Trauring, we were all lamenting that too many (all?) software conferences focus specifically on positive results. This is what you want, of course, if you treat a conference as purely a marketing venue. However, most learning takes place based on something that someone did wrong and then needed to correct, not something that they did right.

All of the great software developers I know have at least one great story of how a project they were working on was a complete disaster. Often these projects are shielded from the public eye, since nobody wants to talk about failure. So, how do we make a public discussion of these ideas socially acceptable?

Thus, an idea was born: FAILcon. The idea is simple: submitted talks and papers must be related to projects which failed in an interesting way. The larger the better, of course — the bigger they are, the harder they fail — but anything that failed in an interesting way would be a valid subject for discussion.

I'm writing about it so that it won't be forgotten, because I think it's a great idea. But I doubt that any of us are going to organize a conference any time soon. So please, steal this idea. Does anyone out there with conference-organizing skills want to get something together based around the common theme of failure?

Static On The Wire

Friday July 04, 2008

I am, as you might have guessed, a big fan of dynamic typing. Yet, two prominent systems I've designed, the Axiom object database and the Asynchronous Messaging Protocol (AMP) have required systems for explicit declarations of types: at a glance, static typing. Have I gone crazy? Am I pining for my glory days as a Java programmer? What's wrong with me?

I believe the economics of in-memory and on-the-wire data structures are very, very different. In-memory structures are cheap to create and cheap to fix if you got their structure wrong. Static typing to ensure their correctness is wasted effort. On the other hand, while on-the-wire data structures (data structures which you exchange with other programs) can be equally cheap to create, they can be exponentially more expensive to maintain.

When you have an in-memory data structure, it's remarkably flexible. It is, almost by definition, going to be thrown away, so you can afford to change how it will be represented in subsequent runs of your program. So, when your compiler complains at you for getting the static type declarations wrong, it's just wasting your time. You have to write unit tests anyway, and static typing makes unit testing harder. What if you want a test that fakes just the method foo on an interface which also requires baz, boz, and qux, so you can quickly test a caller of foo and move on? A really good static type system will just figure that out for you, but it probably needs to analyze your whole program to do it. Most "statically typed" languages — such as the ones that actually exist — will force you to write a huge mess of extra code which doesn't actually do anything, just so all your round pegs can pretend to fit into square holes well enough to get your job done.

But I don't have to convince you, dear reader. I'm sure the audience of this blog is already deeply religious on this issue, and they've got my religion. I'm just trying to make sure you understand I'm not insane when I get to this next part.

The most important thing that I said about in-memory data structures, above, is that you throw them away. It's important enough that I'll repeat it a third time, for emphasis: you throw them away. As it so happens, the inverse is the most important property of an on-the-wire data structure. You can't throw it away. You have to live with it.

Forever.

Oh, sure, you told your customers that they all have to upgrade to protocol version 3.5, but they're still using 3.2. Unless you're Blizzard Entertainment, you can't tell them to download the new version every six weeks or go to hell. Even if you can do that (and statistically speaking, you probably aren't Blizzard Entertainment) you have to keep the old versions of the updater protocol around so that when version 4.0 comes out all the laggards who haven't even run your program since 3.0 can still manage to upgrade.

Here's the best part: your unit tests aren't going help you — at least, not in the same way they would with your in-memory data. When you change an in-memory data structure, you aren't supposed to have to change your unit tests. You want the behavior to stay the same, you don't change the tests; if they start failing, you know something is wrong. With your new protocol changes though, you can have tests for the old protocol, and tests for the new protocol, but every time you make a protocol change you need to a new test for every version of the protocol which you still support. Plus, you probably can't stop supporting older versions of the protocol (see above).

If you've got a message X[3], and you're introducing X[4], you have to make sure that X[4] can talk to X[3] and X[2] and X[1]. Each of those is potentially a new test. Each one is more work. Even worse, it's possible to introduce X[4] without realizing that you've done it! If you have a new, optional argument, let's call it "y", to a dynamically-typed protocol, your old tests (which didn't pass y) will pass. Your new tests (which do pass y, to the newly-modified X[4]) also pass. But there's a case which has now arisen which your tests did not detect: y could be passed to a client which only supports X[3], and an error occurs.

If this were some in-memory structures, that case no longer exists. There is no version of X currently in your code which cannot accept y. Your tests ensure that. You have to time-travel into the past for your unit tests to discover the code which would cause them to fail. You can't just do it once, either: maybe X[3] was designed to ignore all optional parameters. You have to consider X[2] and X[1]. You have to travel back to all points in time simultaneously.

This is why I said that the cost is exponential: you carry this cost forward with each new supported version that gets released. Of course, there are ways to reduce it. You can design your protocol such that arguments which your implementation doesn't understand are ignored. You can start adding version numbers to everything, or change the name of every message every time some part of its schema changes. All of these alternatives get tedious after a while.

So what does this have to do with static typing? Static type declarations can save you a lot of this work. For one thing, it becomes impossible to forget you're changing the protocol. Did you change the data's types? If so, you need to add a compatibility layer. These static type declarations give you key information: what do the previous versions of the protocol look like? More importantly, they give your code key information: is an automatic transformation between these two versions of the data format possible? (If not, is the manual transformation between these two versions correct?)

In a dynamically typed program, you can figure out your in-memory types are doing by running the debugger, inspecting the code that's calling them, and simply reading the code. Sometimes this can be a bit spread out — in a badly designed system, painfully spread out — but the key point is that all the information you need is right in front of you, in the source code. If you're working on code that is shipping data elsewhere without an explicit schema, you have to have a full copy of the revision history and some very fancy revision control tools telling you what the protocol looked like in the past. (Or, perhaps, what the protocol that some other piece of software has developed used to look like in the past.)

Your disk is another kind of wire. This one is particularly brutal, because while you might be able to tell someone to download a new client to be able to access a service, there is no way you are ever going to get away with saying "just delete all your data and start again. there's a new version of the format." When writing objects to disk (or to a database), you might not be talking across a network, but you're still talking to a different program. A later version of the one you're writing now. So these constraints all apply to Axiom just as they do to AMP; moreso, actually, because in the case of AMP all the translations can be very simple and ad-hoc, whereas in Axiom the translations between data types need to be specifically implemented as upgraders.

With a network involved, you also have to worry about an additional issue of security. One way to deal with this is by adding linguistic support to the notion of untrusted code running "somewhere else", but type declarations can provide some benefit as well. Let's say that you have a function that can be invoked by some networked code:

@myprotocol.expose()

  def biggger(number):

      return number * 1000

Seems simple, seems safe enough, right? 'number' is a number taken from the network, and you return a result to the network that is 1000 times bigger. But... what if 'number' were, instead, a list of 10,000 elements? Now you've just consumed a huge amount of memory and sent the caller 1000 times as much traffic as they've sent you. Dynamic typing allows the client side of the network connection to pass in whatever it wants.

Now, let's look at a slightly different implementation of that function:

@myprotocol.expose(number=int)

  def biggger(number):

      return number * 1000

Now, your protocol code has a critical hint that it needs to make this code secure. You might spell it differently ("arguments = [('number', Integer())]" comes to mind), but the idea is that the protocol code now knows: if 'number' is not an integer, don't bother to call this function. You can, of couse, add checks to make sure that all the methods you want to call on your arguments are safe, but that can get ugly quickly.

Let's break it down.

Static type declarations have a cost. You (probably) have to type a bunch of additional names for your types, which makes it difficult to write code quickly. Therefore it is preferable to avoid that cost.

All the information you need about the code at runtime is present when you're looking at your codebase. Therefore — although you may find its form more convenient — static type declarations don't provide any additional information about the code as it's running. However, information about the code on opposite ends of the wire may only be in your repository history, or it may not be in your code at all (it could be in a different codebase entirely). Therefore static typing provides additional information for the wire but not in memory.

At runtime, you only have to deal with one version of an object at a time. On the wire, you might need to deal with a few different versions simultaneously in the same process. Static type declarations provide your application with information it may need to interact with those older versions.

At runtime (at least in today's languages) you aren't worried about security inside your process. Enforcing type safety at compile time doesn't really add any security, especially with popular VMs like the JVM not bothering to enforce type constraints in the bytecode, only in the compiler. However, static type declarations can help the protocol implementation understand the expectations of the application code so that it does not get invoked with confusing or potentially dangerous values. Therefore static type declarations can add security on the wire while they can't add security in memory. (It turns out that if you care about security in memory, you need to do a bunch of other stuff, unrelated to type safety. When the rest of the world catches up to the E language I may need to revisit my ideas of how type safety help here.)

If you have data that's being sent to another program, you probably need static type declarations for that data. Or you need a lot of memory to store all those lists I'm about to multiply by 1000 on your server.

Constructive Criticism

Wednesday July 02, 2008

I frequently say that I'm a big fan of constructive criticism of Twisted, but I rarely get it. People either gush about how incredibly awesomely spectacularly awesome Twisted is, or they directionlessly rant about how much it sucks, but aside from a fairly small group of regulars who file issues on the Twisted tracker, I don't hear much in between.

I caught wind of (and responded to) some blog comments of the latter type (directionless ranting) from Lakin Wecker. After I responded, in an unusual response for someone writing such comments, he apologized and promised to do much better. He has responded with some much more specific and potentially constructive criticism, ominously entitled "twisted part 1".

Lakin, thanks for reformulating your complaints in a more productive way. I do think that some useful things might happen as a result of this article. While I don't necessarily agree with it, I do care about this type of criticism. In order to demonstrate my appreciation, I will try to make this a thorough reply.

It sounds like there are several mostly separate issues that you had here. I'll address them one at a time.

Twisted Mail

I believe that the main issue is that the twisted.mail API is missing some convenience functionality which will allow users to quickly build SMTP responders that deal with whole messages. This is definitely a shortcoming of twisted.mail.

However, this shortcoming is not entirely unintentional. In general, Twisted's interfaces encourage you to write code which scales to arbitrary volumes of input. IMessage is a thing that can receive a message, rather than a fully parsed in-memory message, because we want to encourage users to write servers that don't fall over. If you have to handle each line as it arrives, it's less likely that you'll die if you a message bigger than the memory of the machine that is running the server.

That's not to say that there shouldn't be some additional, higher-level interface which does what you want. Quotient, for example, uses twisted.mail, but provides a representation of a message which has all of its data written to disk first, and efficient APIs for accessing things like headers without fetching the whole message back into memory. twisted.mail almost provides something like this itself; if you poke around in twisted.mail.maildir and twisted.mail.mail, you'll find FileMessage (an implementation of a message which writes its contents to disk) and MaildirDirdbmDomain (an implementation of IDomain which uses a directory of maildirs to deliver messages). Not that these would not have been useful for your use case: they just show that we're happy to have higher-level stuff implemented within Twisted.

One function which might be cool to provide is something which will parse an incoming SMTP message and convert it to an email.Message.Message, then hand it off to some user code. Even better would be to integrate this with the command-line "twistd mail" tool, such that you could easily deploy such a class as an LMTP server or something like that.

Although we don't have all the pieces you need, there is also the ever-present issue of documentation of the pieces which we do have. Some of the code in twisted.mail might have been useful to you if its documentation had been better. For example, you might also notice some pretty strong similarities between twisted.mail.protocols.DomainDeliveryBase.receivedHeader and your own implementation of that method.

My main point here is that fixing this is a simple matter of programming (or, in the latter case, of documenting). I think that the best way to deal with that shortcoming is simply to submit patches to twisted.mail which add the functionality that you want. Lots of open source projects are like this: they were driven just far enough to satisfy their implementors' use-cases. twisted.mail is a perfectly functional and simple API if you want to build what it is designed to build.

When we're talking about "Twisted", we're typically talking about the core, and the programming model that comes with it. When you get into the specifics of an API like twisted.mail, twisted.names, and even twisted.web (maybe even especially twisted.web) you're going to find plenty of shortcomings and areas that it don't yet do what you need. There are some areas which are downright bad, and some which are so bad that they're embarrassing. We need volunteers to identify the areas that are lacking and add to them.

Twisted vs. Things Which Are Not Twisted

The reason that I disagree with your conclusion that Twisted as a whole is necessarily more complex, hard to explain, too dense, unreadable (etc, etc) is that the main thing to compare it to is shared-state multithreaded socket servers, or asyncore.

Here's a good example of what makes Twisted simple, at its core:

from twisted.internet.protocol import Protocol

  class Echo(Protocol):

    def dataReceived(self, data):

      self.transport.write(data)

This server supports a large number of clients. It supports TLS. It's cross-platform. It supports both inbound and outbound connections. And yet, including the import, it's only 4 lines of code. You can write a threaded version of this which appears to be just as short, but it's pretty much impossible to do without getting a half-dozen subtleties of either a socket API or a concurrency issue wrong.

For example, your example "smtp_helper.py". You don't provide any documentation of its concurrency properties, but the implementation of 'start' is almost certainly wrong. For one thing, starting the same TestSMTPServer twice, or even starting two completely different TestSMTPServers at the same time, will not work. Of course, you'd never do that, but let's say your SMTP client also used asyncore and a thread. Now you've got a client using socket_map in the main thread and a server using socket_map in another thread. Also, there's the fact that process_message may be called from an arbitrary thread; if it ever grew to do anything more complex than appending to a list, it would need its own serialization logic. This isn't something that could be fixed — the entire approach is wrong, and you would need to rewrite all of your tests to work completely differently in order to fix it. You'd need to asynchronously start both your client and your server, then have an API for letting your tests know when both of them are done. By the time you're doing that, you're practically implementing your own mini-Twisted, along with extensions to unittest that turn it into Trial.

Ironically, you can use Twisted to fix this problem. If you really like the API presented by the 'smtpd' module, you could write a wrapper which would make an asyncore dispatcher look like a Twisted protocol factory (or protocol), and hook asyncore into the main loop, then use 'trial' for your testing. How exactly one would implement such a thing is beyond the scope of this post, but it's not actually that hard; just look at the relatively few methods that asyncore.dispatcher calls on self.socket and you'll probably get the idea.

I feel that the comparison of "Twisted" versus "non-Twisted" code you've presented is a bit unfair. The Twisted example is a demonstration of utility functionality that Twisted Mail is missing, not a core idea that Twisted implements wrong. The code it is being compared to looks simple only because critical areas of correctness that would need to be addressed in a real system (and will probably eventually need to be addressed, if the test is maintained for a long time) are being completely ignored. The twisted example, if it fails, will fail relatively straightforwardly; the other example's failure mode will be an obscure traceback coming out of otherwise unrelated (but not thread-safe) code.

However, your subjective experience of some areas of Twisted being hard to understand and use is entirely valid. Your detailed description of why it was difficult for you has already been useful, but I hope you will stick around and help us improve the situation for future users as well.

Trial and Testing

Perhaps the more significant issue that you discovered while you were working on this is the subtle mystery of getting Twisted to fully shut down a connection and a bound port inside a test. This is really way too hard, and it is a problem which affects anyone who wants to use Trial for integration testing.

Although I'd really like to see this problem dealt with in a systematic way, and I'd like it to be easy as pie to write integration tests with trial, there is a reason that the issue hasn't been fixed. As the Twisted team has been improving our testing skills, we've been finding more and more that you absolutely need good unit tests before you can really write integration tests. Without unit tests, you don't know whether the individual pieces work, so they tend to break in surprising ways when you put them together. In Twisted itself we are still in the process of rehabilitating a very large, and very old hodgepodge of unit, functional, and integration tests to be broken down into smaller, more coherent unit tests. Until that process is finished, and trial has been tuned to be as good as possible for that sort of testing, integration testing isn't going to be a focus of any core developer.

I agree with the advice that you were given on IRC. We could eliminate the particular surprise of doing a clean connection shut-down in trial, and provide a good way to do it, but you'd still face issues with your tests where the SMTP API might be scheduling timed calls or doing other things behind your back which would be difficult to monitor or shut down. Talking to a mock message-sending implementation for starters would be a lot easier.

I can understand your concern about passing more parameters. Luckily, this is Python: you don't necessarily need to change the interface of the system you're testing. If you have a system, A, that depends on another system, B, to perform some of its work, you need to have a reference from A to B somewhere. That can be passed as a parameter, imported as an object, or loaded as a module. In Java, you'd need to change all your type declarations and do some kind of dependency injection magic, but in Python you can always cheat. The worst case in Python, after all, is that A imports B as a module. So, if you don't want to add any parameters, or even any attributes or methods, consider this:



  # A.py

  import B

  

  def stuff():

    B.functionFromB().otherStuff()

  

  # test_A.py

  import unittest

  import A

  import B

  

  class MyTest(unittest.TestCase):

  def
  functionFromB(self):

      result = B.functionFromB()

      # Modify the result for the test, if you like

      return result

  def setUp(self):

      A.B = self

    def tearDown(self):

      A.B = B

Some might consider this a bit gross, of course. It might be cleaner to add a specific API for plugging in a different implementation of B. However, it's useful to use this technique in cases — such as the one you described in your post — where you are trying to add some test coverage for an API which has already been written and you don't have control over.

I hope that digression helped, but I don't want to turn this into a screed about what you could have done better; let's consider your requirements as fixed (this needs to be an integration test) and look at what Twisted could have done better.

One thing the core team has been talking a lot about lately has been the development of verified test doubles. We don't have a lot of them, and we need more. For example, if you could pass a fake reactor to both your SMTP sender and receiver code, then you could manually make sure it was sending traffic at the appropriate times, to the appropriate hosts, and fail your test in sensible ways if it did something unexpected, rather than just having trial bomb out on you. This would also let you have regression tests to make sure that your code was working with the latest version of Twisted, in case the APIs in question changed. You wouldn't need your test to have a full, complete, clean shutdown of your SMTP connections because they would simply be garbage collected, as they would not be connected to the real reactor. You can see an example of what this might look like in twisted.internet.task.Clock. If someone contributed a real, documented, usable, verified test double for IReactorTCP, we would all be eternally grateful, especially if they could coalesce all the uses of the numerous half-assed attempts at it in our own test suite.

Something else we could do is write a supported factory wrapper which would allow the use of a real factory and connection in a trial test, but that would shut everything down cleanly at the connection level in tearDown. I would personally like this a lot, but I can't promise that it would be popular with the rest of the Twisted team. We all spend a lot of time trying to convince people to write unit tests before integration tests. I know that I'm a little concerned that providing great integration testing support will just lead to more people being confused by weird interactions in the guts of whatever protocol they're talking to. Eventually, however, integration tests can be useful, and I wrote the beginnings of the wrapper that I'm suggesting when I was writing tests for the AMP protocol. You might be able to use that as an example even if Twisted doesn't provide any public APIs for that sort of thing.

Conclusions

Unfortunately there's not much I can do immediately to fix the problems that you've had, Lakin. If someone with a similar level of Twisted experience attempts a similar task in the near future, it's likely that they'll hit the same issues. I barely (read: didn't actually) have the time to write this blog post, and I definitely don't have the time to fix the problems I've outlined.

While there are definitely some problems here, I don't think the situation is really all that bad. According to your post, learning enough about Twisted to do what you were doing and writing the Twisted version of this code took only 3 days. This learning curve is not as steep as some have accused Twisted of having. Presumably it would have taken someone already familiar with twisted.mail and trial much less time. It didn't take me much more than 2 minutes to read and understand it :-). As I mentioned above, your friend's threaded smtpd implementation has some pretty severe problems which might cause maintenance headaches later, whereas you were quite careful to do a proper shutdown (the trickiest thing to get right) in the Twisted version, so it is likely to be fairly robust going forward.

This Word, "Scaling"

Monday June 30, 2008

You keep using that word. I do not think it means what you think it means.
— Inigo Montoya

It seems that everyone on the blogosphere, including Divmod, is talking about "scaling" these days. I'd like to talk a bit about what we mean — and by "we" I mean both the Twisted community and Divmod, Inc., — when we talk about "scaling".

First, some background.

Google Versus Rails

Everyone knows that Scaling is a Good Thing. It's bad that Rails "doesn't scale" — see Twitter. It's good that the Google App Engine scales — see... well, Google. These facts are practically received wisdom in the recent web 2.0 interblag. The common definition of "scaling" which applies to these systems is the "ability to handle growing amounts of work in a graceful manner".

And yet (for all that I'd like to rag on Twitter), Twitter serves hojillions of users umptillions of bytes every month, and (despite significant growing pains) continues to grow. So in what sense does it "not scale"? While that's going on, Google App Engine has some pretty draconian restrictions on how much an application can actually do. So it remains to be seen whether GAE will actually scale, and right now you're not even allowed to scale it. Why, exactly, do we say that one system "scales" and the other doesn't, when the actual data available now says pretty much the opposite?

A GAE application may not scale today, but when Our Benefactors over at the big "G" see fit to turn on the juice, you won't have to re-write a single line of your code. It will all magically scale out to their demonstrably amazing pile of computers — assuming you haven't done anything particularly dumb in your own code. All you have to do is throw money at the problem. Well, actually, you throw the money at Google and they will take the problem away for you, and you will never see it again. It accomplishes this by providing you with an API for accessing your data, and forbidding most things that would cause your application to start depending on local state. These restrictions are surprisingly strict if you are trying to write an application that does things other than display web pages and store data, but that functionality does cover a huge number of applications.

Rails, on the other hand, does not provide facilities for scaling. For one thing, it doesn't provide you with a concurrency model. Rails itself is not thread safe, nor does it allow any multiplexing of input and ouptut, so you can't share state between multiple HTTP connections. Yet, Rails encourages you to use "normal" ruby data structures, not inter-process-communication-friendly data structures, to enforce model constraints and do other interesting things. It's easily to add logic to your rails application which is not amenable to splitting over multiple processes, and it's hard to make sure you haven't accidentally done so. When you use the only concurrency model it really supports, i.e. locking and using transactions via the database, Rails strongly encourages you to consider your database connection global, so "sharding" your database requires significant reconsiderations of your application logic.

These technical details are interesting, but they all point to the same thing. The key difference between Rails and GAE is the small matter of writing code. If you write an application with Rails, you probably have to write a whole bunch of new code, or at least change around all of your old code, in order to get it to run on multiple computers. With GAE, the code you start with is the code you scale with.

Economics of Scale

The key feature of "scalability" that most people care about is actually the ability of a system to efficiently convert money to increased capacity. Nobody expects you to be able to run a networked system for a hundred million users on a desktop PC. However, a lot of business people — especially investors — will expect you to be able to run a system for a hundred million users on a data-center with ten million cores in it. Especially if they've just bought one for you.

Coding is an activity that is notoriously inefficient at converting money into other things. It's difficult to predict. It's slow. But most unnervingly to people with money to invest, pouring money on a problematic software project is like pouring water on an oil fire: adding more manpower to a late software project makes it later. If you have a hard software problem, you want to identify it early and add the manpower as soon as possible, because you won't be able to speed things along later if you start running into trouble.

So, the thing that pundits and entrepreneurs alike are thinking about when they start talking about "scalability" is eliminating this extra risky phase of programming. Investors (and entrepreneurs) don't mind investing some money in a "scaling solution", but they don't want to do it when they are in the hockey-stick part of the growth curve, making first impressions with their largest number of customers, and having system failures. So we're all talking about what hot new piece of technology will solve this problem.

At a coarse granularity, this is a useful framing of the issue. Technology investment and third-party tools really can help with scaling. Google and Amazon obviously know what they're doing when it comes to world-spanning scale, and if they're building tools for developers, those tools are going to help.

As you start breaking it down into details, though, problems emerge. Front and center is the problem that scalability is actually a property of a system, not an individual layer of that system, infrastructure or no. Even with the best, sexiest, most automatic scaling layer, you can easily write code that just doesn't scale. As a soon-to-be purveyor of "scalability solutions" myself, this is a scary thought: it's easy to imagine a horror story where a tiny, but hard to discover error in code written on top of our infrastructure makes it difficult to scale up.

That error need not be in the application code. The scaling infrastructure itself could have some small issue which causes problems at higher scales. After all, you can do extensive testing, code review, profiling and load analysis and still miss something that comes up only under extremely high load.

Does Twisted Scale?

Just about any answer to this question that you can imagine is valid, so I'll go through them all and explain what they might mean.

No.

Applications written using Twisted can very easily share lots of state, require local configuration, and do all kinds of things which make them unfriendly to distribution over multiple nodes. Since there is no 'canonical' Twisted application (in fact, you might say that the usual Twisted application is simply an application unusual enough to be unsuited to a more traditional LAMP-type infrastructure), there's no particular documented model for writing a Twisted application that scales up smoothly. None of the included services do anything special to provide automatic scaling. There are no state-management abstractions inside Twisted. If you talk to a database in a Twisted application, the normal way to do it is to use a normal DB-API connection.

When I discussed Rails above, I said that the reason it doesn't scale is that it's too easy, by default, to write applications that don't scale. Therefore we must conclude that Twisted doesn't scale.

Yes.

Twisted is mainly an I/O library, and it uses abstract interfaces to define application code's interface with sockets and timers. Twisted itself includes several different implementations of different strategies for multiplexing between these timers, including several which are platform-specific (kqueue, iocp), squeezing the maximum scale out of your deployment platform, even if it changes.

I said above that infrastructure is scalable if it lets you increase your scale without changing your code. It would make sense to say that Twisted scales because it allows you to increase the number of connections that you're handling by changing your reactor without changing your code.

You could also say that Twisted is scalable because it is an I/O library, and communication between different nodes is almost the definition of scale these days. Not only can you write scalable systems easily using Twisted's facilities, you can use Twisted as a tool to make other systems scale, as part of a bespoke caching daemon or database proxy. Several Twisted users use it this way.

Maybe.

Being mostly an I/O library, Twisted itself is rarely the component most in need of optimization. Being mostly an implementation of mechanisms rather than policies, Twisted gives you what you need to achieve scale but doesn't force, or even encourage you, to use it that way.

For the most part, it's not really interesting to talk about whether Twisted scales or not. The field of possibilities of what you can do with Twisted is too wide open to allow that sort of classification.

What about Divmod? Does Mantissa scale?

Mantissa, lest you have not heard of it already, is the application server that we are developing at Divmod. Mantissa is based on Twisted, among other components. However, there's a big difference in what the answer to the "scaling" question means than it means to Twisted.

Twisted is very general and can be used in almost any type of application, from embedded devices to web services to thick clients to system management consoles. It's almost as general as Python itself — with the notable exception that you can't use Twisted on Google App Engine because they don't allow sockets. As part of being general, Twisted doesn't dictate much about the structure of your application, except that it use an event loop. You can manage persistent state however you want, deal with configuration however you want.

Mantissa, on the other hand, is only for one type of application: multi-user, server-side applications, with web interfaces. You might be able to apply it to something else but you would be fighting it every step of the way. (Although if you wanted to use Mantissa's components for other types of applications, the more general parts decompose neatly into Nevow and Axiom.) So the question of "does it scale" is a bit more interesting, since we can talk about a specific type of application rather than a near-infinite expanse of possibilities. Does Mantissa scale to large numbers of users for these types of "web 2.0" applications?

Unfortunately, the fact that the question is simpler doesn't make the answer that much simpler, so here it is:

Almost...

Mantissa has a few key ingredients that you need to build a system that scales out. The biggest one is a partitioned data-model. Each user has their own database, where their data is stored.

A very common "web 2.0" scaling plan — perhaps the most common — is to have an increasing number of web servers, all pointed at a single giant database with an increasingly ridiculous configuration — gigabytes of RAM, terabytes of disk, fronted by a bank of caching servers. This works for a while. For many sites, it's actually sufficient. But it has a few problems.

For one thing, it has a single point of failure. If your database server goes down, your service goes down. Your database server isn't a lightweight "glue" component, either, so it's not a single point of failure you can quickly recover if it goes down. Even worse, it means that even in the good scenario, where you can scale to capacity, your downtime is increased. Each time you upgrade the database, the whole site goes down. This problem gets compounded because a lot of sites are append-only databases with increasingly large volumes of data to migrate for each upgrade.

Another issue is that it increases load on your administrators, because they are responsible for an increasingly finicky and stressed database server. This may actually be a good thing — administrators are not programmers, after all, and are therefore a more reliable and easier resource to throw money at. Unfortunately there are (almost by definition) fewer things that admins can do to improve the system. Because the admins can't actually solve the root problems that make their lives difficult, it's easier for them to get frustrated and leave for an environment where they won't be so stressed.

The reason websites choose this scaling model is that popular frameworks, or even non-frameworks like "let's just do it in PHP", make it easy to just use a single database, and to write all the application logic to depend on that single database as the point of communication between system components. So the scaling plan is just working with the code that was written before anybody thought about scaling.

If you write an application with Mantissa today, it's easiest to toss the data into different databases depending on who it is for, so when you get to dealing with the "scaling" part of the problem, you can put those databases onto different computers, and avoid the single point of failure. Moreover, when you write an application with Mantissa, you get "global" services like signup and login as part of the application server, so your application code can avoid the usual schema pitfalls (the "users" table, for example) which require a site to have a single large database.

There's only one problem with that plan.

... but not quite.

In my humble opinion, Mantissa offers some interesting ideas, but there are a few reasons you won't get scaling "for free" with Mantissa if you use it right now, today.

You may be noticing about now that I didn't mention any way to communicate between those partitioned chunks of data. This is what I've been spending most of my last few weeks on. I have been working on an implementation of an "eventually consistent" message-passing API for transferring messages between user databases in a Mantissa server. You can see the progress of this work on the Divmod tracker, where the ticket is nearing the end of its year-long journey, and already in its (hopefully) final review.

I'm particularly excited about this feature, because it completes the Mantissa programming model to the point where you can really use it. It's the part of the system that most directly impacts your own code, and thereby allows you to more completely dodge the bullet of modifying a bunch of your application's logic when you want to scale. There might be some dark corners — for example, a scalable API for interacting with the authentication system — but those should only affect a small portion of a small number of applications. Unfortunately communication between databases is not the only issue we have remaining.

There's more to the scaling problem than getting the application code to be the right shape. The infrastructure itself needs to present a container that does the heavy lifting of scalability for the code that it contains. For example, Mantissa needs a name server and a load balancer that will direct requests to the appropriate server for the given chunk of data. It also needs a sign-up and account management interface that will make an informed decision about where to locate a new user's data, and be able to transparently migrate users between servers if load patterns change. Finally there are enhanced features, like replicating read-only data to multiple hosts, for applications (for example, a blogging system) which have heavy concentrations of readers on small portions of data.

Finally there are problems of optimization. We haven't had much time to optimize Mantissa or Athena, and already on small-scale systems we have seen performance issues, especially given the large number of requests that an Athena page can generate. We need to make some time to implement the optimizations we know we need, and when we start scaling up our first really big system, I'm sure that we'll discover other areas that need tweaking.

Why Now?

I'm fond of saying that programming is like frisbee, and predictions more specific than "hey, watch this!" are dangerous. So you might wonder why I'm talking about such a long-running future plan in such detail. You might be wondering why I would think that you'd be interested in something that isn't finished yet. Perhaps you think it's odd that I've described the challenges in such detail rather than being more positive about how awesome it is.

While I certainly don't want to publicly commit to a time-frame for any of this work to be finished, I do feel pretty comfortable saying that it's going to happen. The design for scalability I've discussed here has been a core driving concern for Mantissa since its beginning, and it's something that's increasingly important to our business and our applications.

I'm being especially detailed about Mantissa's incompleteness because I want to make sure that potential users' expectations are set appropriately. I don't want anyone coming to the Divmod stack after having heard me say vague things about "scalability", believing that they'll get an application that scales to the moon.

I do think that this is an exciting time for other developers to get involved though. Mantissa is at a point where there are lots of bits of polish that need to be added to make it truly useful. Starting to investigate it for your application now will give you the opportunity to provide feedback while it's still being formed, before a bunch of final decisions have been made and a lot of application code has been written to rely on them.

More Later...

I've got more to say about scaling, Twisted, and Mantissa, of course. In particular I'd like to explain why I think Mantissa is an interesting scaling strategy and how it compares to the other ones. At this rate, though, I'll only write one blog post this year! I'm sure you hope as much as I do that the next one won't be so long...

Data In, Garbage Out

Wednesday June 04, 2008

"The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information." —Alan Perlis

I switched to blogger recently expecting a more "professional" blogging experience. I thought I'd be able to use a GUI editor and not concern myself with the details of the blog engine. Apparently I was wrong.

Writing that last post, I had some pretty serious problems with getting the formatting to come out right. Blogger does a couple of really terrible things:

When you switch between "Compose" and "Edit HTML" views, some amount of whitespace (although not all of it) is destroyed.
Even when posting using the ATOM API, the posted HTML is mangled in semi-arbitrary ways.
- Properly-quoted "<" and ">" (i.e. "<" and ">") are quoted again.
- Additional line-breaks are added.
-   is converted to white-space, and then
- white space is collapsed.

This is one of the reasons that I'm such a stickler for treating data as structured data, and not making arbitrary heuristic guesses about it. It's not just a matter of handling obscure, nerdy edge cases that average users won't run into. In fact, it's the opposite. Nerds (like myself) can figure out whether you're double-quoting your HTML entities or doing improper whitespace conversions. But what does a regular Joe do when a "frustrated" smiley (">.<") gets converted into some incomprehensible soup of HTML?

I was reminded of this same issue when reading a page on the Habari wiki:

"If you are going to produce real XHTML in a tool usable by ordinary users, then you cannot do it by string concatenation. You need to assemble your content by serializing an XML DOM tree.

If you want to allow plugins, then your plugin API cannot allow plugin authors to stick arbitrary strings in the output. Rather, they should be allowed to add nodes to the DOM tree, or to manipulate existing ones."

Strangely enough, this page concludes that the important thing is not to build their next-generation blogging tool on top of a technology that lets them produce valid output (serializing DOM trees) but that the important thing is not producing valid output, but string concatenation. They very clearly put an implementation technique above a good experience for users.

(This is your brain. This is your brain on PHP. Any questions?)

I don't want to pick on the Habari developers overmuch. After all, the problem that inspired this post was with Blogger, and Wordpress has the same issue. In fact, the Habari guys are mostly notable for having considered the implications of their decision so carefully; it's just a surprise to me that they walked all the way up to the right answer, looked at it, made sure it was right, and then decided to ignore it and keep on going.

Here's the surprise for the Habari developers, and basically everyone else who writes web applications that process HTML: it has nothing to do with XHTML. It is a general principle of software development. The only reason you notice when you're doing XHTML is that the browser isn't correcting for hundreds of minor mistakes, and rather than screwing up immediately it screws up one time in a thousand when a user managed to type a "<" or a "&".

You know what else you can't build with string concatenation? AVIs. PNGs. SWFs. Lots of data on the web is treated as structured, but only because it's too hard for the people who generally build web applications to generate it. If you want to write a program that takes input processes it, and returns output, you need an intermediary structure to hold that data so that you can ensure its validity.

That's not to say that it's always a bad idea to have user interfaces that allow people to type in a syntax that they know and understand, like an "HTML" view. Those interfaces might even be forgiving and correct for lots of errors. Adding line-breaks so that people can type newlines in a mishmash of pseudo-HTML is okay, as long as you know where that ends and your actual structured content begins. For example, if you include a WYSIWYG GUI editor, you should probably internally make sure that WYS really is WYG and you're not making the same kind of heuristic guesses about the data that your own tool generated as some stuff that a user with only a smattering of HTML knowledge typed in directly.

Keeping structured data structured is near and dear to my heart in large part because as systems get ultra large, the different pieces need to be able to talk to each other using clear and unambiguous formats. These points of integration, the places where system A talks to system B (a blogging system talks to a web browser or a blogging client, for example) are absolutely the most critical pieces to test, test, and test again. If you have a bug in your system, you can find it and fix it; but if you have a bug which only arises from an interaction between your system and two others, your test environment needs to be 3 times bigger, and the error is at least 3 times harder to catch. But it gets worse. If you're dealing with 4 systems, then your test environment is 4 times bigger - but the bug is 6 times harder to catch. And so on.

Fred Brooks observed that adding more programmers to a project running behind schedule makes it later. This is because of the additional channels of communication. Now imagine that one of your developers has a curious speech defect: when he says "lasagna" he actually means "critical bug", and vice versa. When he hears one, he understands it as the other. Working alone, this is a harmless eccentricity, but as soon as you put other developers into the mix, strange effects start taking place. He desperately tries to tell them about the delicious lasagna he had last night, and they can't understand why he's losing sleep over it. Or, he is sanguine as his fellow engineers tell him about all the italian food they're eating, while the business is losing millions of dollars.

It's sort of like if every time he said "<" the other developers understood him to mean "<".

If I ever have more than a few hours to work on it, eventually I'll deploy my own blogging platform and I'll know that it can handle HTML correctly. Until then though, I've worked out a strategy for posting to blogger which seems to mostly preserve the formatting that I want to see. I figure that other Python developers might be interested in this, since I frequently see posts to blogger which eat indentation.

I use ScribeFire as my HTML editor. It manages OK, except it doesn't include linebreaks, <p>s or <div>s to separate lines. So, leave the "Convert Line Breaks" option on in your blog's settings.
In "Settings -> Basic -> Global Settings", disable "show compose mode for all your blogs". The compose view is destructive, and switching between it and "Edit HTML" will eat whitespace each time you do it; it also seems to sometimes eat bits of formatting when you publish even if it's just on the page.
Edit a post in ScribeFire. To save drafts, use the "save as note" functionality. This doesn't publish it to be a blogger draft, but there's no way to get the data into blogger directly. You can use the HTML tab as you normally would, to add tags that aren't supported (such as "<pre>").
Switch to the HTML ("<A>") tab in scribefire.
1. select all.
2. copy.
Click "New Post" in the blogger web UI.
1. click in the text field.
2. paste.

The presence of numerous properly-escaped HTML characters in this post should be an indication that it works.