Databases and Twisted: When Threads Are OK (For Some Purposes)

Saturday December 20, 2008
Last month, a thread on the twisted-web mailing list got me thinking about a frequently implemented, but seldom understood usage of Twisted: writing applications backed by a traditional database server.  I tried to write a timely reply on the mailing list, but found what I had to say on the topic was overflowing the polite bounds of an email message.  I've tried to write about this before, buried in the middle of a post about something else.  I don't think I really got my message across, though, because I believe this was quoted as saying that I find asynchronous data-access APIs "extremely painful".  Asynchronousness is not the point that I find difficult, as much as I do transactionality (and integration with existing database bindings).

So this time, please bear with me as I explain enough context to properly frame my opinion.

I think that concurrency is a difficult problem that affects every aspect of your code, and so it is important to have a comprehensive, consistent, and easy to understand plan to deal with it in any given system.

Twisted's "cooperatively multitasking / callbacks-scheduled-by-I/O-and-timers" idiom is one concurrency model.  Deferreds are a super important convenience mechanism in that model, but they're not completely necessary; you can do this with just dataReceived, connectionLost, callLater etc.

In general this model - let's call it something memorable, like "CM/CSBIOAT" - is a pretty easy concurrency model to work with once you know how it works.  In particular, it's pretty easy to avoid making a common variety of serious concurrency mistakes, since you don't need to remember to declare any locks, and the behavior of the system under load is unsurprising, if not necessarily ideal for performance.

Threading is another concurrency model with which we are all familiar.  Shared-state multithreading is a pretty bad concurrency model for general use.  In particular, it's very easy to make mistakes that are impossible to diagnose or reproduce.  Despite its unsuitability for applications, threading can be a useful building-block as a low-level tool to construct higher-level concurrency models.  In many practical cases I am aware of, threading is the only available building-block at this level for building efficient implementations of other concurrency models, because operating systems and compilers don't provide anything better.

There is an antipattern that arises from a somewhat naive understanding of these two models.  The Twisted novitiate discovers that Twisted Is Good, and Threads Are Bad.  Experimentally they discover that this is indeed true, and that despite its eccentricities, writing and debugging "Twisted" code (whose benefits really come from the CM/CSBIOAT pattern) is a lot easier than writing and debugging threaded code.

So, our unfortunate Twisted novice now needs to write a database application: what to do?  Well, one way they can write it is to "just use threads" for data-access logic and communicate with Twisted some other how - for example, to put all their database logic in a function they pass to runInteraction.  The only other apparent option is to "use Deferreds" and invoke adbapi.ConnectionPool.runQuery or runOperation.  Deferreds are Twisted - Good!  Threads are ... threads!  Bad!  The answer seems obvious.

However, in choosing to use this facility, you've done far more than choosing between "twisted" and "threads".  If you use runInteraction, you can easily keep all of your work in a single transaction; since database APIs are blocking, you can only safely do a read followed by a write in the same transaction if you can block between those calls.  If you do a runQuery, take the result of that and pass it as input to a runOperation, you're sharing data between two different transactions and potentially two different cursors.  Whether Deferreds or good or not, this breaks the assumptions that the underlying database uses to keep its data consistent.  Consider incrementing a counter.  In the "threaded" case, you'd do something like this:

  def interaction(txn):
      x = doSql(txn, "select thingy from foo where bar = baz;")
      doSql(txn, "update foo set thingy = ? where bar = baz;", x+1)

This always results in foo.thingy being set to foo.thingy + 1.  If your database is set up properly (and most are by default) there's no opportunity for other code to execute between those two statements.

But in the "twisted" case you do something like this:

  def stuff(cp):
      x = yield cp.runQuery("select thingy from foo where bar = baz;");
      yield cp.runOperation("update foo set thingy = ? where bar = baz;", x+1)

As syntactically pleasant as that appears to be, and as convenient as it might seem to be able to call Twisted APIs as much as you want in the middle of this work, any amount of code can run between the first line and the second, thanks to those pesky 'yield's.  That means if you run 'stuff' twice, there's a good chance that your callbacks will stomp on each other and one of the increments will be lost.

Transactional relational database access is a really different concurrency model, all its own.  In many cases it appears to be the same as plain old shared-state multithreading; not least of which because it is implemented using threads and the threads are completely exposed to application code, and made part of the database interface's API.  However, using a transactional database to store your interesting state is much, much safer than just using threads to access any old datastructure.  An ACID database is specifically implemented to provide a consistent view of your data to any executing client at any time, and in the cases where that would be impossible, to schedule execution of various clients to provide an ordering where data is consistent.  (You'll notice that I have avoided saying "thread", but in practice an executing SQL client is a thread in your application.)  Caching middleware confuses this issue, making it more like regular multithreading; but in a good SQL database, using threads rather than just separate processes is just an optimization; one which should be completely transparent to your application code.

Axiom doesn't really have a concurrency model (it ought to, but that's a discussion for another day).  The idea there is that, like the rest of Twisted, you try hard never to block for too long.  It is possible — too easy, really — to screw this up and block for a long time waiting for the disk in an Axiom program, but to some extent that's true of any Python code.  Since Axiom is typically accessed by one, or at most two processes at a time, you won't end up blocking on your database for a long time because some other code is using it; the main thing Twisted's concurrency model is designed to prevent is your code blocking and getting stuck or being idle, not your code blocking at all.  So, I'm going to give you advice for using Storm or ADBAPI: the only advice for Axiom is "write fast queries".

Assuming that you're writing a traditional database application, here's my advice for you.

Let's assume that Storm (or ADBAPI) does not have any thread-safety bugs itself.  This assumption is unlikely to be completely true, but you probably have to make it regardless if you're going to use either of these things at all, regardless of my advice :).  With that assumption, you can use Storm (or ADBAPI) with Twisted from a thread-pool and pretend, in your application, that the threads do not exist.  You should avoid accessing global state and pretend that your code might be run in a subprocess or a thread or even on a different computer. If you're lucky, one day it will be, and your application will "magically" scale!  If you follow this simple discipline, you can cleanly interface between the Twisted concurrency model (where you do all of your non-database I/O) and the RDBMS concurrency model (where you interact with all of your "data" objects).

Don't touch any database objects in your Twisted mainloop.  Don't touch any Twisted objects in your database transactions.  This has the added benefit of not needing to worry that you're sending out information about partially-completed database operations to a network connection, or injecting potentially transient network state into a persistent database operation that may need to be re-tried.

In theory, there's nothing stopping an asynchronous data-access API from doing all of the same stuff that I just described threads doing.  All you'd need is good non-blocking database infrastructure, non-blocking transactions, and a bunch of code to associate a running transaction object with a particular database transaction and cursor.  It is possible, if you go down to the database-protocol level, to write a database wrapper which actually integrated with the Twisted concurrency model and treated your database as just another source of input and output.  In terms of preventing errors and assisting making code testable and deterministic, I think this would be an improvement over the threaded version of this solution.

However, implementing such an improvement would likely take quite a bit of time.  Time that most small database-backed projects don't have, so it's unlikely someone will need to scratch this particular itch any time soon.  Even if someone did do all that work for one database, it's likely that a lot of it would need to be done over again for each subsequent set of database bindings; so, using a DB-API module in a thread would remain the only way to retain database portability.

For the moment, threads and threadpools are the tools that existing database bindings give us to manage transactions, and it's likely that they're adequate for a huge majority of applications.  The only real problem is that you can't completely hide threads from the application and make sure they're not being used for evil.