worse was "better", but now "superior" - better now stuponfucious and we don't know what it was

Saturday December 13, 2003
In case the title didn't make it clear, this is a blog entry about data persistence discussing similar themes to RPG's "Worse is Better" essay.

At work, I've been thinking about the problem of versioning data quite a bit. It's a nasty problem, but I think I've gotten it down to a nice, simple paradox:

"The easier you make version migration, the harder version migration will be."

This is a very weird, and particularly intense microcosm of "worse is better". In this case, the difficulty involved in using a tool actually makes the task that the tool has to perform easier. In other words, subjective quality improves as objective quality degrades. Or something like that. Here's a case in point:

In Python, you can easily serialize any object regardless of its layout, as long as it doesn't contain something completely nonsensical like an open file or a graphical button. It takes no extra code.

In Twisted, and by extension in Quotient, any object that is even theoretically persistent in this manner can be upgraded by inheriting a single superclass (twisted.persisted.styles.Versioned) and writing a single method (upgradeToVersion1... or upgradeToVersion7. write as many as you like and they'll be run in order) which shuffles things around inside the object until they're consistent with the current world-view. This is about as easy as upgrading from one version to another can get - the upgrade function is almost always completely self-explanitory:

class Meeting(Versioned):
persistenceVersion = 1
def upgradeToVersion1(self):
# stored IDs instead of references before, oops!
self.people = [reference(self.database, x) for x in self._peopleIDs]
del self._peopleIDs

This is the MIT style of persistence design. (Oddly enough, written while I'm living at MIT.) It is complete, it takes every case into account (there is even code for upgrade ordering dependencies, if you care about such things) and it values simplicity for the user (the "business object" author) rather than the implementor (the maintainer of the framework).

Now, for an example of the New Jersey approach, I will refer to some code that I actually wrote in New Jersey. (This is code that I have done my best to erase from the internet's collective memory. If any of you offer up the ridicule that it so richly deserves, so help me I will erase you in the same fashion.)

Unfortunately I haven't had much experience with this style of persistence, although I am aware that many popular systems use it, including software that costs thousands of dollars per copy and does Very Important Things Indeed for fortune 500 companies.

The style which I am speaking of is explicit persistence, in other words, you have to write a new method for every new object you want to persist, even if it's something dirt simple like two ints and a string. Then, whenever you want to change anything about an object, you have to modify some code that saved and loaded that object. In the code in question - this was the original Twisted Reality codebase - there were 2 methods of note for saving and loading objects. One was public String content() in Thing. The other was [package] boolean handleIt(String) in ThingFactory. These methods were 110 and 283 lines long, respectively. If you wanted to add a new attribute to anything, you had to add some code that looked like this:

else if (tok == "component")
Age.log("Component bit set",Age.EXTREMELY_VERBOSE);
return true;
and like this:

to each of those methods. If you actually wanted to change something about the structure of an object, you had to create magic, temporary declarations in the persistence and then interpret them later. Certain kinds of changes weren't even possible.

Now, when a designer who was thus far ignorant of the subtle mysteries of persistence comes across these views of the world, the choice here is obvious. The former is so clearly superior that the latter seems like a cruel joke. It breaks encapsulation! It adds a huge cost to change! It binds unrelated aspects of the code together inextricably! It creates arbitrary, artificial, and unnecessary limitations!

Not all these problems are endemic to explicit persistence, but I wanted to present an obvious straw man here so that it would be really surprising when I couldn't knock it down.

Don't get me wrong - the latter strategy is certainly nicer to work with when one is writing programs. Given the explicit task of upgrading a few simple objects (of the style present in the code from which the New Jersey example is taken) to a new and better representation, the Right Thing wins hands down. This scales up, too - there were no complicated objects in the NJ code specifically because the persistence code pretty much fell over whenever you tried to do anything complicated, so writing and upgrading complicated objects MIT-style is certainly easier than impossible.

But, there is a curious phenomenon that takes place looking at the larger issues pertaining to codebases using these two approaches. In about 2 years of maintaining TwistedReality/Java, when there were bugs in the persistence, they were obvious and immediate - you could dump and object and identify the problem with its representation in a few moments. More importantly, pretty much every version was backwards-compatible to old persistent data. You never had to "blow away" your map files, because they would just load without the spiffy features available in the new map format. And finally, no contributor to the project ever checked in code which broke these constraints - every data-layout change included appropriate persistence changes.

In only about 6 months of maintaining a Right Thing codebase (Quotient) this is certainly not the case. Close to shipping now, we are still wrestling with a system which requires the database be destroyed on every other run. Nobody wants to write persistence code, and hey, the system works if you don't, so why bother? We don't have a policy in place that mandates that everyone must write an upgrader every time they change anything, and again, nobody wants to write persistence code, so since they don't have to they won't. This includes me, so I understand the impulse quite well.

This isn't entirely a fair comparison, of course. TR/J included a lot less persistence-sensitive data than Q does. It had a far simpler charter. It didn't use a database layout, just pure object persistence. However, from experience with the Twisted 'tap' format, those issues are peripheral - Twisted devs. generally don't like taking the time to write persistence upgrade functions either. There are periodically upgrade snafus. What really matters, of course, is that nobody trusts taps to stay stable worth a damn, even though we try really hard to make sure they will be.

Also, towards the end of its life (although there is some question as to whether it is really dead) TR/J began inheriting some characteristics of the Right Thing model (in particular, dynamic properties of arbitrary type), which in turn began creating the same syndrome of problems. In that case, it manifested itself as certain features breaking on particular objects from version to version and requiring operator interventions to fix the data rather than whole-system upgrade explosions, but nevertheless, we couldn't quite shoehorn all the features we needed into a static, single-object model of persistence.

Python has tempted us, we have taken the bite from the Pickle, and we can't ever go back again. A persistence strategy as clearly brain-dead as the one featured in TR/J just isn't going to cut it with the feature-set that we need to support in Quotient. However, we desperately need to encourage the developmental behaviors which that system encouraged, especially keeping a running system going with the same data for an indefinite period of time.

What did the Jersey style do right, then?

  1. Forced Consideration of Impact

    Every time that a programmer made a change to an object that might affect that object's persistence, they had to make a change to the persistence code as well, or they effectively couldn't use their new feature. The data just wouldn't load. This meant that, when faced with a potentially complicated new data structure, they would always ask themselves, "Do I really need to add that?" This might seem like an artificial burden, but in reality it more closely reflected the real cost of change while keeping the actual requirements satisfied, rather than making the cost of change seem artificially low while constantly violating the requirements in the name of expediency.

  2. Immediate Feedback and Testability

    The persistence format was also the introspection format, because it was so simple. Whenever a developer made a change to the persistence code, they could immediately see that change in a very direct way, making it easy to see if they made a mistake. If that code had had any tests (NOT A WORD, I SAID!) then writing them would have been relatively easy too. With an implicit persistence mechanism, the only way to write such a test is to keep an exact, unreadable copy of an old object's data (and of course, all the context that object kept around).

  3. Programmer-Valuable Data Associated with the System

    This is more specific to TR - as we were working on the code, we were also developing a companion dataset, stored in TR's own format, which was equally important to the project as the code itself. It was absolutely 100% imperative to every developer to keep that data working in every minor iteration of the code, because otherwise we couldn't test. I think that ultimately every data-storing project needs something like this to make the developers truly care about versioning.

  4. Separation of Run-Time and Saved State

    All that grotty string-mashing code in TR actually served a purpose - it stripped implementation-dependent run-time data out of the saved file. This meant that we were free to change the implementation of lots of structures without updating the data files (for example, switching a list of strings to a dictionary of string:int, or vice versa) without persistence changes, as long as it could be represented in the same way. In an automatic system, these implementation details are indistinguishable from the core abstract state of an object.

    Oddly enough, although it is brought about through forced duplication of effort (manual specification of the persistence format), it reduced the amount of upgrade code necessary. Because the persistence format was very abstract, you never had to write an upgrader to go from one implementation of behavior for the same persistent data to another. While changing persistence can be frequent, changing implementation is almost by definition more frequent.

I think that's a relatively complete summary of the advantages of manual persistence work, although I'd love to hear comments upon it.

How can we replicate these advantages?

I think that an important first step is to find some simple, lightweight way to completely express the necessary information for the persistence of an object. Even if we still use Pickle to store this data, an explicit specification of what it should look like would be a good mental exercise for us. It would also provide a means to test upgrading and to represent the format of old versions without having to copy their entire implementation. In short, the "schema" that Twisted World provided and New PB is about to provide again. The outstanding work in my sandbox on indexed and referenced properties in Atop is an important first step here.

We also need some critical data to be stored in the database that can't be exported, re-loaded, or otherwise passed through some external crutch versioning mechanism. We need to care about our core data's dependability as we move forward.

We also need to decide what kinds of data we really care about. There are certain aspects of the Quotient application which are developing so fast that it's impossible to effectively represent them persistence-wise, and it would really be a waste of time. Such objects should probably never be persisted in the first place - just provide an 'experimental' flag or somesuch, indicating that the object should never touch an on-disk database. When this becomes burdensome, the programmer can un-set it and manually start performing updates.

There's more to say, but this has already been quite a ramble! I hope that you've enjoyed reading it. Please keep in mind that I would like feedback and more ideas about how to perform the transitions I've suggested over a relatively large existing system. (Quickly, of course. And cheaply. ^_^)

Until next time.