I'm the guy who upgrades (plus I've got regression).

After talking to quite a few people about my "stuponfucious" essay, and mulling over the problem a bit more, I think that I've come up with a few guiding principles on maintaining a codebase which can effectively be upgraded without ever having to resort to drastic "stop the world" database destruction or shutdown measures. I've been wrestling to state these in a concise manner for the last few days, and I think I've finally taken enough words out to post them here.

I'd love for this to be generally applicable, but if it doesn't make sense to you, keep in mind I'm talking specifically about the Divmod product, Quotient.

  1. Upgrade Mechanism With Per-Object, Per-Class Granularity

    This means you need a mechanism which can upgrade each instance at the level of each of its classes, independently. This matters because it lets you do upgrades at the same level of the code that manages the state, keeping that knowledge close together and consistent.

    Twisted already handles this, so I won't belabor the point. However, in my previous posting here, I wasn't sure that providing this level of flexibility was a good thing. After considering a wide variety of use-cases, I now think it is, because it removes limitations on the kinds of changes you can adapt to. You still need to be careful about making difficult persistence decisions, but if you have to make one, you won't be stuck. Making a particular kind of feature impossible will not also make it unnecessary.

    I was finally convinced that this was a really useful facility when I realized that it's not important to keep total continuity of data. It's fine if, on a developer's local machine, they bring up a database, debug for a while, completely destroy that database, then bring up a new one and test again. In fact, the unit tests for persistence should probably work this way, so that we can get some assurances that the database actually works.

    I have made a list of everything I believe we need in order to encourage and implement the most stable policies regarding data upgrades, while remaining flexible and nimble the majority of the time.


  2. Examples and Tests

    If you are really interested in supporting past versions of persistence, then, like past versions of anything else you want to support, you must have regression tests. That means a consistent dump of every kind of object that a particular supported version can output, ideally in every state it can validly be in in storage.
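
    Here is a minimal sketch of what one such regression test might look like, assuming the old dumps are pickles checked into the tree; the file name, the class name, and the attributes being checked are all made up for illustration:

    import pickle
    import unittest

    class MessageUpgradeRegressionTest(unittest.TestCase):
        """One canned dump per supported version, per kind of object.
        The dump path and the attributes asserted on are hypothetical."""

        def test_upgradeFrom_0_9(self):
            # A pickle captured from a 0.9 database, checked in next to the tests.
            with open("dumps/message-0.9.pickle", "rb") as f:
                message = pickle.load(f)
            # With twisted.persisted.styles.Versioned the upgrade is lazy,
            # so force it to run right now.
            message.versionUpgrade()
            # Assert against the *current* schema, not the old one.
            self.assertTrue(hasattr(message, "headers"))
            self.assertFalse(hasattr(message, "_rawHeaderText"))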


  3. Reason To Keep Going

    All the development rules and testing discipline in the world won't stick if you don't have some tangible reason for the data to remain upgradable, and concrete points between which upgrades must work. Only a real, running service can provide that.

    Without a running service, everyone will just feel as though the upgraders being written are extra overhead, and in reality, if you're not running the code somewhere that needs to be upgraded periodically, they are.


  4. Staging Area

    Although we developers should be concerned with breaking persistence and therefore the running service, in practice this should be a very hard thing to do accidentally. Before a persistence change is rolled out to the running server, it should be run on a "staging" server with an aggressive upgrader. (Since "Versioned" upgrades lazily, it is important to actually touch every object in the system; a rough sketch of such an upgrader follows at the end of this item.)

    It is bad to break the code in CVS in a way that breaks the staging upgrade, but like any other test, hopefully it will catch things that we didn't think of.
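
    Roughly, the aggressive upgrader only needs to walk everything in the store and poke it. A sketch, assuming twisted.persisted.styles.requireUpgrade is available to force a single lazy upgrade, and with 'iterateEverything' standing in for whatever bulk enumeration the real database layer offers:

    from twisted.persisted import styles

    def upgradeEverything(store):
        """Force every lazily-Versioned object in the store to upgrade now,
        so a broken upgrader blows up on staging rather than in production."""
        failures = []
        for obj in store.iterateEverything():   # hypothetical enumeration
            if isinstance(obj, styles.Versioned):
                try:
                    styles.requireUpgrade(obj)
                except Exception as err:
                    failures.append((obj, err))
        return failures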

  5. Discrete Versions

    Some intermediary checkins may not be supported for upgrades, since the upgraders between releases will sometimes be broken. However, many intermediary versions will probably be supported, since upgrading at full-release granularity is often too large a step to test effectively. Each of these versions should have its own tag, or some other means to easily and quickly compare what has changed between the different persistence schemas. This is useful for maintenance in two ways -

  6. Distinction Between Structure and Content

    In any system that has a database layer which can be queried, sorted, and so on, the database layer itself will sometimes need upgrades: indexes added and removed, and the like. Since these changes can be long-running and do not necessarily affect application logic, they should be segregated and queued so that they can run as quickly as possible without waiting for all of your old objects to catch up.

    In Atop this means that Pools need to be annotated with type information, so that it's possible to update their indexes and remove or add items that may have come from somewhere else. There can be no general queueing mechanism, because the very place that the queue comes from differs from case to case. In general it will be a query over another pool, whose state one can save between iterations and then continue querying from.
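
    A rough sketch of what such a queued, resumable index rebuild might look like; none of these method names are Atop's real API, they just illustrate the shape of the thing:

    def rebuildIndex(targetPool, sourcePool, batchSize=500):
        """Re-derive one pool's index from a query over another pool, a
        batch at a time, saving a cursor between iterations so the work
        can resume after a restart instead of blocking everything else."""
        cursor = targetPool.loadCheckpoint()        # where we left off, if anywhere
        while True:
            batch = sourcePool.queryAfter(cursor, limit=batchSize)
            if not batch:
                break
            for item in batch:
                targetPool.reindex(item)
            cursor = batch[-1].key
            targetPool.saveCheckpoint(cursor)       # durable progress marker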


  7. Knowledge of When to Give Up

    Some things aren't new versions of old objects. Superficially they may be similar, but if you're implementing a whole new strategy and interface for manipulating your items, it's probably best to have your upgraders destroy the old objects and create new ones. In situations where the changes really are major, this is both less likely to produce cruft (it's easier to properly initialize a new object than to filter state by hand into a correct new shape on an old one) and easier to monitor as a long-running upgrade (it's hard to tell how many Version 3 Foos you have in the system, but it's easy to tell how many objects are in the 'Foo' index and how many are in the 'NewFoo' index).
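
    For illustration only - Foo, NewFoo, and the index objects here are all hypothetical - a "give up" upgrader looks more like a constructor call than a migration:

    class Foo:
        """Old-style object (hypothetical)."""
        def __init__(self, title, owner):
            self.title = title
            self.owner = owner

    class NewFoo:
        """Its replacement, with a genuinely new interface (hypothetical)."""
        def __init__(self, title, owner):
            self.title = title
            self.owner = owner
            self.tags = []          # state the old object never had

    def retireFoo(oldFoo, fooIndex, newFooIndex):
        # Build a fresh object from the state that still matters instead of
        # hand-filtering old attributes into a new shape on the old object.
        new = NewFoo(oldFoo.title, oldFoo.owner)
        newFooIndex.append(new)     # progress is just len(newFooIndex)
        fooIndex.remove(oldFoo)     # versus len(fooIndex) still waiting
        return new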


  8. Database Inspection Tools

    I've recently written a tool to provide developers with a simple visualization of what's going on within a running Twisted server. Tools like that, enhanced to support the persistence format, should be used to make sure that one has a complete understanding of the objects involved. (In our particular case, I recommend both that object inspector and 'pickletools.dis' for starters. We will need more powerful tools as time goes on.) As JP suggested, these tools should really include a way to manually tweak and destroy parts of a running database. This is important both as a development tool and as a last resort: sometimes, if an upgrader goes subtly wrong (non-subtle wrongness should REALLY be caught in the testing phase), some manual surgery will be required on a few persistent objects or indexes. An interactive prompt should always be the basis for such tools when possible.
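
    The pickletools half of that is already dead simple to use; here the Meeting class is just a stand-in for any persisted object:

    import pickle
    import pickletools

    class Meeting:
        def __init__(self):
            self.people = ["alice", "bob"]

    # Disassemble the pickle stream to see exactly which attributes and
    # references will actually end up in storage.
    pickletools.dis(pickle.dumps(Meeting()))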


  9. Knowledge of When to Abuse the Infrastructure

    The version upgrader is just running a function on your object. If it seems like the framework doesn't support a particular kind of upgrade, it probably does - you can invoke any code you want, schedule it to run later, kick off an upgrade queue, or whatever seems appropriate to your situation. Don't be afraid of creating scratch objects, temporary work-spaces, and other workarounds if your upgrade is complex.
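
    A sketch of that kind of abuse, with a made-up Mailbox class: the upgrader itself only records what needs doing and schedules the real work to happen after startup. (reactor.callLater is real Twisted; everything else here is hypothetical.)

    from twisted.internet import reactor
    from twisted.persisted.styles import Versioned

    class Mailbox(Versioned):
        persistenceVersion = 4

        def upgradeToVersion4(self):
            # An upgrader is just a function: stash the old data in a
            # scratch attribute now, and schedule the expensive conversion
            # to run once the server is back up.  (Mailbox, self.messages,
            # and _rebuildSummaries are all made up for this sketch.)
            self._pendingSummaryRebuild = list(self.messages)
            reactor.callLater(0, self._rebuildSummaries)

        def _rebuildSummaries(self):
            for msg in self._pendingSummaryRebuild:
                pass    # long-running conversion work, a batch at a time
            del self._pendingSummaryRebuild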

There are several problems I'm not quite sure how to address, such as the proliferation of upgrade code and compensating for buggy past upgraders - however, on a case-by-case basis I don't think that these issues will be significant.

The most challenging thing to provide here will be the test-case data. Even disregarding the same problems we always run across when looking for a decent corpus of email test data - email tends to be private - we're going to have to provide a tool to dump a live database into a test-friendly format, and then a way to verify that an upgrade "worked". I believe this will be challenging because the only way to really test whether the upgrade worked is to simulate a great number of interactions and poke as many of the moving parts of each upgraded object as possible. Because each version differs from the last, these tests are likely to keep changing with each release, and will have to be kept in sync with the other unit tests for similar kinds of objects. Parameterizable unit tests would be a big help here, although I don't see how to make trial do that easily.
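
One way to get something like parameterized tests out of plain unittest (and therefore trial) is to generate a test method per canned dump at import time. A sketch, with hypothetical dump paths and a hypothetical checkInvariants() sanity hook on the upgraded objects:

import glob
import pickle
import unittest

class DumpUpgradeTests(unittest.TestCase):
    """One generated test method per canned dump file."""

def _makeTest(path):
    def test(self):
        with open(path, "rb") as f:
            obj = pickle.load(f)
        obj.versionUpgrade()                     # force the lazy upgrade
        self.assertTrue(obj.checkInvariants())   # hypothetical sanity hook
    return test

for _path in glob.glob("dumps/*.pickle"):
    _name = "test_" + _path.replace("/", "_").replace(".", "_").replace("-", "_")
    setattr(DumpUpgradeTests, _name, _makeTest(_path))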

In short, these will be regression tests that have to change every time the code is updated to still properly test the regression. We're actually regressing into the future.

In a future update, I plan to provide more complete examples of how one would do particular kinds of "refactoring" upgrades that are likely to be common - for example, converting a small Python list of objects into a Pool (and vice-versa).

I am in pain because I cannot sleep

... but "Full Metal Alchemist" rocks.

Connecticut sucks.

I was stuck in traffic for about 2 hours solid (without moving) and then had to hunt for gas (another 30 minutes) in Connecticut. No 24-hour gas places. They're crazy.

worse was "better", but now "superior" - better now stuponfucious and we don't know what it was

In case the title didn't make it clear, this is a blog entry about data persistence discussing similar themes to RPG's "Worse is Better" essay.

At work, I've been thinking about the problem of versioning data quite a bit. It's a nasty problem, but I think I've gotten it down to a nice, simple paradox:

"The easier you make version migration, the harder version migration will be."

This is a very weird and particularly intense microcosm of "worse is better". In this case, the difficulty involved in using a tool actually makes the task that the tool has to perform easier. In other words, subjective quality improves as objective quality degrades. Or something like that. Here's a case in point:

In Python, you can easily serialize any object regardless of its layout, as long as it doesn't contain something completely nonsensical like an open file or a graphical button. It takes no extra code.

In Twisted, and by extension in Quotient, any object that is even theoretically persistent in this manner can be upgraded by inheriting a single superclass (twisted.persisted.styles.Versioned) and writing a single method (upgradeToVersion1... or upgradeToVersion7; write as many as you like and they'll be run in order) which shuffles things around inside the object until they're consistent with the current world-view. This is about as easy as upgrading from one version to another can get - the upgrade function is almost always completely self-explanatory:

from twisted.persisted.styles import Versioned

class Meeting(Versioned):
    persistenceVersion = 1

    def upgradeToVersion1(self):
        # stored IDs instead of references before, oops!
        self.people = [reference(self.database, x) for x in self._peopleIDs]
        del self._peopleIDs


This is the MIT style of persistence design. (Oddly enough, written while I'm living at MIT.) It is complete, it takes every case into account (there is even code for upgrade ordering dependencies, if you care about such things) and it values simplicity for the user (the "business object" author) rather than the implementor (the maintainer of the framework).

Now, for an example of the New Jersey approach, I will refer to some code that I actually wrote in New Jersey. (This is code that I have done my best to erase from the internet's collective memory. If any of you offer up the ridicule that it so richly deserves, so help me I will erase you in the same fashion.)

Unfortunately I haven't had much experience with this style of persistence, although I am aware that many popular systems use it, including software that costs thousands of dollars per copy and does Very Important Things Indeed for Fortune 500 companies.

The style which I am speaking of is explicit persistence; in other words, you have to write a new method for every new object you want to persist, even if it's something dirt simple like two ints and a string. Then, whenever you want to change anything about an object, you have to modify the code that saves and loads that object. In the code in question - this was the original Twisted Reality codebase - there were 2 methods of note for saving and loading objects. One was public String content() in Thing. The other was [package] boolean handleIt(String) in ThingFactory. These methods were 110 and 283 lines long, respectively. If you wanted to add a new attribute to anything, you had to add some code that looked like this:

else if (tok == "component")
{
    thi.setComponent(true);
    Age.log("Component bit set", Age.EXTREMELY_VERBOSE);
    return true;
}
and like this:

if(isComponent())
{
    r.append("\n\tcomponent");
}
to each of those methods. If you actually wanted to change something about the structure of an object, you had to create magic, temporary declarations in the persistence and then interpret them later. Certain kinds of changes weren't even possible.

Now, when a designer who was thus far ignorant of the subtle mysteries of persistence comes across these views of the world, the choice here is obvious. The former is so clearly superior that the latter seems like a cruel joke. It breaks encapsulation! It adds a huge cost to change! It binds unrelated aspects of the code together inextricably! It creates arbitrary, artificial, and unnecessary limitations!

Not all these problems are endemic to explicit persistence, but I wanted to present an obvious straw man here so that it would be really surprising when I couldn't knock it down.

Don't get me wrong - the implicit, MIT-style strategy is certainly nicer to work with when one is writing programs. Given the explicit task of upgrading a few simple objects (of the style present in the code from which the New Jersey example is taken) to a new and better representation, the Right Thing wins hands down. This scales up, too - there were no complicated objects in the NJ code specifically because the persistence code pretty much fell over whenever you tried to do anything complicated, so writing and upgrading complicated objects MIT-style is certainly easier than impossible.

But there is a curious phenomenon that takes place when looking at the larger issues pertaining to codebases using these two approaches. In about 2 years of maintaining TwistedReality/Java, when there were bugs in the persistence, they were obvious and immediate - you could dump an object and identify the problem with its representation in a few moments. More importantly, pretty much every version was backwards-compatible with old persistent data. You never had to "blow away" your map files, because they would just load without the spiffy features available in the new map format. And finally, no contributor to the project ever checked in code which broke these constraints - every data-layout change included appropriate persistence changes.

In only about 6 months of maintaining a Right Thing codebase (Quotient) this is certainly not the case. Close to shipping now, we are still wrestling with a system which requires the database be destroyed on every other run. Nobody wants to write persistence code, and hey, the system works if you don't, so why bother? We don't have a policy in place that mandates that everyone must write an upgrader every time they change anything, and again, nobody wants to write persistence code, so since they don't have to they won't. This includes me, so I understand the impulse quite well.

This isn't entirely a fair comparison, of course. TR/J included a lot less persistence-sensitive data than Q does. It had a far simpler charter. It didn't use a database layout, just pure object persistence. However, from experience with the Twisted 'tap' format, those issues are peripheral - Twisted devs generally don't like taking the time to write persistence upgrade functions either, and there are periodic upgrade snafus. What really matters, of course, is that nobody trusts taps to stay stable worth a damn, even though we try really hard to make sure they will be.

Also, towards the end of its life (although there is some question as to whether it is really dead) TR/J began inheriting some characteristics of the Right Thing model (in particular, dynamic properties of arbitrary type), which in turn began creating the same syndrome of problems. In that case, it manifested itself as certain features breaking on particular objects from version to version and requiring operator interventions to fix the data rather than whole-system upgrade explosions, but nevertheless, we couldn't quite shoehorn all the features we needed into a static, single-object model of persistence.

Python has tempted us, we have taken the bite from the Pickle, and we can't ever go back again. A persistence strategy as clearly brain-dead as the one featured in TR/J just isn't going to cut it with the feature-set that we need to support in Quotient. However, we desperately need to encourage the developmental behaviors which that system encouraged, especially keeping a running system going with the same data for an indefinite period of time.

What did the Jersey style do right, then?

  1. Forced Consideration of Impact

    Every time that a programmer made a change to an object that might affect that object's persistence, they had to make a change to the persistence code as well, or they effectively couldn't use their new feature. The data just wouldn't load. This meant that, when faced with a potentially complicated new data structure, they would always ask themselves, "Do I really need to add that?" This might seem like an artificial burden, but in reality it more closely reflected the real cost of change while keeping the actual requirements satisfied, rather than making the cost of change seem artificially low while constantly violating the requirements in the name of expediency.

  2. Immediate Feedback and Testability

    The persistence format was also the introspection format, because it was so simple. Whenever a developer made a change to the persistence code, they could immediately see that change in a very direct way, making it easy to see if they made a mistake. If that code had had any tests (NOT A WORD, I SAID!) then writing them would have been relatively easy too. With an implicit persistence mechanism, the only way to write such a test is to keep an exact, unreadable copy of an old object's data (and of course, all the context that object kept around).

  3. Programmer-Valuable Data Associated with the System

    This is more specific to TR - as we were working on the code, we were also developing a companion dataset, stored in TR's own format, which was just as important to the project as the code itself. It was absolutely 100% imperative for every developer to keep that data working in every minor iteration of the code, because otherwise we couldn't test. I think that ultimately every data-storing project needs something like this to make the developers truly care about versioning.

  4. Separation of Run-Time and Saved State

    All that grotty string-mashing code in TR actually served a purpose - it stripped implementation-dependent run-time data out of the saved file. This meant that we were free to change the implementation of lots of structures (for example, switching a list of strings to a dictionary of string:int, or vice versa) without updating the data files and without persistence changes, as long as the result could be represented in the same way. In an automatic system, these implementation details are indistinguishable from the core abstract state of an object.

    Oddly enough, although it is brought about through forced duplication of effort (manual specification of the persistence format), it reduced the amount of upgrade code necessary. Because the persistence format was very abstract, you never had to write an upgrader to go from one implementation of behavior for the same persistent data to another. While changing persistence can be frequent, changing implementation is almost by definition more frequent.
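
    In the pickle world, the nearest equivalent is __getstate__/__setstate__. A sketch with a made-up Roster class, whose run-time structure can change freely because only the abstract state is ever saved:

    class Roster:
        """Membership is the abstract state; whether it is kept as a list
        or a dict at run time is an implementation detail."""

        def __init__(self, names=()):
            self._members = {name: 0 for name in names}   # today's run-time shape

        def __getstate__(self):
            # Persist only the abstract fact of membership.
            return {"members": sorted(self._members)}

        def __setstate__(self, state):
            # Rebuild whatever run-time structure the current code prefers;
            # switching dict-vs-list later needs no upgrader.
            self._members = {name: 0 for name in state["members"]}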


I think that's a relatively complete summary of the advantages of manual persistence work, although I'd love to hear comments upon it.

How can we replicate these advantages?

I think that an important first step is to find some simple, lightweight way to completely express the necessary information for the persistence of an object. Even if we still use Pickle to store this data, an explicit specification of what it should look like would be a good mental exercise for us. It would also provide a means to test upgrading and to represent the format of old versions without having to copy their entire implementation. In short, the "schema" that Twisted World provided and New PB is about to provide again. The outstanding work in my sandbox on indexed and referenced properties in Atop is an important first step here.
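
To make the idea concrete, here is the sort of declaration I mean - purely illustrative, since none of these names come from Atop, Twisted World, or New PB:

# A declarative statement of what an object's stored form looks like,
# separate from its implementation.
MESSAGE_SCHEMA_V2 = {
    "version": 2,
    "attributes": {
        "subject": str,
        "received": float,      # seconds since the epoch
        "senderID": int,        # reference into the People pool
    },
}

def conformsTo(obj, schema):
    """Check a freshly built (or freshly upgraded) object against the
    declared shape, without needing the old implementation around."""
    return all(
        isinstance(getattr(obj, name, None), kind)
        for name, kind in schema["attributes"].items()
    )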

We also need some critical data to be stored in the database that can't be exported, re-loaded, or otherwise passed through some external crutch versioning mechanism. We need to care about our core data's dependability as we move forward.

We also need to decide what kinds of data we really care about. There are certain aspects of the Quotient application which are developing so fast that it's impossible to effectively represent them persistence-wise, and it would really be a waste of time. Such objects should probably never be persisted in the first place - just provide an 'experimental' flag or somesuch, indicating that the object should never touch an on-disk database. When this becomes burdensome, the programmer can un-set it and manually start performing updates.
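
Something as small as this would do; the flag name and the check are both hypothetical:

class FancyNewThing:
    # Still changing too fast to be worth writing upgraders for; the
    # store would check this flag before persisting anything.
    experimental = True

def maybeStore(store, obj):
    if getattr(obj, "experimental", False):
        raise RuntimeError("%r is experimental; refusing to persist it" % (obj,))
    store.add(obj)      # 'store' stands in for the real database layer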

There's more to say, but this has already been quite a ramble! I hope that you've enjoyed reading it. Please keep in mind that I would like feedback and more ideas about how to perform the transitions I've suggested over a relatively large existing system. (Quickly, of course. And cheaply. ^_^)

Until next time.

lunchtime apocalypse

This is hastily transcribed almost verbatim from one side of an IM conversation, so apologies if I left something important out.

I just read Jason Asbahr's blog. He has a great link to a speech by Michael Crichton about the state of the environmentalist movement.

In keeping with his religious theme, I think his argument backs up my general feeling that the end of all life is really the only solution to the eternal questions that humanity struggles with.

It would be nice if Crichton put together a website to factually verify some of those claims he's making - such as the claim that DDT is harmless and its ban has killed millions worldwide - but the specific examples weren't terribly interesting to me. I'm willing to accept the possibility of their truth on the basis of numerous other examples that I've heard better factual evidence for.

Not a big environmentalist myself, I still find the main thrust of his speech really unnerving, though I don't think he approached it directly enough - how do you mobilize the public to act upon vague, potentially unpleasant, and most of all, changing information?

How do you sensationalize, propagandize, and thereby positively politicize rational thought? It seems like you'd have to in a mass-media democratic society if you want it to be allowed to have any impact. Ayn Rand tried pretty hard, and look where that got her. I've talked about what's wrong with objectivism before. Not all of it is due to the problem of mass-media exposure and misinterpretation, but the end result is the same - one set of assumptions gets substituted for another, and you repurpose a few key buzzwords - "reason", "truth", and "love" to name a few - to mean something different than what they did in your previous ontological framework, and you're done. (Extra credit: find out what these three words mean in your favorite belief system, and how their meanings differ from the dictionary definitions.)

I've started to think lately that you really can't, that it's impossible, and that people have to be inculcated with a critical spirit from birth or they will never really acquire one. They may sway from the Church of Christ to the Church of Rand to the anti-church of LaVey, but they'll continue holding extreme positions with re-heated fallacious arguments, without questioning any of their beliefs until they have a crisis, at which point they question ALL of them.

More optimistically, it doesn't really matter that you can't discuss subtle and complicated issues in a broad public forum. Humanity staggers on anyway. The important thing is to prevent the criminalization of rational discourse and inquiry. That conflict is substantially easier to dumb down and sensationalize. At any given time, in the USA anyway, there is inevitably an unpopular thinker whose punishment is far enough out of proportion with his (thought)crime that the public can express some indignant outrage about allowing him to think whatever he's thinking (usually, with the implicit subtext of "as long as I don't have to listen to it").

The wonderful thing about rational thought is that it actually allows you to manipulate the natural world more effectively, which makes it less important that the dominant forces in society care about your methods and beliefs - as long as they're not actively working against you.

And that's why I'm a member of the Electronic Frontier Foundation. :-)