... but "Full Metal Alchemist" rocks.
I was stuck in traffic for about two hours solid (without moving) and then had
to hunt for gas (another thirty minutes) in Connecticut. No 24-hour gas stations.
They're crazy.
In case the title didn't make it clear, this is a blog entry about data
persistence, discussing themes similar to those in Richard Gabriel's (RPG's)
"Worse is Better" essay.
At work, I've been thinking about the problem of versioning data quite a bit. It's a nasty problem, but I think I've gotten it down to a nice, simple paradox:
"The easier you make version migration, the harder version migration will be."
This is a very weird and particularly intense microcosm of "worse is better". In this case, the difficulty involved in using a tool actually makes the task that the tool has to perform easier. In other words, subjective quality improves as objective quality degrades. Or something like that. Here's a case in point:
In Python, you can easily serialize any object regardless of its layout, as long as it doesn't contain something completely nonsensical like an open file or a graphical button. It takes no extra code.
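For a sense of how little work that is, here's a minimal sketch using the standard pickle module (the Bookmark class is made up purely for illustration):

import pickle

class Bookmark:
    def __init__(self, title, url):
        self.title = title
        self.url = url

# No persistence code was written for Bookmark, but pickle can
# serialize and reconstruct it anyway, layout and all.
original = Bookmark("Worse is Better", "http://example.com/worse-is-better")
blob = pickle.dumps(original)
restored = pickle.loads(blob)
assert restored.url == original.url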
In Twisted, and by extension in Quotient, any object that is even theoretically persistent in this manner can be upgraded by inheriting from a single superclass (twisted.persisted.styles.Versioned) and writing a single method (upgradeToVersion1... or upgradeToVersion7; write as many as you like and they'll be run in order) which shuffles things around inside the object until they're consistent with the current world-view. This is about as easy as upgrading from one version to another can get - the upgrade function is almost always completely self-explanatory:
class Meeting(Versioned):
    persistenceVersion = 1

    def upgradeToVersion1(self):
        # stored IDs instead of references before, oops!
        self.people = [reference(self.database, x) for x in self._peopleIDs]
        del self._peopleIDs
This is the MIT style of persistence design. (Oddly enough, written while I'm living at MIT.) It is complete, it takes every case into account (there is even code for upgrade ordering dependencies, if you care about such things), and it values simplicity for the user (the "business object" author) rather than for the implementor (the maintainer of the framework).
Now, for an example of the New Jersey approach, I will refer to some code that I actually wrote in New Jersey. (This is code that I have done my best to erase from the internet's collective memory. If any of you offer up the ridicule that it so richly deserves, so help me I will erase you in the same fashion.)
Unfortunately I haven't had much experience with this style of persistence, although I am aware that many popular systems use it, including software that costs thousands of dollars per copy and does Very Important Things Indeed for Fortune 500 companies.
The style I am speaking of is explicit persistence: in other words, you have to write a new method for every new object you want to persist, even if it's something dirt-simple like two ints and a string. Then, whenever you want to change anything about an object, you have to modify the code that saved and loaded that object. In the code in question - this was the original Twisted Reality codebase - there were two methods of note for saving and loading objects. One was public String content() in Thing. The other was [package] boolean handleIt(String) in ThingFactory. These methods were 110 and 283 lines long, respectively. If you wanted to add a new attribute to anything, you had to add some code that looked like this:
else if (tok == "component")
{
    thi.setComponent(true);
    Age.log("Component bit set", Age.EXTREMELY_VERBOSE);
    return true;
}
and like this:
if (isComponent())
{
    r.append("\n\tcomponent");
}
to each of those methods. If you actually wanted to change something about the structure of an object, you had to create magic, temporary declarations in the persistence format and then interpret them later. Certain kinds of changes weren't even possible.
Now, when a designer thus far ignorant of the subtle mysteries of persistence comes across these two views of the world, the choice is obvious. The former is so clearly superior that the latter seems like a cruel joke. It breaks encapsulation! It adds a huge cost to change! It binds unrelated aspects of the code together inextricably! It creates arbitrary, artificial, and unnecessary limitations!
Not all these problems are endemic to explicit persistence, but I wanted to present an obvious straw man here so that it would be really surprising when I couldn't knock it down.
Don't get me wrong - the former strategy is certainly nicer to work with when one is writing programs. Given the explicit task of upgrading a few simple objects (of the style present in the code from which the New Jersey example is taken) to a new and better representation, the Right Thing wins hands down. This scales up, too - there were no complicated objects in the NJ code precisely because the persistence code pretty much fell over whenever you tried to do anything complicated, so writing and upgrading complicated objects MIT-style is certainly easier than impossible.
But there is a curious phenomenon that appears when you look at the larger issues pertaining to codebases using these two approaches. In about two years of maintaining TwistedReality/Java, when there were bugs in the persistence, they were obvious and immediate - you could dump an object and identify the problem with its representation in a few moments. More importantly, pretty much every version was backwards-compatible with old persistent data. You never had to "blow away" your map files, because they would just load without the spiffy features available in the new map format. And finally, no contributor to the project ever checked in code which broke these constraints - every data-layout change included appropriate persistence changes.
In only about six months of maintaining a Right Thing codebase (Quotient), this is certainly not the case. Close to shipping now, we are still wrestling with a system which requires that the database be destroyed on every other run. Nobody wants to write persistence code, and hey, the system works if you don't, so why bother? We don't have a policy in place that mandates that everyone write an upgrader every time they change anything, and again, nobody wants to write persistence code, so since they don't have to, they won't. This includes me, so I understand the impulse quite well.
This isn't entirely a fair comparison, of course. TR/J included a lot less persistence-sensitive data than Quotient does. It had a far simpler charter. It didn't use a database layout, just pure object persistence. However, from experience with the Twisted 'tap' format, those issues are peripheral - Twisted devs generally don't like taking the time to write persistence upgrade functions either. There are periodic upgrade snafus. What really matters, of course, is that nobody trusts taps to stay stable worth a damn, even though we try really hard to make sure they will.
Also, towards the end of its life (although there is some question as to whether it is really dead) TR/J began inheriting some characteristics of the Right Thing model (in particular, dynamic properties of arbitrary type), which in turn began creating the same syndrome of problems. In that case, it manifested itself as certain features breaking on particular objects from version to version and requiring operator interventions to fix the data rather than whole-system upgrade explosions, but nevertheless, we couldn't quite shoehorn all the features we needed into a static, single-object model of persistence.
Python has tempted us, we have taken the bite from the Pickle, and we can't ever go back again. A persistence strategy as clearly brain-dead as the one featured in TR/J just isn't going to cut it with the feature-set that we need to support in Quotient. However, we desperately need to encourage the developmental behaviors which that system encouraged, especially keeping a running system going with the same data for an indefinite period of time.
What did the Jersey style do right, then?
- Forced Consideration of Impact
Every time that a programmer made a change to an object that might affect that object's persistence, they had to make a change to the persistence code as well, or they effectively couldn't use their new feature. The data just wouldn't load. This meant that, when faced with a potentially complicated new data structure, they would always ask themselves, "Do I really need to add that?" This might seem like an artificial burden, but in reality it more closely reflected the real cost of change while keeping the actual requirements satisfied, rather than making the cost of change seem artificially low while constantly violating the requirements in the name of expediency.
- Immediate Feedback and Testability
The persistence format was also the introspection format, because it was so simple. Whenever a developer made a change to the persistence code, they could immediately see that change in a very direct way, making it easy to see if they made a mistake. If that code had had any tests (NOT A WORD, I SAID!) then writing them would have been relatively easy too. With an implicit persistence mechanism, the only way to write such a test is to keep an exact, unreadable copy of an old object's data (and of course, all the context that object kept around).
- Programmer-Valuable Data Associated with the System
This is more specific to TR - as we were working on the code, we were also developing a companion dataset, stored in TR's own format, which was equally important to the project as the code itself. It was absolutely 100% imperative to every developer to keep that data working in every minor iteration of the code, because otherwise we couldn't test. I think that ultimately every data-storing project needs something like this to make the developers truly care about versioning.
- Separation of Run-Time and Saved State
All that grotty string-mashing code in TR actually served a purpose - it stripped implementation-dependent run-time data out of the saved file. This meant that we were free to change the implementation of lots of structures (for example, switching a list of strings to a dictionary of string:int, or vice versa) without persistence changes, as long as the data could be represented in the same way. In an automatic system, these implementation details are indistinguishable from the core abstract state of an object. (There's a small sketch of what I mean after this list.)
Oddly enough, although it was brought about through forced duplication of effort (manual specification of the persistence format), this actually reduced the amount of upgrade code necessary. Because the persistence format was very abstract, you never had to write an upgrader to go from one implementation of behavior for the same persistent data to another. While changes to the persistent data can be frequent, changes to the implementation are almost by definition more frequent.
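To make that separation concrete, here is a rough Python sketch of the same idea - the Room class and its attributes are invented for illustration, not actual TR or Quotient code:

class Room:
    def __init__(self):
        # run-time representation: exit name -> Room object
        self.exits = {}

    def __getstate__(self):
        # persist only the abstract state: the exit names
        return {"exitNames": sorted(self.exits.keys())}

    def __setstate__(self, state):
        # rebuild the run-time structure; the actual Room objects get
        # reconnected by whatever loads the whole map
        self.exits = dict.fromkeys(state["exitNames"])

The point being that self.exits could later become a list, or some fancier index, without the saved format (and therefore the upgrade code) changing at all.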
I think that's a relatively complete summary of the advantages of manual persistence work, although I'd love to hear comments upon it.
How can we replicate these advantages?
I think that an important first step is to find some simple, lightweight way to completely express the necessary information for the persistence of an object. Even if we still use Pickle to store this data, an explicit specification of what it should look like would be a good mental exercise for us. It would also provide a means to test upgrading and to represent the format of old versions without having to copy their entire implementation. In short, the "schema" that Twisted World provided and New PB is about to provide again. The outstanding work in my sandbox on indexed and referenced properties in Atop is a step in that direction.
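To give a flavor of what I mean by an explicit specification, here is a purely hypothetical sketch - none of these names correspond to anything in Atop, Twisted World, or New PB:

class MeetingSchema:
    # Declares what a persisted Meeting is allowed to contain.
    version = 2
    fields = {
        "topic": str,
        "people": list,   # references by version 2, not raw IDs
    }

def checkAgainstSchema(obj, schema):
    # Complain loudly if an object headed for the database doesn't
    # match its declared schema, instead of trusting pickle blindly.
    for name, kind in schema.fields.items():
        value = getattr(obj, name)
        if not isinstance(value, kind):
            raise TypeError("%s.%s should be %s, got %r"
                            % (schema.__name__, name, kind.__name__, value))

Something this small would already be enough to drive both storage and upgrade testing.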
We also need some critical data to be stored in the database that can't be exported, re-loaded, or otherwise passed through some external crutch versioning mechanism. We need to care about our core data's dependability as we move forward.
We also need to decide what kinds of data we really care about. There are certain aspects of the Quotient application which are developing so fast that it's impossible to represent them effectively in persistent form, and trying would really be a waste of time. Such objects should probably never be persisted in the first place - just provide an 'experimental' flag or some such, indicating that the object should never touch an on-disk database. When this becomes burdensome, the programmer can unset it and manually start performing updates.
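A flag like that could be as blunt as the following sketch (hypothetical: 'experimental' and 'store' don't correspond to any real Quotient attribute or API):

def storeIfAllowed(db, obj):
    # Refuse to let fast-changing, experimental objects anywhere
    # near the on-disk database.
    if getattr(obj, "experimental", False):
        raise RuntimeError("%r is experimental and must not be persisted"
                           % (obj,))
    db.store(obj)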
There's more to say, but this has already been quite a ramble! I hope that you've enjoyed reading it. Please keep in mind that I would like feedback and more ideas about how to perform the transitions I've suggested over a relatively large existing system. (Quickly, of course. And cheaply. ^_^)
Until next time.
This is hastily transcribed almost verbatim from one side of an IM
conversation, so apologies if I left something important out.
I just read Jason Asbahr's blog. He has a great link to a speech by Michael Crichton about the state of the environmentalist movement.
In keeping with his religious theme, I think his argument backs up my general feeling that the end of all life is really the only solution to the eternal questions that humanity struggles with.
It would be nice if Crichton put together a website to factually verify some of those claims he's making - such as the claim that DDT is harmless and its ban has killed millions worldwide - but the specific examples weren't terribly interesting to me. I'm willing to accept the possibility of their truth on the basis of numerous other examples that I've heard better factual evidence for.
Not a big environmentalist myself, I still find the main thrust of his speech really unnerving, though I don't think he approached it directly enough - how do you mobilize the public to act upon vague, potentially unpleasant, and most of all, changing information?
How do you sensationalize, propagandize, and thereby positively politicize rational thought? It seems like you'd have to in a mass-media democratic society if you want it to be allowed to have any impact. Ayn Rand tried pretty hard, and look where that got her. I've talked about what's wrong with objectivism before. Not all of it is due to the problem of mass-media exposure and misinterpretation, but the end result is the same - one set of assumptions gets substituted for another, and you repurpose a few key buzzwords - "reason", "truth", and "love" to name a few - to mean something different from what they meant in your previous ontological framework, and you're done. (Extra credit: find out what these three words mean in your favorite belief system, and how their meanings differ from the dictionary definitions.)
I've started to think lately that you really can't, that it's impossible, and that people have to be inculcated with a critical spirit from birth or they will never really acquire one. They may sway from the Church of Christ to the Church of Rand to the anti-church of LaVey, but they'll continue holding extreme positions with re-heated fallacious arguments, without questioning any of their beliefs until they have a crisis, at which point they question ALL of them.
More optimistically, it doesn't really matter that you can't discuss subtle and complicated issues in a broad public forum. Humanity staggers on anyway. The important thing is to prevent the criminalization of rational discourse and inquiry. That conflict is substantially easier to dumb down and sensationalize. At any given time, in the USA anyway, there is inevitably an unpopular thinker whose punishment is far enough out of proportion with his (thought)crime that the public can express some indignant outrage about allowing him to think whatever he's thinking (usually, with the implicit subtext of "as long as I don't have to listen to it").
The wonderful thing about rational thought is that it actually allows you to manipulate the natural world more effectively, which makes it less important that the dominant forces in society care about your methods and beliefs - as long as they're not actively working against you.
And that's why I'm a member of the Electronic Frontier Foundation. :-)
So, I'm awake again at 3AM. I've been sleeping a lot lately to catch up on
my earlier stress-induced sleep deprivation, so it's not entirely unexpected,
but the real reason I'm awake has me a little worried about my continued
presence in this building.
At about 1:30AM every other day, my apartment fills with a really peculiar smell and it doesn't go away for hours. I originally thought it was cigarette smoke, but it doesn't reliably give me the same headache that nicotine does. Likewise, I don't think it's pot. Also, I can't imagine that my neighbor's smoking habits would be quite so deterministic.
While it doesn't make me sick or give me horrible headaches, it's definitely distracting enough to keep me awake. Although I was awake today when it started, it can wake me up. I'm going to mention it to the landlord tomorrow, but I'm curious: has anybody out there experienced strange, periodic smells in an apartment building? What might it be? Should I be seriously concerned for my health? My working hypothesis, taking into account the periodicity and my difficulty in identifying it, is that the building has a poorly insulated incinerator.