Tiny Flag Day

Thursday August 11, 2005

If anyone out there is using Q2Q, the divmod.net server is getting upgraded from the code in the Quotient repository to the code in the Vertex project, in the new Divmod repository mentioned in earlier posts. This will make it slightly incompatible with the divmod.com server for the next few weeks, since upgrading that server is a more significant undertaking.

It will still mostly work, but you'll see a lot of tracebacks and none of the NAT-traversal code is compatible any more.

This will probably happen 2 or 3 more times before the protocol is totally stable. (There are backwards compatibility mechanisms implemented, but at the small scale of deployment we're at now, they're hardly worth using.)

Encoding.

Wednesday August 03, 2005

Mr. Bicking wants to change his default encoding. Since there is some buzz about this I figure it would be a good opportunity to answer something that has already emerged as a FAQ during Axiom's short life, about its treatment of strings.

Axiom does not have strings. It has 2 attribute types that look suspiciously like strings: text() and bytes().

However, 'text()' does not convert a Python str to text for you, and never, ever will. This is not an accident, and it is not because guessing at this sort of automatic conversion is hard. Lots of packages do it, including Python - str(unicode(x)) does do something, after all.

However, in my mind, that is an unfortunate coincidence, and I avoid using the default encoding anywhere I can. Let me respond directly to part of his post, point-by-point:

Are people claiming that there should be no default encoding?

That's what I would say, yes. The default encoding is a process-global variable that sets you up for a lot of confusion, since encoding is always context and data-type dependent. Occasionally I get lazy and use the default encoding, since I know that regardless of what it is it probably has ASCII as a subset (and I know that my data is something like an email address or a URL which functionally must be ASCII), but this is not generally good behavior.

As long as we have non-Unicode strings, I find the argument less than convincing, and I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken; and the current status quo is very deeply broken.

I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.
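A minimal sketch of that separation, written for Python 3 (where str plays the role this post assigns to unicode, and bytes the role of str). The Text, Bytes and Message names are hypothetical and are not Axiom's implementation; the only point is that each attribute refuses the other kind of value instead of guessing at a conversion.

```python
class _Strict:
    """Descriptor that accepts exactly one Python type and never converts."""

    kind = object  # overridden by subclasses

    def __set_name__(self, owner, name):
        self._slot = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self._slot)

    def __set__(self, obj, value):
        if not isinstance(value, self.kind):
            raise TypeError(
                "%s requires %s, got %s"
                % (self._slot[1:], self.kind.__name__, type(value).__name__)
            )
        setattr(obj, self._slot, value)


class Text(_Strict):
    kind = str    # decoded text only


class Bytes(_Strict):
    kind = bytes  # undecoded data only


class Message:
    """Hypothetical item with one text attribute and one data attribute."""

    subject = Text()
    payload = Bytes()

    def __init__(self, subject, payload):
        self.subject = subject
        self.payload = payload
```

Message("caf\u00e9", b"\x00\x01") is accepted, while Message(b"cafe", b"") raises TypeError; the caller has to decode first, and therefore has to say which encoding the bytes are in.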

The fact that English text, the sort that programmers commonly use to converse with, code with, identify network endpoints with and test program input with, looks very similar in its decoded and encoded forms, is an unfortunate and misleading phenomenon. It means that programs are often very confused about what kind of data they are processing but appear to work anyway, and make serious errors only when presented with input which differs in encoded and decoded form.
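To make that concrete, here is a small Python 3 sketch of a program that is confused about its data: it writes text with one codec and reads it back with another. The misread helper is hypothetical, named only for this illustration.

```python
def misread(text, written_as, read_as):
    """Encode text with one codec, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # ASCII-only text survives the mistake, so the bug stays hidden.
    print(misread("hello", "utf-8", "latin-1") == "hello")      # True
    # Text outside ASCII does not: one character comes back as two.
    print(misread("\u00e9", "utf-8", "latin-1") == "\u00e9")    # False
```

The program "works" for every test string a typical English-speaking programmer would think to type, which is exactly the misleading part.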

SQLite unfortunately succumbs to this malady as well, although at least they tried. Right now we are using its default COLLATE NOCASE for case-insensitive indexing and searches. This is defined according to the docs as "The same as binary, except the 26 upper case characters used by the English language are folded to their lower case equivalents before the comparison is performed." Needless to say, despite SQLite's pervasive use of Unicode throughout the database, that is not how you case-insensitively compare Unicode strings.
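The documented behaviour can be checked with the sqlite3 module from the Python 3 standard library. This is a sketch that assumes a stock SQLite build with the built-in NOCASE collation; nocase_equal is a hypothetical helper, not part of Axiom.

```python
import sqlite3


def nocase_equal(a, b):
    """Ask SQLite whether two text values compare equal under NOCASE."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
        return bool(row[0])
    finally:
        conn.close()


if __name__ == "__main__":
    print(nocase_equal("E", "e"))             # the 26 English letters are folded
    print(nocase_equal("\u00c9", "\u00e9"))   # E-acute vs e-acute: not folded
    # Python's own Unicode-aware folding treats the same pair as equal.
    print("\u00c9".casefold() == "\u00e9".casefold())
```

The first comparison is true and the second is false, while Python's str.casefold() considers the accented pair equal.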

Using the default encoding and Unicode only worsens this. Now the program appears to work, and may in fact be correct in the face of non-English, or even non-human-language input, but breaks randomly and mangles data when moved to a different host environment with a different locally-specified default encoding. "Everybody use UTF-8" isn't a solution either. Even setting aside the huge accidental diversity in this detail of configuration, the system's default encoding implies certain things to a lot of different software, in Asian countries especially. It would be extremely unwise to force your encoding choice upon everyone else.
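A sketch of what "moved to a different host" does to the same stored bytes, in Python 3. The codec names stand in for whatever default three different machines might have configured; decode_on_host is a hypothetical helper for this illustration.

```python
def decode_on_host(data, host_default):
    """Decode bytes the way a host with the given default encoding would."""
    return data.decode(host_default)


# One byte, written by a program on a Latin-1 host that meant e-acute.
stored = "\u00e9".encode("latin-1")

if __name__ == "__main__":
    print(decode_on_host(stored, "latin-1") == "\u00e9")  # True: the original host
    print(decode_on_host(stored, "cp1251") == "\u00e9")   # False: a Cyrillic letter
    try:
        decode_on_host(stored, "utf-8")
    except UnicodeDecodeError as error:
        print("utf-8 host cannot read it:", error.reason)
```

Nothing in the bytes themselves says which reading is right; only an encoding recorded alongside the data, rather than inherited from the process, can settle it.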

I don't think that Ian has an entirely unreasonable position; the only reason I know anything about Unicode at all is that I was exposed to a lot of internationalization projects during my brief stint in the game industry, and mostly on projects that had taken multilingual features into account from the start.

The situation that I describe, where text and bytes are clearly delineated and never the twain shall meet, is a fantasy-land sort of scenario. Real-world software still handles multilingual text very badly, and encoding and decoding properly within your software does no good and is a lot of extra work when you're interfacing with a system that only deals with code points 65-90. Forcing people to deal with this detail is often viewed as arrogance on the part of the system designer, and in many scenarios the effort is wasted because the systems you're interfacing with are already broken.

Still, I believe that forcing programmers to consider encoding issues whenever they have to store some text is a very useful exercise, since otherwise - this is important - foreign language users may be completely unable to use your application. What is to you simply a question-mark or box where you expected to see an "é" is, to billions of users the world over, a page full of binary puke where they expected to see a letter they just typed. Even pure English users can benefit: consider the difference between ☺ and ☣. Finally, if you are integrating with a crappy, non-Unicode-aware system (or a system that handles Unicode but extremely poorly) you can explicitly note the nature of its disease and fail before passing it data outside the range (usually ASCII) that you know it can handle.
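The last point, failing before handing a limited system something it cannot handle, might look like this in Python 3. The function name is hypothetical, and the ASCII limit is an assumption about the downstream system, stated in one place.

```python
def for_ascii_only_system(text):
    """Return bytes safe to hand to an ASCII-only system, or raise."""
    if not isinstance(text, str):
        raise TypeError("expected decoded text, got %s" % type(text).__name__)
    # Raises UnicodeEncodeError for anything outside ASCII, so bad data is
    # rejected here instead of being mangled by the downstream system.
    return text.encode("ascii")
```

for_ascii_only_system("user@example.com") returns the encoded bytes; passing "caf\u00e9" raises UnicodeEncodeError at the boundary, where the error message can still name the offending character.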

Consider the other things that data - regular python 'str' objects - might represent. Image data, for example. If there were a culture of programmers that expected image data to always be unpacked 32-bit RGBA byte sequences, it would be very difficult to get the Internet off the ground; image formats like PNG and JPEG have to be decoded before they are useful image data, and it is very difficult to set a 'system default image format' and have them all magically decoded and encoded properly. If we did have sys.defaultimageformat, or sys.defaultaudiocodec, we'd end up with an upsetting amount of multi-color snow and shrieking noise on our computers.

That is why Axiom does not, will not, and cannot automatically decode and encode your strings for you. Your string could be a chunk of oscilloscope data, and there is no Unicode encoding for that. If you need to store it, store it unencoded, as data, and load it and interpret it later. There are good reasons why people use different audio and image codecs; there are perhaps less good, but nevertheless valid reasons why people use different Unicode codecs.

To avoid a similar common kind of error, I don't think that Axiom is going to provide a 'float' type before we've implemented a 'money' type - more on why money needs to be encoded and decoded just like Unicode in my next installment :)

Seventh System Effect

Friday July 29, 2005

It's somewhat official now, so I guess I have to announce it: Divmod is doing a massive refactoring of our application, starting with the database. Work began a week ago in my Quotient sandbox, and has been continuing around the clock since then.

Progress is now visible in a more public location:

the unified Divmod SVN repository at http://divmod.org/svn/Divmod/trunk/

I would have said "rewrite" rather than "refactor", but of course Everybody knows that's stupid. Plus, we are mostly migrating our old code base and cleaning it up along the way; the only component getting a complete rewrite is Atop - the rewrite of which is so fundamentally better that we came up with a better name: "Axiom".

To clarify the naming situation: the new Divmod repository has Axiom and a new version of Mantissa (Python package name 'xmantissa', to avoid package name conflicts during the transition period). Eventually it will contain a package for Quotient, Sigma, and several other things. During the transition the Python package names will all start with 'x' but I will still refer to them by their project names, since the older projects will go away and the module names will eventually change.

Initially I was very concerned when we began the experiment that led to this code revolution. It began, as bad ideas are wont to do, as a joke. I mentioned to JP (who had already been rewriting some things) some client work that we are doing, and my difficulty in choosing an appropriate persistence solution given some of the maintenance issues we'd been having with Atop.

Exasperated after an hour of discussion, JP said, "Why don't you just use SQLite?" Now, I'd looked into it some time ago and (the ostensible punchline of the joke) it was garbage. However, that was SQLite version 1, not SQLite3, which has a different API and several critical features that made it considerably more appropriate to our tasks.

A few hours later I had a working prototype of maybe half of the functionality from our existing database. I was suitably impressed; SQLite was giving us all the benefits of SQL (ad-hoc queries and indexing, relational operations on data, a fast query engine) without any of the drawbacks (difficult to customize, unportable server, fragile and time-consuming deployment). I realized that we had got something radical on multiple levels.

In the past I've been very conservative about telling people when and whether to use Divmod's open-source released software. With this new system, I say: jump in. Use it. Only a week into implementation it might be a bit premature to launch a production system with it just yet, but indications are very positive that we will be able to do just that within the month.

The code is shorter, clearer to read, easier to maintain, and the database is Pickle Free℠.

(Okay, it's not really a service mark, but it should be. Pickle is the winner for causing problems for us.)

We are building from our experience with 5 previous persistence systems, 3 previous plug-in frameworks, and 4 previous authentication databases. A curious side-effect of all that experience - and the effect that the title is referring to - is that certain development methodology concepts become irrelevant. Most notably, "YAGNI", from XP, is no longer of any use: we know exactly how much extensibility we need. At every point in implementing this system we have known whether to fuse a component together because we'd built unnecessary additional complexity into previous systems, and where to use a plug-in architecture because we'd needed to inject ugly code into the middle of a monolithic routine.

As a result, where our architecture was heavily monolithic before, now it is almost entirely composed of plugins. It is so plugin-happy, in fact, that there is a database with Service plugins in it, which activate when the database is started from twistd; it contains its own configuration, including port-numbers, so nothing need live in a text configuration file. The web application system is built around this as well; so there is a plugin lookup for invoking raw nevow IResource implementors without sessions (for example, for XML APIs), then IResource implementors which do require a session, then IResource implementors which are specific to a particular user, and finally Fragment instances which plug into a generic hierarchical navigation system. At each level there is a distinct and clear place to put new plugin code, and large portions of it are self-similar. For example, you can install the hierarchical navigation both onto a public site and onto a user's private application pages, since the "web server" implementation is an IResource which can be installed either onto a toplevel database, or any user's personal database.
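To illustrate only the "configuration lives in the database" idea, here is a rough Python 3 sketch using the standard sqlite3 module. The table layout and function names are made up for this example; they are not Axiom's or Mantissa's schema or API, and nothing here actually starts a server.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database that carries its own service configuration."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS service ("
        "name TEXT PRIMARY KEY, port INTEGER NOT NULL)"
    )
    return conn


def install_service(conn, name, port):
    """Record a service, with its port number, inside the database itself."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO service (name, port) VALUES (?, ?)",
            (name, port),
        )


def services_to_start(conn):
    """List (name, port) pairs that a launcher would start for this database."""
    return conn.execute("SELECT name, port FROM service ORDER BY name").fetchall()
```

A launcher given nothing but the database file could call services_to_start() and bind each listed port, which is the sense in which no separate text configuration file is needed.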

Oh, and did I mention - LivePage support built right in?

The net result of this is that you can build themeable, multi-user web applications with the code that's in SVN right now. The example isn't visually appealing, but the code is nice, and it's composed from a stack of plugins.

I'm very excited about the possibilities of what we'll be transforming our system and our application into within the next few weeks. I'd like to invite everyone who has been interested in Divmod's open source work in the past to have a look at the new repository, and consider coming to #divmod on irc.freenode.net to look for something to hack on. Considering the higher-level and easier-to-understand nature of the implementation of Axiom vs. Atop, I would also love it if we could find some people to help us document it right from the start.

So - anybody out there looking for an open source project's website to maintain?

Here we go...

Thursday July 28, 2005

Nothing interesting yet, but I figure you all should know about it.

http://divmod.com/users/glyph/blog/

Escape!

Tuesday July 19, 2005

The last release was in May 2004.

Go get it before I change my mind.