Encoding.

Wednesday August 03, 2005
Mr. Bicking wants to change his default encoding. Since there is some buzz about this, I figure it's a good opportunity to answer something that has already emerged as a FAQ during Axiom's short life: its treatment of strings.

Axiom does not have strings. It has two attribute types that look suspiciously like strings: text() and bytes().

However, 'text()' does not convert a Python str to text for you, and never, ever will. This is not an accident, and it is not because guessing at this sort of automatic conversion is hard. Lots of packages do it, including Python itself: str(unicode(x)) does do something, after all.
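
To make that concrete, here is a minimal sketch of how the two attribute types behave, assuming Axiom's Item and attribute API roughly as it stands today; the Note item and its fields are invented purely for illustration:

    from axiom.store import Store
    from axiom.item import Item
    from axiom.attributes import text, bytes

    class Note(Item):
        typeName = 'note'
        schemaVersion = 1
        title = text()      # wants a unicode object
        rawData = bytes()   # wants a str; stored verbatim, never decoded

    s = Store()
    networkInput = 'caf\xc3\xa9'    # some bytes that arrived off the wire
    Note(store=s,
         title=networkInput.decode('utf-8'),  # you decide how to decode,
         rawData=networkInput)                # or you keep the bytes as bytes
    # Note(store=s, title=networkInput) will not silently work; text() does
    # not guess an encoding for you.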

However, in my mind, that is an unfortunate coincidence, and I avoid using the default encoding anywhere I can. Let me respond directly to part of his post, point-by-point:
"Are people claiming that there should be no default encoding?"
That's what I would say, yes. The default encoding is a process-global variable that sets you up for a lot of confusion, since encoding is always context- and data-type-dependent. Occasionally I get lazy and use the default encoding, since I know that regardless of what it is, it probably has ASCII as a subset (and I know that my data is something like an email address or a URL, which functionally must be ASCII), but this is not generally good behavior; the snippet after these responses shows the kind of surprise it invites.
"As long as we have non-Unicode strings, I find the argument less than convincing, and I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken; and the current status quo is very deeply broken."
I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.
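
To put the first of those points in concrete terms, here is roughly how the process-global default shows up from Python 2; the exact results depend on whatever sys.getdefaultencoding() happens to return on a given host, which is precisely the problem:

    import sys
    sys.getdefaultencoding()     # almost always 'ascii', but it is a global
                                 # setting, not a property of your data

    u'caf\xe9' == 'caf\xc3\xa9'  # the str is silently decoded with the default
                                 # encoding before comparing; under 'ascii' this
                                 # is simply False (plus a warning), while under
                                 # a UTF-8 default it is True. Same program,
                                 # different answers on different machines.

    u'caf\xe9' + 'caf\xc3\xa9'   # raises UnicodeDecodeError under 'ascii', and
                                 # "works" if someone changed the default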

The fact that English text, the sort that programmers commonly use to converse with, code with, identify network endpoints with, and test program input with, looks very similar in its decoded and encoded forms is an unfortunate and misleading phenomenon. It means that programs are often very confused about what kind of data they are processing but appear to work anyway, and make serious errors only when presented with input whose encoded and decoded forms differ.
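
That similarity is easy to demonstrate: pure ASCII happens to round-trip through most encodings unchanged, which is exactly what lets the confusion survive testing:

    'hello'.decode('utf-8') == u'hello'   # True: the encoded and decoded forms
                                          # of pure ASCII are indistinguishable
    u'caf\xe9'.encode('utf-8')            # 'caf\xc3\xa9': no longer the same bytes
    u'caf\xe9'.encode('latin-1')          # 'caf\xe9': nor even the same bytes as
                                          # the UTF-8 version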

SQLite unfortunately succumbs to this malady as well, although at least they tried. Right now we are using its built-in COLLATE NOCASE for case-insensitive indexing and searches. According to the docs, this is defined as "The same as binary, except the 26 upper case characters used by the English language are folded to their lower case equivalents before the comparison is performed." Needless to say, despite SQLite's pervasive use of Unicode throughout the database, that is not how you case-insensitively compare Unicode strings.
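
A small sketch of the problem, using the sqlite3 bindings (the table and names are invented for illustration):

    import sqlite3
    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE people (name TEXT COLLATE NOCASE)")
    conn.execute("INSERT INTO people VALUES (?)", (u'\xc9tienne',))   # Étienne

    # NOCASE only folds A-Z, so the accented capital is not matched:
    conn.execute("SELECT * FROM people WHERE name = ?",
                 (u'\xe9tienne',)).fetchall()    # returns []

    # Python's Unicode-aware case folding, by contrast, knows about it:
    u'\xc9tienne'.lower() == u'\xe9tienne'       # True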

Using the default encoding and Unicode together only worsens this. Now the program appears to work, and may in fact be correct in the face of non-English, or even non-human-language, input, but it breaks randomly and mangles data when moved to a different host environment with a different locally-specified default encoding. "Everybody use UTF-8" isn't a solution either; even setting aside the huge accidental diversity in this detail of configuration, in Asian countries especially the system's default encoding implies certain things to a lot of different software. It would be extremely unwise to force your encoding choice upon everyone else.
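
The mangling in question is easy to reproduce: the same bytes read back under a different assumed encoding produce plausible-looking garbage rather than an error:

    data = u'caf\xe9'.encode('utf-8')   # written out where UTF-8 was assumed
    data.decode('latin-1')              # read back elsewhere as u'caf\xc3\xa9',
                                        # which displays as 'cafÃ©' instead of
                                        # 'café', and nothing ever fails loudly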

I don't think that Ian has an entirely unreasonable position; the only reason I know anything about Unicode at all is that I was exposed to a lot of internationalization projects during my brief stint in the game industry, mostly on projects that had taken multilingual features into account from the start.

The situation that I describe, where text and bytes are clearly delineated and never the twain shall meet, is a fantasy-land sort of scenario. Real-world software still handles multilingual text very badly, and encoding and decoding properly within your software does no good, and is a lot of extra work, when you're interfacing with a system that only deals with code points 65-90 (that is, the upper-case letters A through Z). Forcing people to deal with this detail is often viewed as arrogance on the part of the system designer, and in many scenarios the effort is wasted because the systems you're interfacing with are already broken.

Still, I believe that forcing programmers to consider encoding issues whenever they have to store some text is a very useful exercise, since otherwise - this is important - foreign language users may be completely unable to use your application. What is to you simply a question-mark or box where you expected to see an "é" is, to billions of users the world over, a page full of binary puke where they expected to see a letter they just typed. Even pure English users can benefit: consider the difference between and . Finally, if you are integrating with a crappy, non-Unicode-aware system (or a system that handles Unicode but extremely poorly) you can explicitly note the nature of its disease and fail before passing it data outside the range (usually ASCII) that you know it can handle.
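
That last tactic is just an explicit encode at the boundary. A minimal sketch (the function name is made up):

    def toLegacySystem(text):
        """
        Encode text destined for a system known to choke on anything but
        ASCII, failing loudly here rather than corrupting data over there.
        """
        return text.encode('ascii')

    toLegacySystem(u'hello')      # 'hello'
    toLegacySystem(u'caf\xe9')    # raises UnicodeEncodeError immediately,
                                  # at the point where the mistake is visible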

Consider the other things that data - regular Python 'str' objects - might represent. Image data, for example. If there were a culture of programmers that expected image data to always be unpacked 32-bit RGBA byte sequences, it would have been very difficult to get the Internet off the ground; image formats like PNG and JPEG have to be decoded before they are useful image data, and it is very difficult to set a 'system default image format' and have them all magically decoded and encoded properly. If we did have sys.defaultimageformat, or sys.defaultaudiocodec, we'd end up with an upsetting amount of multi-color snow and shrieking noise on our computers.

That is why Axiom does not, will not, and cannot automatically decode and encode your strings for you. Your string could be a chunk of oscilloscope data, and there is no Unicode encoding for that. If you need to store it, store it unencoded, as data, and load it and interpret it later. There are good reasons why people use different audio and image codecs; there are perhaps less good, but nevertheless valid, reasons why people use different Unicode codecs.

To avoid a similarly common kind of error, I don't think that Axiom is going to provide a 'float' type before we've implemented a 'money' type - more on why money needs to be encoded and decoded just like Unicode in my next installment :)