Pondering Python Path Programming Problems

Most Python programmers are at least vaguely aware of sys.path, PYTHONPATH, and the effect they have on importing modules.  However, there's a lot of confusion about how to use them properly, and how powerful these concepts can be if you know how to apply them.  Twisted - and in particular the plugin system - makes very nuanced use of the python path, which can sometimes make things that use them a bit hard to explain, since there isn't a well-defined common terminology or good library support for working with paths, except to the extent that they are used by importers.

This article is really about two things: the general concept of paths, and the Twisted module "twisted.python.modules", which provides some specific implementations of my ideas about the python path.

First of all, why should you care about python paths?  To put it simply, because very bizarre problems can result if you use them incorrectly.  Also, you need to know about them in order to use Twisted's plugin system effectively, and of course you want to use Twisted, right?  :)

What kind of problems?  Even very popular, well-regarded Python packages by very experienced Python programmers sometimes mess this up pretty badly.  Here's a simple example of what can go wrong with a package you probably know of, the Python Imaging Library:
>>> import Image
>>> import PIL.Image
>>> img = PIL.Image.Image()
>>> Image.__file__
'/usr/lib/python2.5/site-packages/PIL/Image.pyc'
>>> PIL.Image.__file__
'/usr/lib/python2.5/site-packages/PIL/Image.pyc'
Here we can see that you can import PIL's "Image" module as either "PIL.Image" or simply "Image".  Both these modules are loaded from the same file.  On the face of it, this is simply a convenience.  But let's dig deeper:
>>> PIL.Image == Image
False
The modules aren't the same object!  This has some nasty practical repercussions:
>>> isinstance(img, Image.Image)
False
For example, Image objects created from one of these PIL modules do not register as instances from the other, even though they're all the same code.  Worse yet, this mistake can become "sticky" if you use them along with a module like pickle, which carries the module and class name into the data:
>>> from cPickle import dumps
>>> img2 = Image.Image()
>>> dumps(img)
"(iPIL.Image\nImage\n ...
>>> dumps(img2)
"(iImage\nImage\n ...
Many Python features and packages depend on matching types.  Zope Interface, for example, will not let you use adapters for one Image type for the other, the objects will not compare equivalent even if they really are, and so on.  And none of this is a bug in the code!  Why does it happen?

PIL is a package; that is, a directory with Python source code and an "__init__.py" in it, named "PIL".  However, it also installs a ".pth" file as part of its installation.  ".pth" files are one way to add entries to your sys.path.  This particular one adds the "PIL" directory to your path, which means the same code can be loaded from two entries: as a package, from your "site-packages" directory, or as a set of top-level modules, from the "PIL" directory itself.
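The effect is not specific to PIL.  As a minimal sketch (written for Python 3, and using a made-up package "hypo_dupe" and module "hypo_shape" that exist only for this illustration), the following builds a throwaway package, puts both its parent directory and the package directory itself on sys.path, and imports the same file under both names:

```python
import importlib
import os
import pickle
import sys
import tempfile

# Build a throwaway package: <root>/hypo_dupe/{__init__.py, hypo_shape.py}.
# Both names are hypothetical, chosen only to avoid clashing with real modules.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "hypo_dupe")
os.mkdir(package_dir)
open(os.path.join(package_dir, "__init__.py"), "w").close()
with open(os.path.join(package_dir, "hypo_shape.py"), "w") as source:
    source.write("class Shape(object):\n    pass\n")

# Two path entries now reach the same file: the parent directory and the
# package directory itself (the situation a ".pth" file like PIL's creates).
sys.path[:0] = [root, package_dir]
importlib.invalidate_caches()
try:
    as_package = importlib.import_module("hypo_dupe.hypo_shape")
    as_top_level = importlib.import_module("hypo_shape")
finally:
    sys.path.remove(root)
    sys.path.remove(package_dir)

same_file = os.path.samefile(as_package.__file__, as_top_level.__file__)
same_module = as_package is as_top_level
cross_instance = isinstance(as_package.Shape(), as_top_level.Shape)

# Pickle records whichever module name the class was imported under.
pickled_via_package = pickle.dumps(as_package.Shape(), protocol=0)
pickled_via_top_level = pickle.dumps(as_top_level.Shape(), protocol=0)
```

Under those assumptions, same_file is true while same_module and cross_instance are both false, and the two pickles name different modules for what is really one class.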

This isn't to pick on PIL or the Effbot; I've seen lots of projects which have a "lib" directory with an __init__.py and change its name at installation time, or inconsistently reference subpackages with relative and absolute imports, or do any number of things which are just as bad.  I hope that I've convinced you not to do the same thing with your project, but I won't dwell on the problem here, since I have a solution handy.

Unless you already know what is going on (although I'm sure many of you reading this already do), this can be a bit confusing to figure out.  You can use twisted.python.modules to ask this question rather directly.  Here's how:
>>> from twisted.python.modules import getModule
>>> imageModule = getModule("Image")
>>> pilImageModule = getModule("PIL.Image")
>>> imageModule.pathEntry
PathEntry<FilePath('/usr/lib/python2.5/site-packages/PIL')>
>>> pilImageModule.pathEntry
PathEntry<FilePath('/usr/lib/python2.5/site-packages')>
Here we're asking twisted.python.modules to give us objects that represent metadata about two modules, without actually loading them.  The interesting attribute here is 'pathEntry', which tells us what entry on sys.path the module would be loaded from, if it's imported.
>>> import sys
>>> pilImageModule.isLoaded()
False
>>> imageModule.isLoaded()
False
>>> 'PIL.Image' in sys.modules
False
>>> 'Image' in sys.modules
False
Look, no modules!
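If you don't have Twisted handy, the standard library offers a rough analogue of "find it, but don't load it".  This is a sketch using Python 3's importlib, not Twisted's implementation, and "colorsys" is just a conveniently small standard-library module picked for the demonstration.  Note that for dotted names, importlib.util.find_spec will import the parent package in order to read its __path__.

```python
import importlib.util
import sys

# "colorsys" is an arbitrary top-level standard-library module for the demo.
name = "colorsys"
loaded_before = name in sys.modules

# find_spec() locates a top-level module on sys.path without executing it.
spec = importlib.util.find_spec(name)
loaded_after = name in sys.modules

print(spec.origin)                  # the file that would be imported
print(loaded_before, loaded_after)  # the lookup itself does not load the module
```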

Of course, if we wanted to load those modules, it's easy enough:
>>> pilImageModule.load()
<module 'PIL.Image' from '/usr/lib/python2.5/site-packages/PIL/Image.pyc'>
>>> imageModule.load()
<module 'Image' from '/usr/lib/python2.5/site-packages/PIL/Image.pyc'>
You can also get lists of modules.  For example, you can see that the list of modules in the "PIL" package is suspiciously similar to the list of top-level modules that comes from the path entry where the "Image" module was loaded:
>>> from pprint import pprint
>>> pilModule = getModule("PIL")
>>> pprint(list(pilModule.iterModules())[:5])
[PythonModule<'PIL.ArgImagePlugin'>,
 PythonModule<'PIL.BdfFontFile'>,
 PythonModule<'PIL.BmpImagePlugin'>,
 PythonModule<'PIL.BufrStubImagePlugin'>,
 PythonModule<'PIL.ContainerIO'>]
>>> pprint(list(imageModule.pathEntry.iterModules())[:5])
[PythonModule<'ArgImagePlugin'>,
 PythonModule<'BdfFontFile'>,
 PythonModule<'BmpImagePlugin'>,
 PythonModule<'BufrStubImagePlugin'>,
 PythonModule<'ContainerIO'>]
As you might imagine, the ability to list modules and load the ones that seem interesting is a great way to load plugins - and that's exactly how Twisted's plugin system is implemented.  While the plugin system itself is a topic for another post (or perhaps you could just read the documentation) the way it finds plugins is interesting.
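For comparison, here is a small sketch of the same "list first, load later" idea using only the standard library's pkgutil (Python 3).  It is an analogue of iterModules, not how Twisted does it, and the "json" package is used purely as a stand-in that everybody has installed.

```python
import json
import pkgutil

# pkgutil.iter_modules() scans the given directories and yields one entry
# per importable module it finds, without importing any of them.
submodules = sorted(info.name for info in pkgutil.iter_modules(json.__path__))
print(submodules)
```

Notice that the package itself had to be imported to get at its __path__; only the submodules are left unloaded.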

For example, let's take a look at the list of Mantissa plugin modules I have installed:
>>> xmplugins = getModule('xmantissa.plugins')
>>> pprint(list(xmplugins.iterModules()))
[PythonModule<'xmantissa.plugins.adminoff'>,
 PythonModule<'xmantissa.plugins.baseoff'>,
 PythonModule<'xmantissa.plugins.free_signup'>,
 PythonModule<'xmantissa.plugins.offerings'>]
This simple query is actually an incomplete list.  It's just the modules that come with Mantissa itself.  Python has a special little-known rule when loading modules from packages, and twisted.python.modules honors it: if there is a special variable called "__path__" in a package, it is a list of path names to load modules from.  However, twisted.python.modules doesn't load modules unless you ask it to, so it can't determine the value of that attribute.  As it so happens, twisted.plugins uses the __path__ attribute in order to allow you to keep your development installations separate, so twisted.python.modules can't determine all the places you might need to look for plugins without some help.  Let's just load that package so we can look at its __path__ attribute:
>>> xmplugins.load()
<module 'xmantissa.plugins' from '/home/glyph/Projects/Divmod/trunk/Mantissa/xmantissa/plugins/__init__.pyc'>
Now that we've loaded it, let's have a look at that list:
>>> pprint(list(xmplugins.iterModules()))
[PythonModule<'xmantissa.plugins.adminoff'>,
 PythonModule<'xmantissa.plugins.baseoff'>,
 PythonModule<'xmantissa.plugins.free_signup'>,
 PythonModule<'xmantissa.plugins.offerings'>,
 PythonModule<'xmantissa.plugins.mailoff'>,
 PythonModule<'xmantissa.plugins.radoff'>,
 PythonModule<'xmantissa.plugins.sineoff'>,
 PythonModule<'xmantissa.plugins.hyperbolaoff'>,
 PythonModule<'xmantissa.plugins.imaginaryoff'>,
 PythonModule<'xmantissa.plugins.blendix_offering'>,
 PythonModule<'xmantissa.plugins.billed_signup'>,
 PythonModule<'xmantissa.plugins.billoff'>,
 PythonModule<'xmantissa.plugins.derivoff'>]

That's my full list of Mantissa plugins, including my super secret Divmod proprietary plugins.

This list is generated because plugin packages use a feature (which was previously kind of a gross hack but will be an officially supported feature of the next version of Twisted) to set their __path__ to every directory with the same name as the plugin package which is not also a package on your python path.  In other words, if you have 2 sys.path entries, a/ and b/, and one package, x.plugins, in a/x/plugins/__init__.py with this trick in it, then if you have a file b/x/plugins/foo.py, it will be considered to contain the module "x.plugins.foo".  This requires that you do not have a file b/x/__init__.py or b/x/plugins/__init__.py.  If you do, this hack will treat the two paths the same way that Python does: as duplicate packages on your path, so the package in a/ is loaded and the package in b/ is ignored.
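To make the a/ and b/ scenario concrete, here is a simplified stand-in for the trick, written for Python 3.  It is not Twisted's actual code; the package name "hypo_x" and the few lines placed in its plugins/__init__.py are assumptions made up for this sketch.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical contents of a/hypo_x/plugins/__init__.py: extend __path__ with
# every same-named directory on sys.path that is NOT itself a package.
INIT_WITH_TRICK = '''
import os, sys
_relative = os.path.join(*__name__.split("."))
for _entry in sys.path:
    _candidate = os.path.join(_entry, _relative)
    if (os.path.isdir(_candidate)
            and not os.path.exists(os.path.join(_candidate, "__init__.py"))):
        __path__.append(_candidate)
'''

def write(path, text=""):
    """Create a file (and any missing parent directories)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)

a = tempfile.mkdtemp()  # holds the real package
b = tempfile.mkdtemp()  # holds only a plugin file, with no __init__.py files
write(os.path.join(a, "hypo_x", "__init__.py"))
write(os.path.join(a, "hypo_x", "plugins", "__init__.py"), INIT_WITH_TRICK)
write(os.path.join(b, "hypo_x", "plugins", "foo.py"), "NAME = 'foo'\n")

sys.path[:0] = [a, b]
importlib.invalidate_caches()
try:
    foo = importlib.import_module("hypo_x.plugins.foo")
finally:
    sys.path.remove(a)
    sys.path.remove(b)

loaded_from_b = os.path.samefile(
    os.path.dirname(foo.__file__), os.path.join(b, "hypo_x", "plugins"))
print(foo.__name__, loaded_from_b)
```

The module lives under b/, yet it is imported as part of the package that lives under a/.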

The distinction between packages and path entries is why all the Twisted and Divmod projects conventionally have capitalized directory names but lowercase package names.  "Twisted" is where your path entry should point; "twisted" is the python package that is loaded from that path entry.  "Twisted" should never have an __init__.py in it.  "twisted" always should.  This goes the same for "Axiom" and "axiom", "Mantissa" and (the unfortunately named) "xmantissa".  You will sometimes encounter other examples of this style of naming floating around the web.

When using Twisted and Divmod infrastructure, keeping this distinction clear is critical, because otherwise it is difficult to develop plugins independently.  You probably don't want to copy your development plugins into your Twisted installation - they're part of your source repository, after all, not ours.  However, keeping the distinction clear in your mind will avoid lots of obscure problems with duplicate classes and naming, so it's generally a good idea even if you don't like our naming conventions.

Please let me know in the comments which parts of this post you found useful, if any.  I know it's a bit rambling, and covers a number of different topics, some of which may be obvious and some of which might be inscrutable.  I've experienced quite a bit of confusion when talking to other python programmers about this stuff, but I'm not sure if it was my awkward explanation of Twisted's plugin system or some inherent issue in Python's path management.

Not Just The Faithful

As I've said before, Microsoft Windows Vista is a terrible disaster which I hope I never have to deal with in any capacity, professional or otherwise.  I suspect that it is inevitable, but I will resist it for as long as possible.

The FSF has a campaign, "BADVISTA", to educate end-users about the ways in which Vista is limiting your freedom more aggressively than any other commercial software product to date.  Unfortunately this can sometimes sound a bit ... overdramatic, even if it is pretty much all true.  For example, a prominently featured quotation:
Windows Vista includes an array of “features” that you don't want. These features will make your computer less reliable and less secure. They'll make your computer less stable and run slower. They will cause technical support problems. They may even require you to upgrade some of your peripheral hardware and existing software. And these features won't do anything useful. In fact, they're working against you.
I recently had the experience of talking to a Regular User in a consumer electronics store about his Vista "upgrade".  His "computer guy" had told him that Vista was like XP, but better.  Little did he know that the "better" would mean that the computer ran visibly slower, had reduced functionality, and required the purchase of newer, more expensive hardware.

Of course, I gave him my rant about the other reasons he shouldn't have upgraded, and the poor guy turned white as a sheet.  I don't think he's going to be purchasing any more "upgrades" from his "computer guy".

But, what does the other side have to say about this fancy new operating system?  Surely there are some worthwhile new conveniences that we are trading this freedom for?  Let's see what one ex-Microsoft employee and prominent Windows developer has to say about it:
"I've been using Vista on my home laptop since it shipped, and can say with some conviction that nobody should be using it as their primary operating system -- it simply has no redeeming merits to overcome the compatibility headaches it causes. Whenever anyone asks, my advice is to stay with Windows XP (and to purchase new systems with XP preinstalled)."
... and there you have it.  Friends don't let friends use Vista.

Pet Peeve

The word "depreciate" means "to lessen the price or value of".  This is an accounting jargon term referring to the process by which assets lose value over time.  It is pronounced "Dee Pree Shee Ate".

The word "deprecate" means "to express disapproval of" or "to urge reasons against; protest against".  This is a programming jargon term describing the process by which APIs become less favorable over time.  It is pronounced "Deh Preh Kayt".

These words, while they have similar meanings, are not synonyms.  Please do not confuse them, especially when using their jargon senses.  It sounds like nails on a chalkboard to me, having worked on accounting software.  I would like to be able to use phrases like "a deprecated depreciation function" without eliciting bewilderment.

Both Java and Python consistently use "@deprecated" and "DeprecationWarning".  English usage of these terms may be shifting, but "DepreciationWarning" or "@depreciated" will still get you runtime or compiler errors, so please stick to "deprecate" consistently while talking about code.
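For the record, here is what "a deprecated depreciation function" might look like in Python 3.  The function and its numbers are made up for the sake of the pun; the only real APIs used are warnings.warn and the built-in DeprecationWarning.

```python
import builtins
import warnings

def straight_line_depreciation(cost, salvage, years):
    """Deprecated: a hypothetical depreciation helper kept for old callers."""
    warnings.warn(
        "straight_line_depreciation() is deprecated",
        DeprecationWarning,
        stacklevel=2,
    )
    return (cost - salvage) / years

# Record warnings instead of printing them, so we can inspect the category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    yearly = straight_line_depreciation(1000.0, 100.0, 3)

print(yearly)                                    # 300.0
print(caught[0].category.__name__)               # DeprecationWarning
print(hasattr(builtins, "DepreciationWarning"))  # False: no such built-in
```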

Thank you.

Mindful Link Propagation

It occurs to me that there may still be a few Python people who read this blog but have not yet discovered JP Calderone's.

If you are such a person, he just did an excellent write-up of the practical implications of Python's rich comparison operators.  Check it out.

Functional Functions and the Python Singleton Unpattern

Have you ever written a module that looked like this?
subscribers = []

def addSubscriber(subscriber):
    subscribers.append(subscriber)

def publish(message):
    for subscriber in subscribers:
        subscriber.notify(message)
And then used it like this?
from publisher import publish

class Worker:
    def work(self):
        publish(self)
I've done this many times myself.

I used to think that this was the "right" way to implement Singletons in Python.  Other languages had static members and synchronized static accessors and factory methods; all kinds of rigamarole to achieve this effect, but Python simply had modules.

Now, however, I realize that there is no "right" way to implement Singleton in Python, because singletons are simply a bad thing to have.  As Wikipedia points out, "It is also considered an anti-pattern since it is often used as a euphemism for global variable."

The module above is brittle, and as a result, unpleasant to test and extend.

It's difficult to test because the call to "publish" cannot be indirected without monkeying around with the module's globals - generally recognized to be poor style, and prone to errors which will corrupt later, unrelated tests.

It makes code that interacts with it difficult to test, because while you can temporarily mangle global variables in the most egregious of whitebox tests, tests for code that is further away shouldn't need to know about the implementation detail of "publish".  Furthermore, code which adds subscribers to the global list will destructively change the behavior of later tests (or later code, if you try to invoke your tests in a running environment, since we all know running environments are where the interesting bugs occur).

It's difficult to extend because there is no explicit integration point with 'publish', and all instances share the same look-up.  If you want to override the behavior of "work" and send it to a different publisher, you can't call to the superclass's implementation.

Unfortunately, this probably doesn't seem particularly bad, because bad examples abound.  It's just the status quo.  Twisted's twisted.python.log module is used everywhere like this.  The standard library's sys.path, sys.stdin/out/err, warnings.warn_explicit, and probably a dozen examples I can't think of off the top of my head, all work like this.

And there's a good reason that this keeps happening.  Sometimes, you feel as though your program really does need a "global" registry for some reason; you find yourself wanting access to the same central object in a variety of different places.  It seems convenient to have it available, and it basically works.

Here's a technique for implementing that convenience, while still allowing for a clean point of integration with other code.

First, make your "global" thing be a class.
class Publisher:
    def __init__(self):
        self.subscribers = []

    def addSubscriber(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, message):
        for subscriber in self.subscribers:
            subscriber.notify(message)

thePublisher = Publisher()
Second, decide and document how "global" you mean.  Is it global to your process?  Global to a particular group of objects?  Global to a certain kind of class?  Document that, and make sure it is clear who should use the singleton you've created.  At some point in the future, someone will almost certainly come along with a surprising requirement which makes them want a different, or wrapped version of your global thing.  Documentation is always important, but it is particularly important when dealing with globals, because there's really no such thing as completely global, and it is difficult to determine from context just how global you intend for something to be.

Third, and finally, encourage using your singleton by using it as a default, rather than accessing it directly.  For example:
from publisher import thePublisher

class Worker:
    publisher = thePublisher

    def work(self):
        self.publisher.publish(self)
In this example, you now have a clean point of integration for testing and extending this code.  You can make a single Worker instance, and change its "publisher" attribute before calling "work".  Of course, if you're willing to burn a whole extra two lines of code, you can make it an optional argument to the constructor of Worker.  If you decide that in fact, your publisher isn't global at all, but system-specific, this vastly decreases the amount of code you have to change.
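As a sketch of the constructor-argument variant, here is one way it might look, along with a test that never touches the shared default.  The "Recorder" class and the "publisher" keyword name are assumptions made up for this example.

```python
class Publisher:
    def __init__(self):
        self.subscribers = []

    def addSubscriber(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, message):
        for subscriber in self.subscribers:
            subscriber.notify(message)

thePublisher = Publisher()  # the shared, process-wide default

class Worker:
    def __init__(self, publisher=None):
        # Fall back to the shared default only when nothing else is supplied.
        self.publisher = thePublisher if publisher is None else publisher

    def work(self):
        self.publisher.publish(self)

class Recorder:
    """Hypothetical test double that remembers what it was notified of."""
    def __init__(self):
        self.seen = []

    def notify(self, message):
        self.seen.append(message)

# A test can hand the worker a private publisher and leave the default alone.
private = Publisher()
recorder = Recorder()
private.addSubscriber(recorder)
worker = Worker(publisher=private)
worker.work()
print(recorder.seen == [worker], thePublisher.subscribers)
```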

Does this mean you should make everything into objects, and never use free functions?  No.  Free functions are fine, but functions in Python are for functional programming.  The hint is right there in the name.  If you are performing computations which return values, and calling other functions which do the same thing, it makes perfect sense to use free functions and not bog yourself down with useless object allocations and 'self' arguments.

Once you've started adding mutable state into the mix, you're into object territory.  If you're appending to a global list, if you're setting a global "state" variable, even if you're writing to a global file, it's time to make a class and give it some methods.