The Sororicide Antipattern

Don’t murder your parents or your siblings to get their attributes.

“Composition is better than inheritance.” This is a true statement. “Inheritance is bad.” Also true. I’m a well-known compositional extremist. There’s a great talk you can watch if I haven’t talked your ear off about it already.

Which is why I was extremely surprised in a recent conversation when my interlocutor said that while inheritance might be bad, composition is worse. Once I understood what they meant by “composition”, I was even more surprised to find that I agreed with this assertion.

Although inheritance is bad, it’s very important to understand why. In a high-level language like Python, with first-class runtime datatypes (i.e.: user defined classes that are objects), the computational difference between what we call “composition” and what we call “inheritance” is a matter of where we put a pointer: is it on a type or on an instance? The important distinction has to do with human factors.

First, a brief parable about real-life inheritance.


You find yourself in conversation with an indolent heiress-in-waiting. She complains of her boredom whiling away the time until the dowager countess finally leaves her her fortune.

“Inheritance is bad”, you opine. “It’s better to make your own way in life”.

“By George, you’re right!” she exclaims. You weren’t expecting such an enthusiastic reversal.

“Well,” you sputter, “glad to see you are turning over a new leaf”.

She crosses the room to open a sturdy mahogany armoire, and draws forth a belt holstering a pistol and a menacing-looking sabre.

“Auntie has only the dwindling remnants of a legacy fortune. The real money has always been with my sister’s manufacturing concern. Why passively wait for Auntie to die, when I can murder my dear sister now, and take what is rightfully mine!”

Cinching the belt around her waist, she strides from the room animated and full of purpose, no longer indolent or in-waiting, but you feel less than satisfied with your advice.

It is, after all, important to understand what the problem with inheritance is.


The primary reason inheritance is bad is confusion between namespaces.

The most important role of code organization (division of code into files, modules, packages, subroutines, data structures, etc) is division of responsibility. In other words, Conway’s Law isn’t just an unfortunate accident of budgeting, but a fundamental property of software design.

For example, if we have a function called multiply(a, b) - its presence in our codebase suggests that if someone were to want to multiply two numbers together, it is multiply’s responsibility to know how to do so. If there’s a problem with multiplication, it’s the maintainers of multiply who need to go fix it.

And, with this responsibility comes authority over a specific scope within the code. So if we were to look at an implementation of multiply:

def multiply(a, b):
    product = a * b
    return product

The maintainers of multiply get to decide what product means in the context of their function. It’s possible, in Python, for some other function to reach into multiply with frame objects and mangle the meaning of product between its assignment and return, but it’s generally understood that it’s none of your business what product is, and if you touch it, all bets are off about the correctness of multiply. More importantly, if the maintainers of multiply wanted to bind other names, or change around existing names, like so, in a subsequent version:

def multiply(a, b):
    factor1 = a
    factor2 = b
    result = a * b
    return result

It is the maintainer of multiply’s job, not the caller’s, to make those decisions.

The same programmer may, at different times, be both a caller and a maintainer of multiply. However, they have to know which hat they’re wearing at any given time, so that they can know which stuff they’re still responsible for when they hand over multiply to be maintained by a different team.

It’s important to be able to forget about the internals of the local variables in the functions you call. Otherwise, abstractions give us no power: if you have to know the internals of everything you’re using, you can never build much beyond what’s already there, because you’ll be spending all your time trying to understand all the layers below it.

Classes complicate this process of forgetting somewhat. Properties of class instances “stick out”, and are visible to the callers. This can be powerful — and can be a great way to represent shared data structures — but this is exactly why we have the ._ convention in Python: if something starts with an underscore, and it’s not in a namespace you own, you shouldn’t mess with it. So: other._foo is not for you to touch, unless you’re maintaining type(other). self._foo is where you should put your own private state.

So if we have a class like this:

class A(object):
    def __init__(self):
        self._note = "a note"

we all know that A()._note is off-limits.

But then what happens here?

class B(A):
    def __init__(self):
        super().__init__()
        self._note = "private state for B()"

B()._note is also off limits for everyone but B, except... as it turns out, B doesn’t really own the namespace of self here, so it’s clashing with what A wants _note to mean. Even if, right now, we were to change it to _note2, the maintainer of A could, in any future release of A, add a new _note2 variable which conflicts with something B is using. A’s maintainers (rightfully) think they own self, B’s maintainers (reasonably) think that they do. This could continue all the way until we get to _note7, at which point it would explode violently.
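To make the clash concrete, here’s a minimal sketch (the describe method is my own hypothetical addition, not something in the example above) showing A’s own behavior silently breaking once B reuses the name:

class A(object):
    def __init__(self):
        self._note = "a note"
    def describe(self):
        # A's maintainer believes _note belongs to A alone.
        return "A remembers: " + self._note

class B(A):
    def __init__(self):
        super().__init__()
        self._note = "private state for B()"

print(B().describe())   # "A remembers: private state for B()" -- oops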


So that’s why inheritance is bad. It’s a bad way for two layers of a system to communicate because it leaves each layer nowhere to put its internal state that the other doesn’t need to know about. So what could be worse?

Let’s say we’ve convinced our junior programmer who wrote A that inheritance is a bad interface, and they should instead use the panacea that cures all inherited ills, composition. Great! Let’s just write a B that composes in an A in a nice clean way, instead of doing any gross inheritance:

class Bprime(object):
    def __init__(self, a):
        for var in dir(a):
            setattr(self, var, getattr(a, var))

Uh oh. Looks like composition is worse than inheritance.


Let’s enumerate some of the issues with this “solution” to the problem of inheritance:

  • How do we know what attributes Bprime has?
  • How do we even know what type a is?
  • How is anyone ever going to grep for relevant methods in this code and have them come up in the right place?

We briefly reclaimed self for Bprime by removing the inheritance from A, but what Bprime does in __init__ to replace it is much worse. At least with normal, “vertical” inheritance, IDEs and code inspection tools can have some idea where your parents are and what methods they declare. We have to look aside to know what’s there, but at least it’s clear from the code’s structure where exactly we have to look aside to.

When faced with a class like Bprime though, what does one do? It’s just shredding apart some apparently totally unrelated object, and there’s nearly no way for tooling to inspect this code to the point that it knows where self.<something> comes from in a method defined on Bprime.

The goal of replacing inheritance with composition is to make it clear and easy to understand what code owns each attribute on self. Sometimes that clarity comes at the expense of a few extra keystrokes; an __init__ that copies over a few specific attributes, or a method that does nothing but forward a message, like def something(self): return self.other.something().
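In code, that explicit style might look something like this sketch (the names here are hypothetical):

class Inner(object):
    def something(self):
        return "inner behavior"

class Outer(object):
    def __init__(self):
        self._inner = Inner()   # composed: Inner's state stays in Inner
        self._note = "private state for Outer()"
    def something(self):
        # A forwarding method: nothing but a message pass-through.
        return self._inner.something()

Every attribute on self is owned, unambiguously, by Outer, and tools can see exactly where something comes from.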

Automatic composition is just lateral inheritance. Magically auto-proxying all methods1, or auto-copying all attributes, saves a few keystrokes at the time some new code is created at the expense of hours of debugging when it is being maintained. If readability counts, we should never privilege the writer over the reader.


  1. It is left as an exercise for the reader why proxyForInterface is still a reasonably okay idea even in the face of this criticism.2 

  2. Although ironically it probably shouldn’t use inheritance as its interface. 

Python Packaging Is Good Now

setup.py is your friend. It’s real sorry about what happened last time.

Okay folks. Time’s up. It’s too late to say that Python’s packaging ecosystem is terrible any more. I’m calling it.

Python packaging is not bad any more. If you’re a developer, and you’re trying to create or consume Python libraries, it can be a tractable, even pleasant experience.

I need to say this, because for a long time, Python’s packaging toolchain was … problematic. It isn’t any more, but a lot of people still seem to think that it is, so it’s time to set the record straight.

If you’re not familiar with the history it went something like this:

The Dawn

Python first shipped in an era when adding a dependency meant a veritable Odyssey into cyberspace. First, you’d wait until nobody in your whole family was using the phone line. Then you’d dial your ISP. Once you’d finished fighting your SLIP or PPP client, you’d ask a netnews group if anyone knew of a good gopher site to find a library that could solve your problem. Once you were done with that task, you’d sign off the Internet for the night, and wait about 48 hours to see if anyone responded. If you were lucky enough to get a reply, you’d set up a download at the end of your night’s web-surfing.

pip search it wasn’t.

For the time, Python’s approach to dependency-handling was incredibly forward-looking. The import statement, and the pluggable module import system, made it easy to get dependencies from wherever made sense.

In Python 2.01, Distutils was introduced. This let Python developers describe their collections of modules abstractly, and added tool support for producing redistributable collections of modules and packages. Again, this was tremendously forward-looking, if somewhat primitive; there was very little to compare it to at the time.

Fast forwarding to 2004; setuptools was created to address some of the increasingly-common tasks that open source software maintainers were facing with distributing their modules over the internet. In 2005, it added easy_install, in order to provide a tool to automate resolving dependencies and downloading them into the right locations.

The Dark Age

Unfortunately, in addition to providing basic utilities for expressing dependencies, setuptools also dragged in a tremendous amount of complexity. Its author felt that import should do something slightly different than what it does, so installing setuptools changed it. The main difference between normal import and setuptools import was that it facilitated having multiple different versions of the same library in the same program at the same time. It turns out that that’s a dumb idea, but in fairness, it wasn’t entirely clear at the time, and it is certainly useful (and necessary!) to be able to have multiple versions of a library installed onto a computer at the same time.

In addition to these idiosyncratic departures from standard Python semantics, setuptools suffered from being unmaintained. It became a critical part of the Python ecosystem at the same time as the author was moving on to other projects entirely outside of programming. No-one could agree on who the new maintainers should be for a long period of time. The project was forked, and many operating systems’ packaging toolchains calcified around a buggy, ancient version.

From 2008 to 2012 or so, Python packaging was a total mess. It was painful to use. It was not clear which libraries or tools to use, which ones were worth investing in or learning. Doing things the simple way was too tedious, and doing things the automated way involved lots of poorly-documented workarounds and inscrutable failure modes.

This is to say nothing of the fact that there were critical security flaws in various parts of this toolchain. There was no practical way to package and upload Python packages in such a way that users didn’t need a full compiler toolchain for their platform.

To make matters worse for the popular perception of Python’s packaging prowess2, at this same time, newer languages and environments were getting a lot of buzz, ones that had packaging built in at the very beginning and had a much better binary distribution story. These environments learned lessons from the screw-ups of Python and Perl, and really got a lot of things right from the start.

Finally, the Python Package Index, the site which hosts all the open source packages uploaded by the Python community, was basically a proof-of-concept that went live way too early, had almost no operational resources, and was offline all the dang time.

Things were looking pretty bad for Python.


Intermission

Here is where we get to the point of this post - this is where popular opinion about Python packaging is stuck. Outdated information from this period abounds. Blog posts complaining about problems score high in web searches. Those who used Python during this time, but have now moved on to some other language, frequently scoff and dismiss Python as impossible to package, its packaging ecosystem as broken, PyPI as down all the time, and so on. Worst of all, bad advice for workarounds which are no longer necessary is still easy to find, which causes users to pre-emptively break their environments where they really don’t need to.


From The Ashes

In the midst of all this brokenness, there were some who were heroically, quietly, slowly fixing the mess, one gnarly bug-report at a time. pip was started, and its various maintainers fixed much of easy_install’s overcomplexity and many of its flaws. Donald Stufft stepped in on both pip and PyPI, improving the availability of the systems they depended upon and fixing some pretty serious vulnerabilities in the tool itself. Daniel Holth wrote a PEP for the wheel format, which allows for binary redistribution of libraries. In other words, it lets authors of packages which need a C compiler to build give their users a way to not have one.

In 2013, setuptools and distribute un-forked, providing a path forward for operating system vendors to start updating their installations and allowing users to use something modern.

Python Core started distributing the ensurepip module along with both Python 2.7 and 3.3, allowing any user with a recent Python installed to quickly bootstrap into a sensible Python development environment with a one-liner.

A New Renaissance

I won’t give you a full run-down of the state of the packaging art. There’s already a website for that. I will, however, give you a précis of how much easier it is to get started nowadays. Today, if you want to get a sensible, up-to-date Python development environment, without administrative privileges, all you have to do is:

$ python -m ensurepip --user
$ python -m pip install --user --upgrade pip
$ python -m pip install --user --upgrade virtualenv

Then, for each project you want to do, make a new virtualenv:

$ python -m virtualenv lets-go
$ . ./lets-go/bin/activate
(lets-go) $ _

From here on out, the world is your oyster; you can pip install to your heart’s content, and you probably won’t even need to compile any C for most packages. These instructions don’t depend on Python version, either: as long as it’s up-to-date, the same steps work on Python 2, Python 3, PyPy and even Jython. In fact, often the ensurepip step isn’t even necessary since pip comes preinstalled. Running it if it’s unnecessary is harmless, even!

Other, more advanced packaging operations are much simpler than they used to be, too.

  • Need a C compiler? OS vendors have been working with the open source community to make this easier across the board:
    $ apt install build-essential python-dev # ubuntu
    $ xcode-select --install # macOS
    $ dnf install @development-tools python-devel # fedora
    C:\> REM windows
    C:\> start https://www.microsoft.com/en-us/download/details.aspx?id=44266
    

Okay that last one’s not as obvious as it ought to be but they did at least make it freely available!

  • Want to upload some stuff to PyPI? This should do it for almost any project:

    $ pip install twine
    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*
    
  • Want to build wheels for the wild and wooly world of Linux? There’s an app4 for that.

Importantly, PyPI will almost certainly be online. Not only that, but a new, revamped site will be “launching” any day now3.
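Incidentally, the setup.py that those twine steps assume can be quite small; here’s a minimal sketch, where the name and metadata are, of course, placeholders:

# setup.py -- a minimal sketch; every value below is a placeholder
from setuptools import setup, find_packages

setup(
    name="my-package",
    version="0.1.0",
    description="An example package",
    packages=find_packages(),
    install_requires=["attrs"],
)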

Again, this isn’t a comprehensive resource; I just want to give you an idea of what’s possible. But, as a deeply experienced Python expert, I used to swear at these tools six times a day for years; the most serious Python packaging issue I’ve had this year to date was fixed by cleaning up my git repo to delete a cache file.

Work Still To Do

While the current situation is good, it’s still not great.

Here are just a few of my desiderata:

  • We still need better and more universally agreed-upon tooling for end-user deployments.
  • Pip should have a GUI frontend so that users can write Python stuff without learning as much command-line arcana.
  • There should be tools that help you write and update a setup.py. Or a setup.python.json or something, so you don’t actually need to write code just to ship some metadata.
  • The error messages that you get when you try to build something that needs a C compiler and it doesn’t work should be clearer and more actionable for users who don’t already know what they mean.
  • PyPI should automatically build wheels for all platforms by default when you upload sdists; this is a huge project, of course, but it would be super awesome default behavior.

I could go on. There are lots of ways that Python packaging could be better.

The Bottom Line

The real takeaway here though, is that although it’s still not perfect, other languages are no longer doing appreciably better. Go is still working through a number of different options regarding dependency management and vendoring, and, like Python extensions that require C dependencies, CGo is sometimes necessary and always a problem. Node has had its own well-publicized problems with their dependency management culture and package manager. Hackage is cool and all but everything takes a literal geological epoch to compile.

As always, I’m sure none of this applies to Rust and Cargo is basically perfect, but that doesn’t matter, because nobody reading this is actually using Rust.

My point is not that packaging in any of these languages is particularly bad. They’re all actually doing pretty well, especially compared to the state of the general programming ecosystem a few years ago; many of them are making regular progress towards user-facing improvements.

My point is that any commentary suggesting they’re meaningfully better than Python at this point is probably just out of date. Working with Python packaging is more or less fine right now. It could be better, but lots of people are working on improving it, and the structural problems that prevented those improvements from being adopted by the community in a timely manner have almost all been addressed.

Go! Make some virtualenvs! Hack some setup.pys! If it’s been a while and your last experience was really miserable, I promise, it’s better now.


Am I wrong? Did I screw up a detail of your favorite language? Did I forget to mention the one language environment that has a completely perfect, flawless packaging story? Do you feel the need to just yell at a stranger on the Internet about picayune details? Feel free to get in touch!


  1. released in October, 2000 

  2. say that five times fast. 

  3. although I’m not sure what it means to “launch” when the site is online, and running against the production data-store, and you can use it for pretty much everything... 

  4. “app” meaning of course “docker container” 

The One Python Library Everyone Needs

Use attrs. Use it. Use it for everything.

attrs is still an excellent library, and it still has much to recommend it over the standard library for many applications. However, as of this update in 2023, much new code — perhaps even most of it — should just be using dataclasses. If you want to see where my thinking has moved on to in the intervening 7 years, you may want to check out this more recent post. To make a long story short though, I was right, and attrs set us on the road to a major upgrade to best practices for class definition in Python.

Do you write programs in Python? You should be using attrs.

Why, you ask? Don’t ask. Just use it.

Okay, fine. Let me back up.

I love Python; it’s been my primary programming language for 10+ years and despite a number of interesting developments in the interim I have no plans to switch to anything else.

But Python is not without its problems. In some cases it encourages you to do the wrong thing. Particularly, there is a deeply unfortunate proliferation of class inheritance and the God-object anti-pattern in many libraries.

One cause for this might be that Python is a highly accessible language, so less experienced programmers make mistakes that they then have to live with forever.

But I think that perhaps a more significant reason is the fact that Python sometimes punishes you for trying to do the right thing.

The “right thing” in the context of object design is to make lots of small, self-contained classes that do one thing and do it well. For example, if you notice your object is starting to accrue a lot of private methods, perhaps you should be making those “public”1 methods of a private attribute. But if it’s tedious to do that, you probably won’t bother.
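As a sketch of what that refactoring looks like (all the names here are hypothetical), the private helpers become ordinary public methods on a separate class held in a private attribute:

# Before: private helper methods accrete on one class.
class Report(object):
    def render(self, data):
        return self._frame(self._summarize(data))
    def _summarize(self, data):
        return sorted(data)
    def _frame(self, lines):
        return "\n".join(str(line) for line in lines)

# After: the helpers are "public" methods of a private attribute.
class Formatter(object):
    def summarize(self, data):
        return sorted(data)
    def frame(self, lines):
        return "\n".join(str(line) for line in lines)

class Report(object):
    def __init__(self):
        self._formatter = Formatter()
    def render(self, data):
        f = self._formatter
        return f.frame(f.summarize(data))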

Another place you probably should be defining an object is when you have a bag of related data that needs its relationships, invariants, and behavior explained. Python makes it soooo easy to just define a tuple or a list. The first couple of times you type host, port = ... instead of address = ... it doesn’t seem like a big deal, but then soon enough you’re typing [(family, socktype, proto, canonname, sockaddr)] = ... everywhere and your life is filled with regret. That is, if you’re lucky. If you’re not lucky, you’re just maintaining code that does something like values[0][7][4][HOSTNAME]["canonical"] and your life is filled with garden-variety pain rather than the more complex and nuanced emotion of regret.


This raises the question: is it tedious to make a class in Python? Let’s look at a simple data structure: a 3-dimensional cartesian coordinate. It starts off simply enough:

class Point3D(object):

So far so good. We’ve got a 3 dimensional point. What next?

class Point3D(object):
    def __init__(self, x, y, z):

Well, that’s a bit unfortunate. I just want a holder for a little bit of data, and I’ve already had to override a special method from the Python runtime with an internal naming convention? Not too bad, I suppose; all programming is weird symbols after a fashion.

At least I see my attribute names in there, that makes sense.

class Point3D(object):
    def __init__(self, x, y, z):
        self.x

I already said I wanted an x, but now I have to assign it as an attribute...

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x

... to x? Uh, obviously ...

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

... and now I have to do that once for every attribute, so this actually scales poorly? I have to type every attribute name 3 times?!?

Oh well. At least I’m done now.

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):

Wait what do you mean I’m not done.

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))

Oh come on. So I have to type every attribute name 5 times, if I want to be able to see what the heck this thing is when I’m debugging, which a tuple would have given me for free?!?!?

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)

7 times?!?!?!?

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)
    def __lt__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) < (other.x, other.y, other.z)

9 times?!?!?!?!?

from functools import total_ordering
@total_ordering
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)
    def __lt__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) < (other.x, other.y, other.z)

Okay, whew - 2 more lines of code isn’t great, but now at least we don’t have to define all the other comparison methods. But now we’re done, right?

from unittest import TestCase
class Point3DTests(TestCase):

You know what? I’m done. 20 lines of code so far and we don’t even have a class that does anything; the hard part of this problem was supposed to be the quaternion solver, not “make a data structure which can be printed and compared”. Piles of undocumented garbage tuples, lists, and dictionaries it is; defining proper data structures well is way too hard in Python.2


namedtuple to the (not really) rescue

The standard library’s answer to this conundrum is namedtuple. While a valiant first draft (it bears many similarities to my own somewhat embarrassing and antiquated entry in this genre), namedtuple is unfortunately unsalvageable. It exports a huge amount of undesirable public functionality which would be a huge compatibility nightmare to maintain, and it doesn’t address half the problems that one runs into. A full enumeration of its shortcomings would be tedious, but a few of the highlights:

  • Its fields are accessible as numbered indexes whether you want them to be or not. Among other things, this means you can’t have private attributes, because they’re exposed via the apparently public __getitem__ interface.
  • It compares equal to a raw tuple of the same values, so it’s easy to get into bizarre type confusion, especially if you’re trying to use it to migrate away from using tuples and lists.
  • It’s a tuple, so it’s always immutable. Sort of.

As to that last point, either you can use it like this:

Point3D = namedtuple('Point3D', ['x', 'y', 'z'])

in which case it doesn’t look like a type in your code; simple syntax-analysis tools without special cases won’t recognize it as one. You can’t give it any other behaviors this way, since there’s nowhere to put a method. Not to mention the fact that you had to type the class’s name twice.

Alternately you can use inheritance and do this:

class Point3D(namedtuple('_Point3DBase', 'x y z'.split())):
    pass

This gives you a place you can put methods, and a docstring, and generally have it look like a class, which it is... but in return you now have a weird internal name (which, by the way, is what shows up in the repr, not the class’s actual name). However, you’ve also silently allowed arbitrary new, mutable attributes to be set on instances, a strange side-effect of adding the class declaration; that is, unless you add __slots__ = 'x y z'.split() to the class body, and then we’re just back to typing every attribute name twice.
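A quick demonstration of that surprising mutability:

>>> from collections import namedtuple
>>> class Point3D(namedtuple('_Point3DBase', 'x y z'.split())):
...     pass
...
>>> p = Point3D(1, 2, 3)
>>> p.w = 4    # silently allowed: the subclass grew a __dict__
>>> p.w
4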

And this doesn’t even mention the fact that science has proven that you shouldn’t use inheritance.

So, namedtuple can be an improvement if it’s all you’ve got, but only in some cases, and it has its own weird baggage.


Enter The attr

So here’s where my favorite mandatory Python library comes in.

Let’s re-examine the problem above. How do I make Point3D with attrs?

import attr
@attr.s

Since this isn’t built into the language, we do have to have 2 lines of boilerplate to get us started: the import and the decorator saying we’re about to use it.

import attr
@attr.s
class Point3D(object):

Look, no inheritance! By using a class decorator, Point3D remains a Plain Old Python Class (albeit with some helpful double-underscore methods tacked on, as we’ll see momentarily).

import attr
@attr.s
class Point3D(object):
    x = attr.ib()

It has an attribute called x.

import attr
@attr.s
class Point3D(object):
    x = attr.ib()
    y = attr.ib()
    z = attr.ib()

And one called y and one called z and we’re done.

We’re done? Wait. What about a nice string representation?

>>> Point3D(1, 2, 3)
Point3D(x=1, y=2, z=3)

Comparison?

>>> Point3D(1, 2, 3) == Point3D(1, 2, 3)
True
>>> Point3D(3, 2, 1) == Point3D(1, 2, 3)
False
>>> Point3D(3, 2, 3) > Point3D(1, 2, 3)
True

Okay sure but what if I want to extract the data defined in explicit attributes in a format appropriate for JSON serialization?

>>> attr.asdict(Point3D(1, 2, 3))
{'y': 2, 'x': 1, 'z': 3}

Maybe that last one was a little on the nose. But nevertheless, it’s one of many things that becomes easier because attrs lets you declare the fields on your class, along with lots of potentially interesting metadata about them, and then get that metadata back out.

>>> import pprint
>>> pprint.pprint(attr.fields(Point3D))
(Attribute(name='x', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None),
 Attribute(name='y', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None),
 Attribute(name='z', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None))

I am not going to dive into every interesting feature of attrs here; you can read the documentation for that. Plus, it’s well-maintained, so new goodies show up every so often and I might miss something important. But attrs does a few key things that, once you have them, you realize that Python was sorely missing before:

  1. It lets you define types concisely, as opposed to the normally quite verbose manual def __init__.... Types without typing.
  2. It lets you say what you mean directly with a declaration rather than expressing it in a roundabout imperative recipe. Instead of “I have a type, it’s called MyType, it has a constructor, in the constructor I assign the property ‘A’ to the parameter ‘A’ (and so on)”, you say “I have a type, it’s called MyType, it has an attribute called a”, and behavior is derived from that fact, rather than having to later guess about the fact by reverse engineering it from behavior (for example, running dir on an instance, or looking at self.__class__.__dict__).
  3. It provides useful default behavior, as opposed to Python’s sometimes-useful but often-backwards defaults.
  4. It adds a place for you to put a more rigorous implementation later, while starting out simple.

Let’s explore that last point.

Progressive Enhancement

While I’m not going to talk about every feature, I’d be remiss if I didn’t mention a few of them. As you can see from those mile-long repr()s for Attribute above, there are a number of interesting ones.

For example: you can validate attributes when they are passed into an @attr.s-ified class. Our Point3D, for example, should probably contain numbers. For simplicity’s sake, we could say that that means instances of float, like so:

import attr
from attr.validators import instance_of
@attr.s
class Point3D(object):
    x = attr.ib(validator=instance_of(float))
    y = attr.ib(validator=instance_of(float))
    z = attr.ib(validator=instance_of(float))
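With that in place, constructing a Point3D from the wrong types fails immediately, at the call site, instead of corrupting state to be discovered later. Roughly (the exact message varies between attrs versions):

>>> Point3D(1.0, 2.0, 3.0)
Point3D(x=1.0, y=2.0, z=3.0)
>>> Point3D(1, 2, 3)
Traceback (most recent call last):
  ...
TypeError: 'x' must be <class 'float'> (got 1 that is a <class 'int'>).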

The fact that we were using attrs means we have a place to put this extra validation: we can just add type information to each attribute as we need it. Some of these facilities let us avoid other common mistakes. For example, this is a popular “spot the bug” Python interview question:

class Bag:
    def __init__(self, contents=[]):
        self._contents = contents
    def add(self, something):
        self._contents.append(something)
    def get(self):
        return self._contents[:]

contents inadvertently becomes a global variable here, making all Bag objects not provided with a different list share the same list. Fixing it, of course, becomes this:

class Bag:
    def __init__(self, contents=None):
        if contents is None:
            contents = []
        self._contents = contents

adding two extra lines of code. With attrs this instead becomes:

@attr.s
class Bag:
    _contents = attr.ib(default=attr.Factory(list))
    def add(self, something):
        self._contents.append(something)
    def get(self):
        return self._contents[:]

attrs provides several other features that give you opportunities to make your classes both more convenient and more correct. Another great example? If you want to be strict about extraneous attributes on your objects (or more memory-efficient on CPython), you can just pass slots=True at the class level - e.g. @attr.s(slots=True) - to automatically turn your existing attrs declarations into a matching __slots__ attribute. All of these handy features allow you to make better and more powerful use of your attr.ib() declarations.
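For instance, here’s the sort of mistake slots=True catches for you (a small sketch):

import attr

@attr.s(slots=True)
class Point3D(object):
    x = attr.ib()
    y = attr.ib()
    z = attr.ib()

p = Point3D(1, 2, 3)
p.w = 4   # AttributeError: no __dict__ here, only the generated __slots__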


The Python Of The Future

Some people are excited about eventually being able to program in Python 3 everywhere. What I’m looking forward to is being able to program in Python-with-attrs everywhere. It exerts a subtle, but positive, design influence in all the codebases I’ve seen it used in.

Give it a try: you may find yourself surprised at places where you’ll now use a tidily explained class, where previously you might have used a sparsely-documented tuple, list, or a dict, and endured the occasional confusion from co-maintainers. Now that it’s so easy to have structured types that clearly point in the direction of their purpose (in their __repr__, in their __doc__, or even just in the names of their attributes), you might find you’ll use a lot more of them. Your code will be better for it; I know mine has been.


  1. Scare quotes here because the attributes aren’t meaningfully exposed to the caller, they’re just named publicly. This pattern, getting rid of private methods entirely and having only private attributes, probably deserves its own post... 

  2. And we hadn’t even gotten to the really exciting stuff yet: type validation on construction, default mutable values... 

Python Option Types

Ask not for whom the NULL tolls; it tolls for thee.

Updated 2020-07-19: While many of the conclusions in this article remain valid and interesting, a few of them — particularly those related to raising run-time exceptions rather than using Nones to signal errors — have been subtly changed by the advent of mypy's ability to force the caller to check for None automatically. Effectively, when this was written, Python did not have real Optional[] types, and now, thanks to mypy, it does, so you should probably use them!

NULL has, rightly, been called a “billion dollar mistake”. If that is so, then None is a hundred million dollar mistake, at least.

Forgetting, for the moment, about the numerous pitfalls of a C-style NULL, Python’s None has a very significant problem of its own. Of course, the problem is not None itself; the fact that the default return value of a function is None is (in my humble opinion, at least) fine; it’s just a marker that means “nothing to see here, move along”. The problem arises from values which might be None, or might be some other, useful thing.

APIs present values in a number of ways. A value might be exposed as the return value of a method, an attribute of an object, or an entry in a collection data structure such as a list or dictionary. If a value presented in an API might be None or it might be something else, every single client of that API needs to check the type of the value that it’s calling before doing anything.

Since it is rude to use a “simple suite” (a line of code like if x: y() with no newline after the colon), that means the minimum number of lines of code for interacting with your API is now 4: one to retrieve the value in the first place, one for the if statement, one for the then clause, and one for the else clause.
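That is, every call site ends up looking something like this sketch, where get_value and the fallback are hypothetical stand-ins:

value = some_api.get_value()    # might be None, might be the real thing
if value is not None:
    value.method()
else:
    handle_missing_value()      # whatever "nothing there" means locally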

Worse than the code-bloat required here, the default behavior, if you forget to do this checking, is that it works sometimes (like when you’re testing it), and that other times (like when you put it into production), you get an unhelpful exception like this:

Traceback (most recent call last):
  File "<your code>", line 1, in <module>
    value.method()
AttributeError: 'NoneType' object has no attribute 'method'

Of course NoneType doesn’t have an attribute called method, but why is value a NoneType? Science may never know.

In languages with static type declarations, there’s a concept of an Option type. Simply put, in a language with option types, the API declares its result value as “maybe something, maybe null”, and then if the caller fails to account for the “null” case, it is a compile-time error.

Python doesn’t have this kind of ahead-of-time checking though, so what are we to do? In order of my own personal preference, here are three strategies for getting rid of maybe-None-maybe-not data types in your Python code.

1: Just Say No

Some APIs - especially those that require building deeply complex nested trees of data structures - use None as a way to provide a convenient mechanism for leaving a space for a future value to be filled out. Using such an API sometimes looks like this:

value = MyValue()
value.foo = 1
value.bar = 2
value.baz = 3
value.do_something()

In this case, the way to get rid of None is simple: just stop doing that. Instead of .foo having an implicit type of “int or None”, just make it always be int, like this:

value = MyValue(
    foo=1, bar=2, baz=3
)
value.do_something()

Or, if do_something is the only method you’re going to call with this data structure, opt for the even simpler:

do_something_with_value(foo=1, bar=2, baz=3)

If MyValue has dozens of fields that need to be initialized by different subsystems, so that you actually want to pass around a partially-initialized value object, consider the Builder pattern, which would make this code look like the following:

builder = MyValueBuilder()
foo = builder.with_foo(1)
bar = foo.with_bar(2)
baz = bar.with_baz(3)
value = baz.build()
value.do_something()

This acknowledges that the partially-constructed MyValueBuilder is a different type than MyValue, and, crucially, if you look at its API documentation, it does not misleadingly appear to support the do_something operation which in fact requires foo, bar, and baz all be initialized.

Wherever possible, just require values of the appropriate type be passed in in the first place, and don’t ever default to None.

2: Make The Library Handle The Different States, Not The Caller

Sometimes, None is a placeholder indicating an implicit state machine, where the states are “initialized” and “not initialized”.

For example, imagine an RPC Client which may or may not be connected. You might have an API you have to use like this:

def ask_question(self):
    message = {"question":
               "what is the air speed velocity of an unladen swallow?"}
    if self.rpc_client.connection is None:
        self.outbound_messages.append(message)
        self.rpc_client.when_connected(self.flush_outbound_messages)
    else:
        self.rpc_client.send_message(message)

By leaking through the connection attribute, rpc_client is providing an incomplete abstraction and foisting too much work onto its callers. Instead, callers should just have to do this:

def ask_question(self):
    message = {"question":
               "what is the air speed velocity of an unladen swallow?"}
    self.rpc_client.send_message(message)

Internally, rpc_client still has to maintain a private _connection attribute which may or may not be present, but by hiding this implementation detail, we centralize the complexity associated with managing that state in one place, rather than polluting every caller with it, which makes for much better API design.

Hopefully you agree that this is a good idea, but this is more what to do rather than how to do it, so here are two strategies for achieving “make the library do it”:

2a: Use Placeholder Implementations

Rather than using None as the _connection attribute, rpc_client’s internal implementation could instead use a placeholder which provides the same interface. Let’s say the expected interface of _connection in this case is just a send method that takes some bytes. We could initialize it like this:

class NotConnectedConnection(object):
    def __init__(self):
        self._buffer = b""
    def send(self, data):
        self._buffer += data

This allows the code within rpc_client itself to blindly call self._connection.send whether it’s actually connected already or not; upon connection, it could un-buffer that data onto the ready connection.
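That un-buffering step might look something like this sketch inside rpc_client (the method names here are assumptions for illustration, not a real API):

class RPCClient(object):
    def __init__(self):
        # Start with the placeholder; there is never a None to check.
        self._connection = NotConnectedConnection()
    def send_message(self, data):
        self._connection.send(data)
    def _connection_established(self, connection):
        # Internal callback: replay whatever the placeholder buffered,
        # then swap in the real connection for all future sends.
        buffered = self._connection._buffer
        self._connection = connection
        connection.send(buffered)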

2b: Use an Explicit State Machine

Sometimes, you actually have quite a few states you need to manage, and this starts looking like an ugly proliferation of lots of weird little flags; various values which may be True or False, or None or not-None.

In those cases it’s best to be clear about the fact that there are multiple states, and to enumerate the valid transitions between them. Then, expose a method which always has the same signature and return type.

Using a state machine library like ClusterHQ’s “machinist” or my Automat can allow you to automate the process of checking all the states.

Automat, in particular, goes to great lengths to make your objects look like plain old Python objects. Providing an input is just calling a method on your object: my_state_machine.provide_an_input() and receiving an output is just examining its return value. So it’s possible to refactor your code away from having to check for None by using this library.

For example, the connection-handling example above could be dealt with in the RPC client using Automat like so:

from automat import MethodicalMachine

class RPCClient(object):
    _machine = MethodicalMachine()
    @_machine.state()
    def _connected(self):
        "We have a connection."
    @_machine.state()
    def _not_connected(self):
        "We have no connection."
    @_machine.input()
    def send_message(self, message):
        "Send a message now if we're connected, or later if not."
    @_machine.output()
    def _send_message_now(self, message):
        "Send a message immediately."
        # ...
    @_machine.output()
    def _send_message_later(self, message):
        "Enqueue a message for when we are connected."
        # ...
    @_machine.output()
    def _send_queued_messages(self, connection):
        "send all messages enqueued by _send_message_later"
        # ...
    @_machine.input()
    def connection_established(self, connection):
        "A connection was established."
    _connected.upon(send_message, enter=_connected, outputs=[_send_message_now])
    _not_connected.upon(send_message, enter=_not_connected,
                        outputs=[_send_message_later])
    _not_connected.upon(connection_established, enter=_connected,
                        outputs=[_send_queued_messages])

3: Make The Caller Account For All Cases With Callbacks

The absolute lowest-level way to deal with multiple possible states is to, instead of exposing an attribute that the caller has to retrieve and test, expose a function which takes multiple callbacks, one for each case. This way you can provide clear and immediate error feedback if the caller forgets to handle a case - meaning that they forgot to pass a callback. This is only really suitable if you can’t think of any other way to handle it, but it does at least provide a very clear expectation of the interface.

To re-use our connection-handling logic above, you might do something like this:

def try_to_send(self, my_message):
    def connection_present(connection):
        connection.send_message(my_message)
    def connection_not_present():
        self.enqueue_message_for_later(my_message)
    self.rpc_client.with_connection(connection_present,
                                    connection_not_present)

Notice that while this is slightly awkward, it has the nice property that the connection_present callback receives the value that it needs, whereas the connection_not_present callback doesn’t receive anything, because there’s nothing for it to receive.

The Zeroth Strategy

Of course, the best strategy, if you can get away with it, may be the non-strategy: refuse the temptation to provide a maybe-None, and just raise an exception when you are in a state you can’t handle. If you intentionally raise a specific, meaningful exception type with a good error message, it will be a lot more pleasant to use your API than if return codes that the caller has to check for pop up all over the place, None or otherwise.

The Principle Of The Thing

The underlying principle here is the same: when designing an API, always provide a consistent interface to your callers. An API is 1 to N: you have 1 API implementation to N callers. As N→∞, it becomes more important that any task that needs performing frequently is performed on the “1” side of the equation, and that you don’t force callers to repeat the same error checking over and over again. None is just one form of this, but it is a particularly egregious form.

Software You Can Use

The Python community needs a tool for distributing software to end users.

Python has a big problem. While it’s easy and fun to produce software in Python, it’s hard to produce software that people - especially laypeople who are not professional software developers - can use.

In the modern software ecosystem, there are a few places that you might want to use a program:

  1. On a server, by loading a web page.
  2. On a web page, by running it in your browser.
  3. As a command-line tool, on Mac,
  4. ... Windows,
  5. ... or Linux.
  6. As a desktop application, on Mac,
  7. ... Windows,
  8. ... or Linux.
  9. As a mobile application, on iOS,
  10. ... or Android,
  11. ... Or Windows Phone. (Just kidding.)

Out of these 10 scenarios, Python currently has half of a good deployment story for one of them: running an application on a server, as a back-end. This is a serious problem for the future of Python and one we need to figure out how to face as a community.

Even the “good” deployment story is somewhat convoluted, as you need to know about at least some Linux distribution’s package manager, and native dependencies, and Pip, and virtualenv, and wheels, and probably docker too.

If you want to run a Python application in your browser, your best bet right now is probably Brython. However, Brython is still in its infancy, and basic facilities like preparing your code for production to achieve acceptable start-up performance, and an explanation of how to use libraries, are missing. With big chunks like that missing it’s hard to advocate for Brython’s use in production.

Moving on to scenario 3, this may be one of the best-supported configurations; py2app actually works surprisingly well. But it’s still incredibly confusing for new users. Which native objects (dylibs and frameworks and data files) to bundle are specified as options to py2app itself, and not something that can be handled automatically by libraries. So if, for example, PyGame depends on SDL.framework or libSDL.dylib, you as an application developer need to understand how to figure that out and specify that list.

On Windows, the situation gets worse. To work as a Windows executable, you need to bundle the Python interpreter, but unlike in an OS X application, you can’t just copy in a whole directory. So you end up needing a tool like PyInstaller or cx_Freeze. PyInstaller hasn’t seen a release in the last 2 years; it doesn’t support Python 3. It also doesn’t work: if I try to package the most basic Twisted program possible, with pyinstaller 2.1 I get “no module named zope.interface”, and if I try to package it with pyinstaller trunk, I get “no module named itertools”. cx_Freeze similarly can’t figure out how to include zope.interface no matter what I tell it to do. This problem isn’t specific to libraries that I use; most Python projects will run into it.

py2exe, on the other hand, only supports Python 3.3+, and so is unusable with a lot of important Python libraries.

For a GUI application for Linux, you might have some small hope of building a distro-specific package that users could install, but that would involve using a distro-specific toolchain that had nothing to do with Python, and you need to repeat that work for Debian, Ubuntu, Fedora, and whatever other distros you want to support.

All of these same tools are what I would use to build a stand-alone command-line executable for Windows, Mac, or Linux, and they all break down in similar ways.

In the mobile space, there is absolutely zero tooling included with the language to even get started there. It might be possible to use Kivy to get a build onto iOS or Android. I haven’t had an opportunity to test those. But they still require you to install Homebrew, and a C compiler, and a whole bunch of fairly specific platform tooling to get started, and there are lots of different ways that can go wrong.

So how do other languages stack up?

  1. In JavaScript, if you want an application in the browser, it’s as simple as ... writing some JavaScript.
  2. In JavaScript, if you want a desktop application, you can just grab Electron and be up and running in a few minutes.
  3. In JavaScript, if you want a command-line UNIX tool, you can grab nar and build something self-contained almost immediately.
  4. And of course, in Go, there’s no way to get anything but a fully functional self-contained executable at the end of the build process. Everything is fully redistributable by default.

As a community, Python needs a clear, well-documented, well-supported, modern way to produce build artifacts that are easy to create and easy to share. We need to have this for all popular platforms and the browser. This is a tricky problem: it requires knowledge of lots of fiddly build details.

This wheel has been re-invented, poorly, a dozen or so times. My list above was just a subset. In addition to py2app, py2exe, pyinstaller, cx_Freeze, and the Kivy bundling tools, we’ve also got terrarium, bbfreeze (which is unmaintained), pipsi, pex, and probably some others I don’t know about.

In order to compete with JavaScript and Go for developers’ attention, Python must be able to become an implementation detail and disappear when the user is running the program. This means that some of these tools (terrarium, pipsi, pex) are not suitable for this purpose because they are envelopes for deployment into an environment with an installed Python interpreter.

All of the tools I’m aware of that are trying to provide fully self-contained execution, though (pyinstaller, cx_freeze, bb-freeze, py2app) are poorly designed because they value optimized distribution size over actually working by default. Rather than reading setuptools metadata and discovering the full set of dependencies which have been declared to be required, all of these tools use weird AST-parsing heuristics and buggy path-traversal hacks to try to construct a guess as to the minimal set of files that might be required, then require the poor application developer to fill in the gaps. This means none of them work with namespace packages, none of them work properly with plugin systems or runtime configuration systems; generally, they don’t work correctly with late binding, which is one of Python’s greatest strengths. Of course, a full Python interpreter with the whole standard library is quite large. If we had a tool that worked well but produced very large executables, we could of course start adding an “optimized mode” to try to crunch things down for production.

And all this is to say nothing of the insanely intricate and detailed knowledge that every Python programmer eventually acquires about the C runtime semantics of their chosen platform. When a C compiler is required but missing, most tools still just emit tracebacks. When a shared library goes missing due to an OS upgrade or package removal, you just see whatever the dynamic linker thinks to report, with no explanation of how to fix it or what to do next.

The Python packaging ecosystem has made great strides in the last few years; Pip, in particular, has gone from a buggy and insecure mess to a mostly workable software delivery mechanism for developers. There are still bugs, but they are getting dealt with at a reasonable clip. However, Pip only delivers software to developers, and still requires you to have a Python runtime, a build environment, and tricky command-line tools to get things in place for development. The Python community has effectively no tools to deliver software to users.

To sum up, we need a tool which:

  1. works by default, including with “tricky” packages with namespace packages, data files, and native dependencies
  2. produces useful, actionable error messages when something is missing and the build can’t be completed (like “you don’t have a C compiler installed” or “you need to install Homebrew and then brew install openssl”)
  3. can produce both command-line and GUI executables for the mac, windows, and linux (and, for bonus points, a web browser)

The bad news is that I don’t have the time to start this project myself, and I’m not sure who does. The worse news is that every day we don’t have this, more and more people are re-writing their user-facing tools and applications in JavaScript or Go or Swift or Java, to suit their target platform, because it is honestly easier to learn an entirely new programming language and toolchain, and rewrite an entire application than to figure out how to build a self-contained executable in Python right now.

The good news, though, is that it’s a simple matter of programming, and that all the core technologies for doing all the really hard things that need to be done (pip, and zipimport and macholib, for example) already exist. It’s just a simple matter of programming: wiring together the metadata from setuptools, determining native dependencies with something like otool or ldd (or whatever the equivalent is on Windows, I still haven’t figured that out myself), pulling them all into a bundle, tacking the Python interpreter on.
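To give a flavor of how tractable those pieces are, here’s a hedged sketch of just the native-dependency step; the helper is hypothetical, shelling out to ldd on Linux (otool -L would be the macOS equivalent):

import subprocess

def shared_library_deps(binary_path):
    # Ask the platform linker which shared libraries a compiled
    # extension module links against, so a bundler could copy them in.
    output = subprocess.check_output(["ldd", binary_path]).decode("utf-8")
    deps = []
    for line in output.splitlines():
        # Typical line: "libssl.so.1.1 => /usr/lib/libssl.so.1.1 (0x...)"
        parts = line.split("=>")
        if len(parts) == 2 and parts[1].strip().startswith("/"):
            deps.append(parts[1].split()[0])
    return deps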