The Sororicide Antipattern

Don’t murder your parents or your siblings to get their attributes.

Composition is better than inheritance.”. This is a true statement. “Inheritance is bad.” Also true. I’m a well-known compositional extremist. There’s a great talk you can watch if I haven’t talked your ear off about it already.

Which is why I was extremely surprised in a recent conversation when my interlocutor said that while inheritance might be bad, composition is worse. Once I understood what they meant by “composition”, I was even more surprised to find that I agreed with this assertion.

Although inheritance is bad, it’s very important to understand why. In a high-level language like Python, with first-class runtime datatypes (i.e.: user defined classes that are objects), the computational difference between what we call “composition” and what we call “inheritance” is a matter of where we put a pointer: is it on a type or on an instance? The important distinction has to do with human factors.

First, a brief parable about real-life inheritance.


You find yourself in conversation with an indolent heiress-in-waiting. She complains of her boredom whiling away the time until the dowager countess finally leaves her her fortune.

“Inheritance is bad”, you opine. “It’s better to make your own way in life”.

“By George, you’re right!” she exclaims. You weren’t expecting such an enthusiastic reversal.

“Well,”, you sputter, “glad to see you are turning over a new leaf”.

She crosses the room to open a sturdy mahogany armoire, and draws forth a belt holstering a pistol and a menacing-looking sabre.

“Auntie has only the dwindling remnants of a legacy fortune. The real money has always been with my sister’s manufacturing concern. Why passively wait for Auntie to die, when I can murder my dear sister now, and take what is rightfully mine!”

Cinching the belt around her waist, she strides from the room animated and full of purpose, no longer indolent or in-waiting, but you feel less than satisfied with your advice.

It is, after all, important to understand what the problem with inheritance is.


The primary reason inheritance is bad is confusion between namespaces.

The most important role of code organization (division of code into files, modules, packages, subroutines, data structures, etc) is division of responsibility. In other words, Conway’s Law isn’t just an unfortunate accident of budgeting, but a fundamental property of software design.

For example, if we have a function called multiply(a, b) - its presence in our codebase suggests that if someone were to want to multiply two numbers together, it is multiply’s responsibility to know how to do so. If there’s a problem with multiplication, it’s the maintainers of multiply who need to go fix it.

And, with this responsibility comes authority over a specific scope within the code. So if we were to look at an implementation of multiply:

1
2
3
def multiply(a, b):
    product = a * b
    return product

The maintainers of multiply get to decide what product means in the context of their function. It’s possible, in Python, for some other funciton to reach into multiply with frame objects and mangle the meaning of product between its assignment and return, but it’s generally understood that it’s none of your business what product is, and if you touch it, all bets are off about the correctness of multiply. More importantly, if the maintainers of multiply wanted to bind other names, or change around existing names, like so, in a subsequent version:

1
2
3
4
5
def multiply(a, b):
    factor1 = a
    factor2 = b
    result = a * b
    return result

It is the maintainer of multiply’s job, not the caller of multiply, to make those decisions.

The same programmer may, at different times, be both a caller and a maintainer of multiply. However, they have to know which hat they’re wearing at any given time, so that they can know which stuff they’re still repsonsible for when they hand over multiply to be maintained by a different team.

It’s important to be able to forget about the internals of the local variables in the functions you call. Otherwise, abstractions give us no power: if you have to know the internals of everything you’re using, you can never build much beyond what’s already there, because you’ll be spending all your time trying to understand all the layers below it.

Classes complicate this process of forgetting somewhat. Properties of class instances “stick out”, and are visible to the callers. This can be powerful — and can be a great way to represent shared data structures — but this is exactly why we have the ._ convention in Python: if something starts with an underscore, and it’s not in a namespace you own, you shouldn’t mess with it. So: other._foo is not for you to touch, unless you’re maintaining type(other). self._foo is where you should put your own private state.

So if we have a class like this:

1
2
3
class A(object):
    def __init__(self):
        self._note = "a note"

we all know that A()._note is off-limits.

But then what happens here?

1
2
3
4
class B(A):
    def __init__(self):
        super().__init__()
        self._note = "private state for B()"

B()._note is also off limits for everyone but B, except... as it turns out, B doesn’t really own the namespace of self here, so it’s clashing with what A wants _note to mean. Even if, right now, we were to change it to _note2, the maintainer of A could, in any future release of A, add a new _note2 variable which conflicts with something B is using. A’s maintainers (rightfully) think they own self, B’s maintainers (reasonably) think that they do. This could continue all the way until we get to _note7, at which point it would explode violently.


So that’s why Inheritance is bad. It’s a bad way for two layers of a system to communicate because it leaves each layer nowhere to put its internal state that the other doesn’t need to know about. So what could be worse?

Let’s say we’ve convinced our junior programmer who wrote A that inheritance is a bad interface, and they should instead use the panacea that cures all inherited ills, composition. Great! Let’s just write a B that composes in an A in a nice clean way, instead of doing any gross inheritance:

1
2
3
4
class Bprime(object):
    def __init__(self, a):
        for var in dir(a):
            setattr(self, var, getattr(a, var))

Uh oh. Looks like composition is worse than inheritance.


Let’s enumerate some of the issues with this “solution” to the problem of inheritance:

  • How do we know what attributes Bprime has?
  • How do we even know what type a is?
  • How is anyone ever going to grep for relevant methods in this code and have them come up in the right place?

We briefly reclaimed self for Bprime by removing the inheritance from A, but what Bprime does in __init__ to replace it is much worse. At least with normal, “vertical” inheritance, IDEs and code inspection tools can have some idea where your parents are and what methods they declare. We have to look aside to know what’s there, but at least it’s clear from the code’s structure where exactly we have to look aside to.

When faced with a class like Bprime though, what does one do? It’s just shredding apart some apparently totally unrelated object, there’s nearly no way for tooling to inspect this code to the point that they know where self.<something> comes from in a method defined on Bprime.

The goal of replacing inheritance with composition is to make it clear and easy to understand what code owns each attribute on self. Sometimes that clarity comes at the expense of a few extra keystrokes; an __init__ that copies over a few specific attributes, or a method that does nothing but forward a message, like def something(self): return self.other.something().

Automatic composition is just lateral inheritance. Magically auto-proxying all methods1, or auto-copying all attributes, saves a few keystrokes at the time some new code is created at the expense of hours of debugging when it is being maintained. If readability counts, we should never privilege the writer over the reader.


  1. It is left as an exercise for the reader why proxyForInterface is still a reasonably okay idea even in the face of this criticism.2 

  2. Although ironically it probably shouldn’t use inheritance as its interface. 

What Am Container

Containers are a tool in the fight against evil.

Perhaps you are a software developer.

Perhaps, as a developer, you have recently become familiar with the term "containers".

Perhaps you have heard containers described as something like "LXC, but better", "an application-level interface to cgroups" or "like virtual machines, but lightweight", or perhaps (even less usefully), a function call. You've probably heard of "docker"; do you wonder whether a container is the same as, different from, or part of an Docker?

Are you are bewildered by the blisteringly fast-paced world of "containers"? Maybe you have no trouble understanding what they are - in fact you might be familiar with a half a dozen orchestration systems and container runtimes already - but frustrated because this seems like a whole lot of work and you just don't see what the point of it all is?

If so, this article is for you.

I'd like to lay out what exactly the point of "containers" are, why people are so excited about them, what makes the ecosystem around them so confusing. Unlike my previous writing on the topic, I'm not going to assume you know anything about the ecosystem in general; just that you have a basic understanding of how UNIX-like operating systems separate processes, files, and networks.1


At the dawn of time, a computer was a single-tasking machine. Somehow, you'd load your program into main memory, and then you'd turn it on; it would run the program, and (if you're lucky) spit out some output onto paper tape.

When a program running on such a computer looked around itself, it could "see" the core memory of the computer it was running on, any attached devices, including consoles, printers, teletypes, or (later) networking equipment. This was of course very powerful - the program had full control of everything attached to the computer - but also somewhat limiting.

This mode of addressing hardware is limiting because it meant that programs would break the instant you moved them to a new computer. They had to be re-written to accommodate new amounts and types of memory, new sizes and brands of storage, new types of networks. If the program had to contain within itself the full knowledge of every piece of hardware that it might ever interact with, it would be very expensive indeed.

Also, if all the resources of a computer were dedicated to one program, then you couldn't run a second program without stomping all over the first one - crashing it by mangling its structures in memory, deleting its data by overwriting its data on disk.

So, programmers cleverly devised a way of indirecting, or "virtualizing", access to hardware resources. Instead of a program simply addressing all the memory in the whole computer, it got its own little space where it could address its own memory - an address space, if you will. If a program wanted more memory, it would ask a supervising program - what we today call a "kernel" - to give it some more memory. This made programs much simpler: instead of memorizing the address offsets where a particular machine kept its memory, a program would simply begin by saying "hey operating system, give me some memory", and then it would access the memory in its own little virtual area.

In other words: memory allocation is just virtual RAM.

Virtualizing memory - i.e. ephemeral storage - wasn't enough; in order to save and transfer data, programs also had to virtualize disk - i.e. persistent storage. Whereas a whole-computer program would just seek to position 0 on the disk and start writing data to it however it pleased, a program writing to a virtualized disk - or, as we might call it today, a "file" - first needed to request a file from the operating system.

In other words: file systems are just virtual disks.

Networking was treated in a similar way. Rather than addressing the entire network connection at once, each program could allocate a little slice of the network - a "port". That way a program could, instead of consuming all network traffic destined for the entire machine, ask the operating system to just deliver it all the traffic for, say, port number seven.

In other words: listening ports are just virtual network cards.


Getting bored by all this obvious stuff yet? Good. One of the things that frustrates me the most about containers is that they are an incredibly obvious idea that is just a logical continuation of a trend that all programmers are intimately familiar with.


All of these different virtual resources exist for the same reason: as I said earlier, if two programs need the same resource to function properly, and they both try to use it without coordinating, they'll both break horribly.2

UNIX-like operating systems more or less virtualize RAM correctly. When one program grabs some RAM, nobody else - modulo super-powered administrative debugging tools - gets to use it without talking to that program. It's extremely clear which memory belongs to which process. If programs want to use shared memory, there is a very specific, opt-in protocol for doing so; it is basically impossible for it to happen by accident.

However, the abstractions we use for disks (filesystems) and network cards (listening ports and addresses) are significantly more limited. Every program on the computer sees the same file-system. The program itself, and the data the program stores, both live on the same file-system. Every program on the computer can see the same network information, can query everything about it, and can receive arbitrary connections. Permissions can remove certain parts of the filesystem from view (i.e. programs can opt-out) but it is far less clear which program "owns" certain parts of the filesystem; access must be carefully controlled, and sometimes mediated by administrators.

In particular, the way that UNIX manages filesystems creates an environment where "installing" a program requires manipulating state in the same place (the filesystem) where other programs might require different state. Popular package managers on UNIX-like systems (APT, RPM, and so on) rarely have a way to separate program installation even by convention, let alone by strict enforcement. If you want to do that, you have to re-compile the software with ./configure --prefix to hard-code a new location. And, fundamentally, this is why the package managers don't support installing to a different place: if the program can tell the difference between different installation locations, then it will, because its developers thought it should go in one place on the file system, and why not hard code it? It works on their machine.


In order to address this shortcoming of the UNIX process model, the concept of "virtualization" became popular. The idea of virtualization is simple: you write a program which emulates an entire computer, with its own storage media, network devices, and then you install an operating system on it. This completely resolves the over-sharing of resources: a process inside a virtual machine is in a very real sense running on a different computer than programs running on a different virtual machine on the same physical device.

However, virtualiztion is also an extremly heavy-weight blunt instrument. Since virtual machines are running operating systems designed for physical machines, they have tons of redundant hardware-management code; enormous amounts of operating system data which could be shared with the host, but since it's in the form of a disk image totally managed by the virtual machine's operating system, the host can't really peek inside to optimize anything. It also makes other kinds of intentional resource sharing very hard: any software to manage the host needs to be installed on the host, since if it is installed on the guest it won't have full access to the host's hardware.

I hate using the term "heavy-weight" when I'm talking about software - it's often bandied about as a content-free criticism - but the difference in overhead between running a virtual machine and a process is the difference between gigabytes and kilobytes; somewhere between 4-6 orders of magnitude. That's a huge difference.

This means that you need to treat virtual machines as multi-purpose, since one VM is too big to run just a single small program. Which means you often have to manage them almost as if they were physical harware.


When we run a program on a UNIX-like operating system, and by so running it, grant it its very own address space, we call the entity that we just created a "process".

This is how to understand a "container".

A "container" is what we get when we run a program and give it not just its own memory, but its own whole virtual filesystem and its own whole virtual network card.

The metaphor to processes isn't perfect, because a container can contain multiple processes with different memory spaces that share a single filesystem. But this is also where some of the "container ecosystem" fervor begins to creep in - this is why people interested in containers will religiously exhort you to treat a container as a single application, not to run multiple things inside it, not to SSH into it, and so on. This is because the whole point of containers is that they are lightweight - far closer in overhead to the size of a process than that of a virtual machine.

A process inside a container, if it queries the operating system, will see a computer where only it is running, where it owns the entire filesystem, and where any mounted disks were explicitly put there by the administrator who ran the container. In other words, if it wants to share data with another application, it has to be given the shared data; opt-in, not opt-out, the same way that memory-sharing is opt-in in a UNIX-like system.


So why is this so exciting?

In a sense, it really is just a lower-overhead way to run a virtual machine, as long as it shares the same kernel. That's not super exciting, by itself.

The reason that containers are more exciting than processes is the same reason that using a filesystem is more exciting than having to use a whole disk: sharing state always, inevitably, leads to brokenness. Opt-in is better than opt-out.

When you give a program a whole filesystem to itself, sharing any data explicitly, you eliminate even the possibility that some other program scribbling on a shared area of the filesystem might break it. You don't need package managers any more, only package installers; by removing the other functions of package managers (inventory, removal) they can be radically simplified, and less complexity means less brokenness.

When you give a program an entire network address to itself, exposing any ports explicitly, you eliminate even the possibility that some rogue program will expose a security hole by listening on a port you weren't expecting. You eliminate the possibility that it might clash with other programs on the same host, hard-coding the same port numbers or auto-discovering the same addresses.


In addition to the exciting things on the run-time side, containers - or rather, the things you run to get containers, "images"3, present some compelling improvements to the build-time side.

On Linux and Windows, building a software artifact for distribution to end-users can be quite challenging. It's challenging because it's not clear how to specify that you depend on certain other software being installed; it's not clear what to do if you have conflicting versions of that software that may not be the same as the versions already available on the user's computer. It's not clear where to put things on the filesystem. On Linux, this often just means getting all of your software from your operating system distributor.

You'll notice I said "Linux and Windows"; not the usual (linux, windows, mac) big-3 desktop platforms, and I didn't say anything about mobile OSes. That's because on macOS, Android, iOS, and Windows Metro, applications already run in their own containers. The rules of macOS containers are a bit weird, and very different from Docker containers, but if you have a Mac you can check out ~/Library/Containers to see the view of the world that the applications you're running can see. iOS looks much the same.

This is something that doesn't get discussed a lot in the container ecosystem, partially because everyone is developing technology at such a breakneck pace, but in many ways Linux server-side containerization is just a continuation of a trend that started on mainframe operating systems in the 1970s and has already been picked up in full force by mobile operating systems.

When one builds an image, one is building a picture of the entire filesystem that the container will see, so an image is a complete artifact. By contrast, a package for a Linux package manager is just a fragment of a program, leaving out all of its dependencies, to be integrated later. If an image runs on your machine, it will (except in some extremely unusual circumstances) run on the target machine, because everything it needs to run is fully included.

Because you build all the software an image requires into the image itself, there are some implications for server management. You no longer need to apply security updates to a machine - they get applied to one application at a time, and they get applied as a normal process of deploying new code. Since there's only one update process, which is "delete the old container, run a new one with a new image", updates can roll out much faster, because you can build an image, run tests for the image with the security updates applied, and be confident that it won't break anything. No more scheduling maintenance windows, or managing reboots (at least for security updates to applications and libraries; kernel updates are a different kettle of fish).


That's why it's exciting. So why's it all so confusing?5

Fundamentally the confusion is caused by there just being way too many tools. Why so many tools? Once you've accepted that your software should live in images, none of the old tools work any more. Almost every administrative, monitoring, or management tool for UNIX-like OSes depends intimately upon the ability to promiscuously share the entire filesystem with every other program running on it. Containers break these assumptions, and so new tools need to be built. Nobody really agrees on how those tools should work, and a wide variety of forces ranging from competitive pressure to personality conflicts make it difficult for the panoply of container vendors to collaborate perfectly4.

Many companies whose core business has nothing to do with infrastructure have gone through this reasoning process:

  1. Containers are so much better than processes, we need to start using them right away, even if there's some tooling pain in adopting them.
  2. The old tools don't work.
  3. The new tools from the tool vendors aren't ready.
  4. The new tools from the community don't work for our use-case.
  5. Time to write our own tool, just for our use-case and nobody else's! (Which causes problem #3 for somebody else, of course...)

A less fundamental reason is too much focus on scale. If you're running a small-scale web application which has a stable user-base that you don't expect a lot of growth in, there are many great reasons to adopt containers as opposed to automating your operations; and in fact, if you keep things simple, the very fact that your software runs in a container might obviate the need for a system-management solution like Chef, Ansible, Puppet, or Salt. You should totally adopt them and try to ignore the more complex and involved parts of running an orchestration system.

However, containers are even more useful at significant scale, which means that companies which have significant scaling problems invest in containers heavily and write about them prolifically. Many guides and tutorials on containers assume that you expect to be running a multi-million-node cluster with fully automated continuous deployment, blue-green zero-downtime deploys, a 1000-person operations team. It's great if you've got all that stuff, but building each of those components is a non-trivial investment.


So, where does that leave you, my dear reader?

You should absolutely be adopting "container technology", which is to say, you should probably at least be using Docker to build your software. But there are other, radically different container systems - like Sandstorm - which might make sense for you, depending on what kind of services you create. And of course there's a huge ecosystem of other tools you might want to use; too many to mention, although I will shout out to my own employer's docker-as-a-service Carina, which delivered this blog post, among other things, to you.

You shouldn't feel as though you need to do containers absolutely "the right way", or that the value of containerization is derived from adopting every single tool that you can all at once. The value of containers comes from four very simple things:

  1. It reduces the overhead and increases the performance of co-locating multiple applications on the same hardware,
  2. It forces you to explicitly call out any shared state or required resources,
  3. It creates a complete build pipeline that results in a software artifact that can be run without special installation or set-up instructions (at least, on the "software installation" side; you still might require configuration, of course), and
  4. It gives you a way to test exactly what you're deploying.

These benefits can combine and interact in surprising and interesting ways, and can be enhanced with a wide and growing variety of tools. But underneath all the hype and the buzz, the very real benefit of containerization is basically just that it is fixing a very old design flaw in UNIX.

Containers let you share less state, and shared mutable state is the root of all evil.


  1. If you have a more sophisticated understanding of memory, disks, and networks, you'll notice that everything I'm saying here is patently false, and betrays an overly simplistic understanding of the development of UNIX and the complexities of physical hardware and driver software. Please believe that I know this; this is an alternate history of the version of UNIX that was developed on platonically ideal hardware. The messy co-evolution of UNIX, preemptive multitasking, hardware offload for networks, magnetic secondary storage, and so on, is far too large to fit into the margins of this post. 

  2. When programs break horribly like this, it's called "multithreading". I have written some software to help you avoid it. 

  3. One runs an "executable" to get a process; one runs an "image" to get a container. 

  4. Although the container ecosystem is famously acrimonious, companies in it do actually collaborate better than the tech press sometimes give them credit for; the Open Container Project is a significant extraction of common technology from multiple vendors, many of whom are also competitors, to facilitate a technical substrate that is best for the community. 

  5. If it doesn't seem confusing to you, consider this absolute gem from the hilarious folks over at CircleCI. 

A Container Is A Function Call

There’s something missing from the Docker ecosystem: a type-checker.

It seems to me that the prevailing mental model among users of container technology1 right now is that a container is a tiny little virtual machine. It’s like a machine in the sense that it is provisioned and deprovisioned by explicit decisions, and we talk about “booting” containers. We configure it sort of like we configure a machine; dropping a bunch of files into a volume, setting some environment variables.

In my mind though, a container is something fundamentally different than a VM. Rather than coming from the perspective of “let’s take a VM and make it smaller so we can do cool stuff” - get rid of the kernel, get rid of fixed memory allocations, get rid of emulated memory access and instructions, so we can provision more of them at higher density... I’m coming at it from the opposite direction.

For me, containers are “let’s take a program and made it bigger so we can do cool stuff”. Let’s add in the whole user-space filesystem so it’s got all the same bits every time, so we don’t need to worry about library management, so we can ship it around from computer to computer as a self-contained unit. Awesome!

Of course, there are other ecosystems that figured this out a really long time ago, but having it as a commodity within the most popular server deployment environment has changed things.

Of course, an individual container isn’t a whole program. That’s why we need tools like compose to put containers together into a functioning whole. This makes a container not just a program, but rather, a part of a program. And of course, we all know what the smaller parts of a program are called:

Functions.2

A container of course is not the function itself; the image is the function. A container itself is a function call.

Perceived through this lens, it becomes apparent that Docker is missing some pretty important information. As a tiny VM, it has all the parts you need: it has an operating system (in the docker build) the ability to boot and reboot (docker run), instrumentation (docker inspect) debugging (docker exec) etc. As a really big function, it’s strangely anemic.

Specifically: in every programming language worth its salt, we have a type system; some mechanism to identify what parameters a function will take, and what return value it will have.

You might find this weird coming from a Python person, a language where

1
2
def foo(a, b, c):
    return a.x(c.d(b))

is considered an acceptable level of type documentation by some3; there’s no requirement to say what a, b, and c are. However, just because the type system is implicit, that doesn’t mean it’s not there, even in the text of the program. Let’s consider, from reading this tiny example, what we can discover:

  • foo takes 3 arguments, their names are “a”, “b”, and “c”, and it returns a value.
  • Somewhere else in the codebase there’s an object with an x method, which takes a single argument and also returns a value.
  • The type of <unknown>.x’s argument is the same as the return type of another method somewhere in the codebase, <unknown-2>.d

And so on, and so on. At runtime each of these types takes on a specific, concrete value, with a type, and if you set a breakpoint and single-step into it with a debugger, you can see each of those types very easily. Also at runtime you will get TypeError exceptions telling you exactly what was wrong with what you tried to do at a number of points, if you make a mistake.

The analogy to containers isn’t exact; inputs and outputs aren’t obviously in the shape of “arguments” and “return values”, especially since containers tend to be long-running; but nevertheless, a container does have inputs and outputs in the form of env vars, network services, and volumes.

Let’s consider the “foo” of docker, which would be the middle tier of a 3-tier web application (cribbed from a real live example):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
FROM pypy:2
RUN apt-get update -ym
RUN apt-get upgrade -ym
RUN apt-get install -ym libssl-dev libffi-dev
RUN pip install virtualenv
RUN mkdir -p /code/env
RUN virtualenv /code/env
RUN pwd

COPY requirements.txt /code/requirements.txt
RUN /code/env/bin/pip install -r /code/requirements.txt
COPY main /code/main
RUN chmod a+x /code/main

VOLUME /clf
VOLUME /site
VOLUME /etc/ssl/private

ENTRYPOINT ["/code/main"]

In this file, we can only see three inputs, which are filesystem locations: /clf, /site, and /etc/ssl/private. How is this different than our Python example, a language with supposedly “no type information”?

  • The image has no metadata explaining what might go in those locations, or what roles they serve. We have no way to annotate them within the Dockerfile.
  • What services does this container need to connect to in order to get its job done? What hostnames will it connect to, what ports, and what will it expect to find there? We have no way of knowing. It doesn’t say. Any errors about the failed connections will come in a custom format, possibly in logs, from the application itself, and not from docker.
  • What services does this container export? It could have used an EXPOSE line to give us a hint, but it doesn’t need to; and even if it did, all we’d have is a port number.
  • What environment variables does its code require? What format do they need to be in?
  • We do know that we could look in requirements.txt to figure out what libraries are going to be used, but in order to figure out what the service dependencies are, we’re going to need to read all of the code to all of them.

Of course, the one way that this example is unrealistic is that I deleted all the comments explaining all of those things. Indeed, best practice these days would be to include comments in your Dockerfiles, and include example compose files in your repository, to give users some hint as to how these things all wire together.

This sort of state isn’t entirely uncommon in programming languages. In fact, in this popular GitHub project you can see that large programs written in assembler in the 1960s included exactly this sort of documentation convention: huge front-matter comments in English prose.

That is the current state of the container ecosystem. We are at the “late ’60s assembly language” stage of orchestration development. It would be a huge technological leap forward to be able to communicate our intent structurally.


When you’re building an image, you’re building it for a particular purpose. You already pretty much know what you’re trying to do and what you’re going to need to do it.

  1. When instantiated, the image is going to consume network services. This is not just a matter of hostnames and TCP ports; those services need to be providing a specific service, over a specific protocol. A generic reverse proxy might be able to handle an arbitrary HTTP endpoint, but an API client needs that specific API. A database admin tool might be OK with just “it’s a database” but an application needs a particular schema.
  2. It’s going to consume environment variables. But not just any variables; the variables have to be in a particular format.
  3. It’s going to consume volumes. The volumes need to contain data in a particular format, readable and writable by a particular UID.
  4. It’s also going to produce all of these things; it may listen on a network service port, provision a database schema, or emit some text that needs to be fed back into an environment variable elsewhere.

Here’s a brief sketch of what I want to see in a Dockerfile to allow me to express this sort of thing:

1
2
3
4
5
6
7
8
9
FROM ...
RUN ...

LISTENS ON: TCP:80 FOR: org.ietf.http/com.example.my-application-api
CONNECTS TO: pgwritemaster.internal ON: TCP:5432 FOR: org.postgresql.db/com.example.my-app-schema
CONNECTS TO: {{ETCD_HOST}} ON: TCP:{{ETCD_PORT}} FOR: com.coreos.etcd/client-communication
ENVIRONMENT NEEDS: ETCD_HOST FORMAT: HOST(com.coreos.etcd/client-communication)
ENVIRONMENT NEEDS: ETCD_PORT FORMAT: PORT(com.coreos.etcd/client-communication)
VOLUME AT: /logs FORMAT: org.w3.clf REQUIRES: WRITE UID: 4321

An image thusly built would refuse to run unless:

  • Somewhere else on its network, there was an etcd host/port known to it, its host and port supplied via environment variables.
  • Somewhere else on its network, there was a postgres host, listening on port 5432, with a name-resolution entry of “pgwritemaster.internal”.
  • An environment variable for the etcd configuration was supplied
  • A writable volume for /logs was supplied, owned by user-ID 4321 where it could write common log format logs.

There are probably a lot of flaws in the specific syntax here, but I hope you can see past that, to the broader point that the software inside a container has precise expectations of its environment, and that we presently have no way of communicating those expectations beyond writing a Melvilleian essay in each Dockerfile comments, beseeching those who would run the image to give it what it needs.


Why bother with this sort of work, if all the image can do with it is “refuse to run”?

First and foremost, today, the image effectively won’t run. Oh, it’ll start up, and it’ll consume some resources, but it will break when you try to do anything with it. What this metadata will allow the container runtime to do is to tell you why the image didn’t run, and give you specific, actionable, fast feedback about what you need to do in order to fix the problem. You won’t have to go groveling through logs; which is always especially hard if the back-end service you forgot to properly connect to was the log aggregation service. So this will be an order of magnitude speed improvement on initial deployments and development-environment setups for utility containers. Whole applications typically already come with a compose file, of course, but ideally applications would be built out of functioning self-contained pieces and not assembled one custom container at a time.

Secondly, if there were a strong tooling standard for providing this metadata within the image itself, it might become possible for infrastructure service providers (like, ahem, my employer) to automatically detect and satisfy service dependencies. Right now, if you have a database as a service that lives outside the container system in production, but within the container system in development and test, there’s no way for the orchestration layer to say “good news, everyone! you can find the database you need here: ...”.

My main interest is in allowing open source software developers to give service operators exactly what they need, so the upstream developers can get useful bug reports. There’s a constant tension where volunteer software developers find themselves fielding bug reports where someone deployed their code in a weird way, hacked it up to support some strange environment, built a derived container that had all kinds of extra junk in it to support service discovery or logging or somesuch, and so they don’t want to deal with the support load that that generates. Both people in that exchange are behaving reasonably. The developers gave the ops folks a container that runs their software to the best of their abilities. The service vendors made the minimal modifications they needed to have the container become a part of their service fabric. Yet we arrive at a scenario where nobody feels responsible for the resulting artifact.

If we could just say what it is that the container needs in order to really work, in a way which was precise and machine-readable, then it would be clear where the responsibility lies. Service providers could just run the container unmodified, and they’d know very clearly whether or not they’d satisfied its runtime requirements. Open source developers - or even commercial service vendors! - could say very clearly what they expected to be passed in, and when they got bug reports, they’d know exactly how their service should have behaved.


  1. which mostly but not entirely just means “docker”; it’s weird, of course, because there are pieces that docker depends on and tools that build upon docker which are part of this, but docker remains the nexus. 

  2. Yes yes, I know that they’re not really functions Tristan, they’re subroutines, but that’s the word people use for “subroutines” nowadays. 

  3. Just to be clear: no it isn’t. Write a damn docstring, or at least some type annotations

Python Packaging Is Good Now

setup.py is your friend. It’s real sorry about what happened last time.

Okay folks. Time’s up. It’s too late to say that Python’s packaging ecosystem terrible any more. I’m calling it.

Python packaging is not bad any more. If you’re a developer, and you’re trying to create or consume Python libraries, it can be a tractable, even pleasant experience.

I need to say this, because for a long time, Python’s packaging toolchain was … problematic. It isn’t any more, but a lot of people still seem to think that it is, so it’s time to set the record straight.

If you’re not familiar with the history it went something like this:

The Dawn

Python first shipped in an era when adding a dependency meant a veritable Odyssey into cyberspace. First, you’d wait until nobody in your whole family was using the phone line. Then you’d dial your ISP. Once you’d finished fighting your SLIP or PPP client, you’d ask a netnews group if anyone knew of a good gopher site to find a library that could solve your problem. Once you were done with that task, you’d sign off the Internet for the night, and wait about 48 hours too see if anyone responded. If you were lucky enough to get a reply, you’d set up a download at the end of your night’s web-surfing.

pip search it wasn’t.

For the time, Python’s approach to dependency-handling was incredibly forward-looking. The import statement, and the pluggable module import system, made it easy to get dependencies from wherever made sense.

In Python 2.01, Distutils was introduced. This let Python developers describe their collections of modules abstractly, and added tool support to producing redistributable collections of modules and packages. Again, this was tremendously forward-looking, if somewhat primitive; there was very little to compare it to at the time.

Fast forwarding to 2004; setuptools was created to address some of the increasingly-common tasks that open source software maintainers were facing with distributing their modules over the internet. In 2005, it added easy_install, in order to provide a tool to automate resolving dependencies and downloading them into the right locations.

The Dark Age

Unfortunately, in addition to providing basic utilities for expressing dependencies, setuptools also dragged in a tremendous amount of complexity. Its author felt that import should do something slightly different than what it does, so installing setuptools changed it. The main difference between normal import and setuptools import was that it facilitated having multiple different versions of the same library in the same program at the same time. It turns out that that’s a dumb idea, but in fairness, it wasn’t entirely clear at the time, and it is certainly useful (and necessary!) to be able to have multiple versions of a library installed onto a computer at the same time.

In addition to these idiosyncratic departures from standard Python semantics, setuptools suffered from being unmaintained. It became a critical part of the Python ecosystem at the same time as the author was moving on to other projects entirely outside of programming. No-one could agree on who the new maintainers should be for a long period of time. The project was forked, and many operating systems’ packaging toolchains calcified around a buggy, ancient version.

From 2008 to 2012 or so, Python packaging was a total mess. It was painful to use. It was not clear which libraries or tools to use, which ones were worth investing in or learning. Doing things the simple way was too tedious, and doing things the automated way involved lots of poorly-documented workarounds and inscrutable failure modes.

This is to say nothing of the fact that there were critical security flaws in various parts of this toolchain. There was no practical way to package and upload Python packages in such a way that users didn’t need a full compiler toolchain for their platform.

To make matters worse for the popular perception of Python’s packaging prowess2, at this same time, newer languages and environments were getting a lot of buzz, ones that had packaging built in at the very beginning and had a much better binary distribution story. These environments learned lessons from the screw-ups of Python and Perl, and really got a lot of things right from the start.

Finally, the Python Package Index, the site which hosts all the open source packages uploaded by the Python community, was basically a proof-of-concept that went live way too early, had almost no operational resources, and was offline all the dang time.

Things were looking pretty bad for Python.


Intermission

Here is where we get to the point of this post - this is where popular opinion about Python packaging is stuck. Outdated information from this period abounds. Blog posts complaining about problems score high in web searches. Those who used Python during this time, but have now moved on to some other language, frequently scoff and dismiss Python as impossible to package, its packaging ecosystem as broken, PyPI as down all the time, and so on. Worst of all, bad advice for workarounds which are no longer necessary are still easy to find, which causes users to pre-emptively break their environments where they really don’t need to.


From The Ashes

In the midst of all this brokenness, there were some who were heroically, quietly, slowly fixing the mess, one gnarly bug-report at a time. pip was started, and its various maintainers fixed much of easy_install’s overcomplexity and many of its flaws. Donald Stufft stepped in both on Pip and PyPI and improved the availability of the systems it depended upon, as well as some pretty serious vulnerabilities in the tool itself. Daniel Holth wrote a PEP for the wheel format, which allows for binary redistribution of libraries. In other words, it lets authors of packages which need a C compiler to build give their users a way to not have one.

In 2013, setuptools and distribute un-forked, providing a path forward for operating system vendors to start updating their installations and allowing users to use something modern.

Python Core started distributing the ensurepip module along with both Python 2.7 and 3.3, allowing any user with a recent Python installed to quickly bootstrap into a sensible Python development environment with a one-liner.

A New Renaissance

I won’t give you a full run-down of the state of the packaging art. There’s already a website for that. I will, however, give you a précis of how much easier it is to get started nowadays. Today, if you want to get a sensible, up-to-date python development environment, without administrative privileges, all you have to do is:

1
2
3
$ python -m ensurepip --user
$ python -m pip install --user --upgrade pip
$ python -m pip install --user --upgrade virtualenv

Then, for each project you want to do, make a new virtualenv:

1
2
3
$ python -m virtualenv lets-go
$ . ./lets-go/bin/activate
(lets-go) $ _

From here on out, now the world is your oyster; you can pip install to your heart’s content, and you probably won’t even need to compile any C for most packages. These instructions don’t depend on Python version, either: as long as it’s up-to-date, the same steps work on Python 2, Python 3, PyPy and even Jython. In fact, often the ensurepip step isn’t even necessary since pip comes preinstalled. Running it if it’s unnecessary is harmless, even!

Other, more advanced packaging operations are much simpler than they used to be, too.

  • Need a C compiler? OS vendors have been working with the open source community to make this easier across the board:
    1
    2
    3
    4
    5
    $ apt install build-essential python-dev # ubuntu
    $ xcode-select --install # macOS
    $ dnf install @development-tools python-devel # fedora
    C:\> REM windows
    C:\> start https://www.microsoft.com/en-us/download/details.aspx?id=44266
    

Okay that last one’s not as obvious as it ought to be but they did at least make it freely available!

  • Want to upload some stuff to PyPI? This should do it for almost any project:

    1
    2
    3
    $ pip install twine
    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*
    
  • Want to build wheels for the wild and wooly world of Linux? There’s an app4 for that.

Importantly, PyPI will almost certainly be online. Not only that, but a new, revamped site will be “launching” any day now3.

Again, this isn’t a comprehensive resource; I just want to give you an idea of what’s possible. But, as a deeply experienced Python expert I used to swear at these tools six times a day for years; the most serious Python packaging issue I’ve had this year to date was fixed by cleaning up my git repo to delete a cache file.

Work Still To Do

While the current situation is good, it’s still not great.

Here are just a few of my desiderata:

  • We still need better and more universally agreed-upon tooling for end-user deployments.
  • Pip should have a GUI frontend so that users can write Python stuff without learning as much command-line arcana.
  • There should be tools that help you write and update a setup.py. Or a setup.python.json or something, so you don’t actually need to write code just to ship some metadata.
  • The error messages that you get when you try to build something that needs a C compiler and it doesn’t work should be clearer and more actionable for users who don’t already know what they mean.
  • PyPI should automatically build wheels for all platforms by default when you upload sdists; this is a huge project, of course, but it would be super awesome default behavior.

I could go on. There are lots of ways that Python packaging could be better.

The Bottom Line

The real takeaway here though, is that although it’s still not perfect, other languages are no longer doing appreciably better. Go is still working through a number of different options regarding dependency management and vendoring, and, like Python extensions that require C dependencies, CGo is sometimes necessary and always a problem. Node has had its own well-publicized problems with their dependency management culture and package manager. Hackage is cool and all but everything takes a literal geological epoch to compile.

As always, I’m sure none of this applies to Rust and Cargo is basically perfect, but that doesn’t matter, because nobody reading this is actually using Rust.

My point is not that packaging in any of these languages is particularly bad. They’re all actually doing pretty well, especially compared to the state of the general programming ecosystem a few years ago; many of them are making regular progress towards user-facing improvements.

My point is that any commentary suggesting they’re meaningfully better than Python at this point is probably just out of date. Working with Python packaging is more or less fine right now. It could be better, but lots of people are working on improving it, and the structural problems that prevented those improvements from being adopted by the community in a timely manner have almost all been addressed.

Go! Make some virtualenvs! Hack some setup.pys! If it’s been a while and your last experience was really miserable, I promise, it’s better now.


Am I wrong? Did I screw up a detail of your favorite language? Did I forget to mention the one language environment that has a completely perfect, flawless packaging story? Do you feel the need to just yell at a stranger on the Internet about picayune details? Feel free to get in touch!


  1. released in October, 2000 

  2. say that five times fast. 

  3. although I’m not sure what it means to “launch” when the site is online, and running against the production data-store, and you can use it for pretty much everything... 

  4. “app” meaning of course “docker container” 

The One Python Library Everyone Needs

Use attrs. Use it. Use it for everything.

attrs is still an excellent library, and it still has much to recommend it over the standard library for many applications. However, as of this update in 2023, much new code — perhaps even most of it — should just be using dataclasses. If you want to see where my thinking has moved on to in the intervening 7 years, you may want to check out this more recent post. To make a long story short though, I was right, and attrs set us on the road to a major upgrade to best practices for class definition in Python.

Do you write programs in Python? You should be using attrs.

Why, you ask? Don’t ask. Just use it.

Okay, fine. Let me back up.

I love Python; it’s been my primary programming language for 10+ years and despite a number of interesting developments in the interim I have no plans to switch to anything else.

But Python is not without its problems. In some cases it encourages you to do the wrong thing. Particularly, there is a deeply unfortunate proliferation of class inheritance and the God-object anti-pattern in many libraries.

One cause for this might be that Python is a highly accessible language, so less experienced programmers make mistakes that they then have to live with forever.

But I think that perhaps a more significant reason is the fact that Python sometimes punishes you for trying to do the right thing.

The “right thing” in the context of object design is to make lots of small, self-contained classes that do one thing and do it well. For example, if you notice your object is starting to accrue a lot of private methods, perhaps you should be making those “public”1 methods of a private attribute. But if it’s tedious to do that, you probably won’t bother.

Another place you probably should be defining an object is when you have a bag of related data that needs its relationships, invariants, and behavior explained. Python makes it soooo easy to just define a tuple or a list. The first couple of times you type host, port = ... instead of address = ... it doesn’t seem like a big deal, but then soon enough you’re typing [(family, socktype, proto, canonname, sockaddr)] = ... everywhere and your life is filled with regret. That is, if you’re lucky. If you’re not lucky, you’re just maintaining code that does something like values[0][7][4][HOSTNAME][“canonical”] and your life is filled with garden-variety pain rather than the more complex and nuanced emotion of regret.


This raises the question: is it tedious to make a class in Python? Let’s look at a simple data structure: a 3-dimensional cartesian coordinate. It starts off simply enough:

1
class Point3D(object):

So far so good. We’ve got a 3 dimensional point. What next?

1
2
class Point3D(object):
    def __init__(self, x, y, z):

Well, that’s a bit unfortunate. I just want a holder for a little bit of data, and I’ve already had to override a special method from the Python runtime with an internal naming convention? Not too bad, I suppose; all programming is weird symbols after a fashion.

At least I see my attribute names in there, that makes sense.

1
2
3
class Point3D(object):
    def __init__(self, x, y, z):
        self.x

I already said I wanted an x, but now I have to assign it as an attribute...

1
2
3
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x

... to x? Uh, obviously ...

1
2
3
4
5
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

... and now I have to do that once for every attribute, so this actually scales poorly? I have to type every attribute name 3 times?!?

Oh well. At least I’m done now.

1
2
3
4
5
6
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):

Wait what do you mean I’m not done.

1
2
3
4
5
6
7
8
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))

Oh come on. So I have to type every attribute name 5 times, if I want to be able to see what the heck this thing is when I’m debugging, which a tuple would have given me for free?!?!?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)

7 times?!?!?!?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)
    def __lt__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) < (other.x, other.y, other.z)

9 times?!?!?!?!?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from functools import total_ordering
@total_ordering
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def __repr__(self):
        return (self.__class__.__name__ +
                ("(x={}, y={}, z={})".format(self.x, self.y, self.z)))
    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)
    def __lt__(self, other):
        if not isinstance(other, self.__class__):
            return NotImplemented
        return (self.x, self.y, self.z) < (other.x, other.y, other.z)

Okay, whew - 2 more lines of code isn’t great, but now at least we don’t have to define all the other comparison methods. But now we’re done, right?

1
2
from unittest import TestCase
class Point3DTests(TestCase):

You know what? I’m done. 20 lines of code so far and we don’t even have a class that does anything; the hard part of this problem was supposed to be the quaternion solver, not “make a data structure which can be printed and compared”. I’m all in on piles of undocumented garbage tuples, lists, and dictionaries it is; defining proper data structures well is way too hard in Python.2


namedtuple to the (not really) rescue

The standard library’s answer to this conundrum is namedtuple. While a valiant first draft (it bears many similarities to my own somewhat embarrassing and antiquated entry in this genre) namedtuple is unfortunately unsalvageable. It exports a huge amount of undesirable public functionality which would be a huge compatibility nightmare to maintain, and it doesn’t address half the problems that one runs into. A full enumeration of its shortcomings would be tedious, but a few of the highlights:

  • Its fields are accessable as numbered indexes whether you want them to be or not. Among other things, this means you can’t have private attributes, because they’re exposed via the apparently public __getitem__ interface.
  • It compares equal to a raw tuple of the same values, so it’s easy to get into bizarre type confusion, especially if you’re trying to use it to migrate away from using tuples and lists.
  • It’s a tuple, so it’s always immutable. Sort of.

As to that last point, either you can use it like this:

1
Point3D = namedtuple('Point3D', ['x', 'y', 'z'])

in which case it doesn’t look like a type in your code; simple syntax-analysis tools without special cases won’t recognize it as one. You can’t give it any other behaviors this way, since there’s nowhere to put a method. Not to mention the fact that you had to type the class’s name twice.

Alternately you can use inheritance and do this:

1
2
class Point3D(namedtuple('_Point3DBase', 'x y z'.split()])):
    pass

This gives you a place you can put methods, and a docstring, and generally have it look like a class, which it is... but in return you now have a weird internal name (which, by the way, is what shows up in the repr, not the class’s actual name). However, you’ve also silently made the attributes not listed here mutable, a strange side-effect of adding the class declaration; that is, unless you add __slots__ = 'x y z'.split() to the class body, and then we’re just back to typing every attribute name twice.

And this doesn’t even mention the fact that science has proven that you shouldn’t use inheritance.

So, namedtuple can be an improvement if it’s all you’ve got, but only in some cases, and it has its own weird baggage.


Enter The attr

So here’s where my favorite mandatory Python library comes in.

Let’s re-examine the problem above. How do I make Point3D with attrs?

1
2
import attr
@attr.s

Since this isn’t built into the language, we do have to have 2 lines of boilerplate to get us started: the import and the decorator saying we’re about to use it.

1
2
3
import attr
@attr.s
class Point3D(object):

Look, no inheritance! By using a class decorator, Point3D remains a Plain Old Python Class (albeit with some helpful double-underscore methods tacked on, as we’ll see momentarily).

1
2
3
4
import attr
@attr.s
class Point3D(object):
    x = attr.ib()

It has an attribute called x.

1
2
3
4
5
6
import attr
@attr.s
class Point3D(object):
    x = attr.ib()
    y = attr.ib()
    z = attr.ib()

And one called y and one called z and we’re done.

We’re done? Wait. What about a nice string representation?

1
2
>>> Point3D(1, 2, 3)
Point3D(x=1, y=2, z=3)

Comparison?

1
2
3
4
5
6
>>> Point3D(1, 2, 3) == Point3D(1, 2, 3)
True
>>> Point3D(3, 2, 1) == Point3D(1, 2, 3)
False
>>> Point3D(3, 2, 3) > Point3D(1, 2, 3)
True

Okay sure but what if I want to extract the data defined in explicit attributes in a format appropriate for JSON serialization?

1
2
>>> attr.asdict(Point3D(1, 2, 3))
{'y': 2, 'x': 1, 'z': 3}

Maybe that last one was a little on the nose. But nevertheless, it’s one of many things that becomes easier because attrs lets you declare the fields on your class, along with lots of potentially interesting metadata about them, and then get that metadata back out.

1
2
3
4
5
>>> import pprint
>>> pprint.pprint(attr.fields(Point3D))
(Attribute(name='x', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None),
 Attribute(name='y', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None),
 Attribute(name='z', default=NOTHING, validator=None, repr=True, cmp=True, hash=True, init=True, convert=None))

I am not going to dive into every interesting feature of attrs here; you can read the documentation for that. Plus, it’s well-maintained, so new goodies show up every so often and I might miss something important. But attrs does a few key things that, once you have them, you realize that Python was sorely missing before:

  1. It lets you define types concisely, as opposed to the normally quite verbose manual def __init__.... Types without typing.
  2. It lets you say what you mean directly with a declaration rather than expressing it in a roundabout imperative recipe. Instead of “I have a type, it’s called MyType, it has a constructor, in the constructor I assign the property ‘A’ to the parameter ‘A’ (and so on)”, you say “I have a type, it’s called MyType, it has an attribute called a”, and behavior is derived from that fact, rather than having to later guess about the fact by reverse engineering it from behavior (for example, running dir on an instance, or looking at self.__class__.__dict__).
  3. It provides useful default behavior, as opposed to Python’s sometimes-useful but often-backwards defaults.
  4. It adds a place for you to put a more rigorous implementation later, while starting out simple.

Let’s explore that last point.

Progressive Enhancement

While I’m not going to talk about every feature, I’d be remiss if I didn’t mention a few of them. As you can see from those mile-long repr()s for Attribute above, there are a number of interesting ones.

For example: you can validate attributes when they are passed into an @attr.s-ified class. Our Point3D, for example, should probably contain numbers. For simplicity’s sake, we could say that that means instances of float, like so:

1
2
3
4
5
6
7
import attr
from attr.validators import instance_of
@attr.s
class Point3D(object):
    x = attr.ib(validator=instance_of(float))
    y = attr.ib(validator=instance_of(float))
    z = attr.ib(validator=instance_of(float))

The fact that we were using attrs means we have a place to put this extra validation: we can just add type information to each attribute as we need it. Some of these facilities let us avoid other common mistakes. For example, this is a popular “spot the bug” Python interview question:

1
2
3
4
5
6
7
class Bag:
    def __init__(self, contents=[]):
        self._contents = contents
    def add(self, something):
        self._contents.append(something)
    def get(self):
        return self._contents[:]

Fixing it, of course, becomes this:

1
2
3
4
5
class Bag:
    def __init__(self, contents=None):
        if contents is None:
            contents = []
        self._contents = contents

adding two extra lines of code.

contents inadvertently becomes a global varible here, making all Bag objects not provided with a different list share the same list. With attrs this instead becomes:

1
2
3
4
5
6
7
@attr.s
class Bag:
    _contents = attr.ib(default=attr.Factory(list))
    def add(self, something):
        self._contents.append(something)
    def get(self):
        return self._contents[:]

There are several other features that attrs provides you with opportunities to make your classes both more convenient and more correct. Another great example? If you want to be strict about extraneous attributes on your objects (or more memory-efficient on CPython), you can just pass slots=True at the class level - e.g. @attr.s(slots=True) - to automatically turn your existing attrs declarations a matching __slots__ attribute. All of these handy features allow you to make better and more powerful use of your attr.ib() declarations.


The Python Of The Future

Some people are excited about eventually being able to program in Python 3 everywhere. What I’m looking forward to is being able to program in Python-with-attrs everywhere. It exerts a subtle, but positive, design influence in all the codebases I’ve seen it used in.

Give it a try: you may find yourself surprised at places where you’ll now use a tidily explained class, where previously you might have used a sparsely-documented tuple, list, or a dict, and endure the occasional confusion from co-maintainers. Now that it’s so easy to have structured types that clearly point in the direction of their purpose (in their __repr__, in their __doc__, or even just in the names of their attributes), you might find you’ll use a lot more of them. Your code will be better for it; I know mine has been.


  1. Scare quotes here because the attributes aren’t meaningfully exposed to the caller, they’re just named publicly. This pattern, getting rid of private methods entirely and having only private attributes, probably deserves its own post... 

  2. And we hadn’t even gotten to the really exciting stuff yet: type validation on construction, default mutable values...