Data Classification

Does Python still have a need for class without @dataclass?

Is there a place for non-@dataclass classes in Python any more?

I have previously — and somewhat famously — written favorably about @dataclass’s venerable progenitor, attrs, and how you should use it for pretty much everything.

At the time, attrs was an additional dependency, a piece of technology that you could bolt on to your Python stack to make your particular code better. While I advocated for it strongly, there are all the usual implicit reasons against using a new thing. It was an additional dependency, it might not interoperate with other convenience mechanisms for type declarations that you were already using (i.e. NamedTuple), it might look weird to other Python programmers familiar with existing tools, and so on. I don’t think that any of these were good counterpoints, but there was nevertheless a robust discussion to be had in addressing them all.

But for many years now, dataclasses have been — and currently are — built in to the language. They are increasingly integrated to the toolchain at a deep level that is difficult for application code — or even other specialized tools — to replicate. Everybody knows what they are. Few or none of those reasons apply any longer.

For example, classes defined with @dataclass are now optimized as a C structure might be when you compile them with mypyc, a trick that is extremely useful in some circumstances, which even attrs itself now has trouble keeping up with.

This all raises the question for me: beyond backwards compatibility, is there any point to having non-@dataclass classes any more? Is there any remaining justification for writing them in new code?

Consider my original example, translated from attrs to dataclasses. First, the non-dataclass version:

1
2
3
4
5
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

And now the dataclass one:

1
2
3
4
5
6
7
from dataclasses import dataclass

@dataclass
class Point3D:
    x: int
    y: int
    z: int

Many of my original points still stand. It’s still less repetitive. In fewer characters, we’ve expressed considerably more information, and we get more functionality (repr, sorting, hashing, etc). There doesn’t seem to be much of a downside besides the strictness of the types, and if typing.Any were a builtin, x: any would be fine for those who don’t want to unduly constrain their code.

The one real downside of the latter over the former right now is the need for an import. Which, at this point, just seems… confusing? Wouldn’t it be nicer to be able to just write this:

1
2
3
4
class Point3D:
    x: int
    y: int
    z: int

and not need to faff around with decorator semantics and fudging the difference between Mypy (or Pyright or Pyre) type-check-time and Mypyc or Cython compile time? Or even better, to not need to explain the complexity of all these weird little distinctions to new learners of Python, and to have to cover import before class?

These tools all already treat the @dataclass decorator as a totally special language construct, not really like a decorator at all, so to really explore it you have to explain a special case and then a special case of a special case. The extension hook for this special case of the special case notwithstanding.

If we didn’t want any new syntax, we would need a from __future__ import dataclassification or some such for a while, but this doesn’t seem like an impossible bar to clear.


There are still some folks who don’t like type annotations at all, and there’s still the possibility of awkward implicit changes in meaning when transplanting code from a place with dataclassification enabled to one without, so perhaps an entirely new unambiguous syntax could be provided. One that more closely mirrors the meaning of parentheses in def, moving inheritance (a feature which, whether you like it or not, is clearly far less central to class definitions than ‘what fields do I have’) off to its own part of the syntax:

1
2
3
data Point3D(x: int, y: int, z: int) from Vector:
    def method(self):
        ...

which, for the “I don’t like types” contingent, could reduce to this in the minimal case:

1
2
data Point3D(x, y, z):
    pass

Just thinking pedagogically, I find it super compelling to imagine moving from teaching def foo(x, y, z):... to data Foo(x, y, z):... as opposed to @dataclass class Foo: x: int....

I don’t have any desire for semantic changes to accompany this, just to make it possible for newcomers to ignore the circuitous historical route of the @dataclass syntax and get straight into defining their own types with legible reprs from the very beginning of their Python journey.

(And make it possible for me to skip a couple of lines of boilerplate in short examples, as a bonus.)


I’m curious to know what y’all think, though. Shoot me an email or a toot and let me know.

In particular:

  1. Do you think there’s some reason I’m missing why Python’s current method for defining classes via a bunch of dunder methods is still better than dataclasses, or should stick around into the future for reasons beyond “compatibility”?
  2. Do you think “compatibility” is sufficient reason to keep the syntax the way it is forever, and I’m underestimating the cost of adding a keyword like this?
  3. If you do think that a change should be made, would you prefer:
    1. changing the meaning of class itself via a __future__ import,
    2. a new data keyword like the one I’ve proposed,
    3. a new keyword that functions exactly like the one I have proposed but really want to bikeshed the word data a bunch,
    4. something more incremental like just putting dataclass and field in builtins,
    5. or an option I haven’t even contemplated here?

If I find I’m not alone in this perhaps I will wander over to the Python discussion boards to have a more substantive conversation...


Thank you to my patrons who are helping me while I try to turn… whatever this is… along with open source maintenance and application development, into a real job. Do you want to see me pursue ideas like this one further? If so, you can support my work as a sponsor!

Modularity for Maintenance

Never send a human to do a machine’s job.

Never send a human to do a machine’s job.

One of the best things about maintaining open source in the modern era is that there are so many wonderful, free tools to let machines take care of the busy-work associated with collaboration, code-hosting, continuous integration, code quality maintenance, and so on.

There are lots of great resources that explain how to automate various things that make maintenance easier.

Here are some things you can configure your Python project to do:

  1. Continuous integration, using any one of a number of providers:
    1. GitHub Actions
    2. CircleCI
    3. Azure Pipelines
    4. Appveyor
    5. GitLab CI&CD
    6. Travis CI
  2. Separate multiple test jobs with tox
  3. Lint your code with flake8
  4. Type-Check your code with Mypy
  5. Auto-update your dependencies, with one of:
    1. pyup.io
    2. requires.io, or
    3. Dependabot
  6. automatically find common security issues with Bandit
  7. check the status of your code coverage, with:
    1. Coveralls, or
    2. Codecov
  8. Auto-format your code with:
    1. Black for style
    2. autopep8 to fix common errors
    3. isort to keep your imports tidy
  9. Help your developers remember to do all of those steps with pre-commit
  10. Automatically release your code to PyPI via your CI provider
    1. including automatically building any C code for multiple platforms as a wheel so your users won’t have to
    2. and checking those build artifacts:
      1. to make sure they include all the files they should, with check-manifest
      2. and also that the binary artifacts have the correct dependencies for Linux
      3. and also for macOS
  11. Organize your release notes and versioning with towncrier

All of these tools are wonderful.

But... let’s say you1 maintain a few dozen Python projects. Being a good maintainer, you’ve started splitting up your big monolithic packages into smaller ones, so your utility modules can be commonly shared as widely as possible rather than re-implemented once for each big frameworks. This is great!

However, every one of those numbered list items above is now a task per project that you have to repeat from scratch. So imagine a matrix with all of those down one side and dozens of projects across the top - the full Cartesian product of these little administrative tasks is a tedious and exhausting pile of work.

If you’re lucky enough to start every project close to perfect already, you can skip some of this work, but that partially just front-loads the tedium; plus, projects tend to start quite simple, then gradually escalate in complexity, so it’s helpful to be able to apply these incremental improvements one at a time, as your project gets bigger.

I really wish there were a tool that could take each of these steps and turn them into a quick command-line operation; like, I type pyautomate pypi-upload and the tool notices which CI provider I use, whether I use tox or not, and adds the appropriate configuration entries to both my CI and tox configuration to allow me to do that, possibly prompting me for a secret. Same for pyautomate code-coverage or what have you. All of these automations are fairly straightforward; almost all of the files you need to edit are easily parse-able either as yaml, toml, or ConfigParser2 files.

A few years ago, I asked for this to be added to CookieCutter, but I think the task is just too big and complicated to reasonably expect the existing maintainers to ever get around to it.

If you have a bunch of spare time, and really wanted to turbo-charge the Python open source community, eliminating tons of drag on already-over-committed maintainers, such a tool would be amazing.


  1. and by you, obviously, I mean “I” 

  2. “INI-like files”, I guess? what is this format even called? 

Python Packaging Is Good Now

setup.py is your friend. It’s real sorry about what happened last time.

Okay folks. Time’s up. It’s too late to say that Python’s packaging ecosystem terrible any more. I’m calling it.

Python packaging is not bad any more. If you’re a developer, and you’re trying to create or consume Python libraries, it can be a tractable, even pleasant experience.

I need to say this, because for a long time, Python’s packaging toolchain was … problematic. It isn’t any more, but a lot of people still seem to think that it is, so it’s time to set the record straight.

If you’re not familiar with the history it went something like this:

The Dawn

Python first shipped in an era when adding a dependency meant a veritable Odyssey into cyberspace. First, you’d wait until nobody in your whole family was using the phone line. Then you’d dial your ISP. Once you’d finished fighting your SLIP or PPP client, you’d ask a netnews group if anyone knew of a good gopher site to find a library that could solve your problem. Once you were done with that task, you’d sign off the Internet for the night, and wait about 48 hours too see if anyone responded. If you were lucky enough to get a reply, you’d set up a download at the end of your night’s web-surfing.

pip search it wasn’t.

For the time, Python’s approach to dependency-handling was incredibly forward-looking. The import statement, and the pluggable module import system, made it easy to get dependencies from wherever made sense.

In Python 2.01, Distutils was introduced. This let Python developers describe their collections of modules abstractly, and added tool support to producing redistributable collections of modules and packages. Again, this was tremendously forward-looking, if somewhat primitive; there was very little to compare it to at the time.

Fast forwarding to 2004; setuptools was created to address some of the increasingly-common tasks that open source software maintainers were facing with distributing their modules over the internet. In 2005, it added easy_install, in order to provide a tool to automate resolving dependencies and downloading them into the right locations.

The Dark Age

Unfortunately, in addition to providing basic utilities for expressing dependencies, setuptools also dragged in a tremendous amount of complexity. Its author felt that import should do something slightly different than what it does, so installing setuptools changed it. The main difference between normal import and setuptools import was that it facilitated having multiple different versions of the same library in the same program at the same time. It turns out that that’s a dumb idea, but in fairness, it wasn’t entirely clear at the time, and it is certainly useful (and necessary!) to be able to have multiple versions of a library installed onto a computer at the same time.

In addition to these idiosyncratic departures from standard Python semantics, setuptools suffered from being unmaintained. It became a critical part of the Python ecosystem at the same time as the author was moving on to other projects entirely outside of programming. No-one could agree on who the new maintainers should be for a long period of time. The project was forked, and many operating systems’ packaging toolchains calcified around a buggy, ancient version.

From 2008 to 2012 or so, Python packaging was a total mess. It was painful to use. It was not clear which libraries or tools to use, which ones were worth investing in or learning. Doing things the simple way was too tedious, and doing things the automated way involved lots of poorly-documented workarounds and inscrutable failure modes.

This is to say nothing of the fact that there were critical security flaws in various parts of this toolchain. There was no practical way to package and upload Python packages in such a way that users didn’t need a full compiler toolchain for their platform.

To make matters worse for the popular perception of Python’s packaging prowess2, at this same time, newer languages and environments were getting a lot of buzz, ones that had packaging built in at the very beginning and had a much better binary distribution story. These environments learned lessons from the screw-ups of Python and Perl, and really got a lot of things right from the start.

Finally, the Python Package Index, the site which hosts all the open source packages uploaded by the Python community, was basically a proof-of-concept that went live way too early, had almost no operational resources, and was offline all the dang time.

Things were looking pretty bad for Python.


Intermission

Here is where we get to the point of this post - this is where popular opinion about Python packaging is stuck. Outdated information from this period abounds. Blog posts complaining about problems score high in web searches. Those who used Python during this time, but have now moved on to some other language, frequently scoff and dismiss Python as impossible to package, its packaging ecosystem as broken, PyPI as down all the time, and so on. Worst of all, bad advice for workarounds which are no longer necessary are still easy to find, which causes users to pre-emptively break their environments where they really don’t need to.


From The Ashes

In the midst of all this brokenness, there were some who were heroically, quietly, slowly fixing the mess, one gnarly bug-report at a time. pip was started, and its various maintainers fixed much of easy_install’s overcomplexity and many of its flaws. Donald Stufft stepped in both on Pip and PyPI and improved the availability of the systems it depended upon, as well as some pretty serious vulnerabilities in the tool itself. Daniel Holth wrote a PEP for the wheel format, which allows for binary redistribution of libraries. In other words, it lets authors of packages which need a C compiler to build give their users a way to not have one.

In 2013, setuptools and distribute un-forked, providing a path forward for operating system vendors to start updating their installations and allowing users to use something modern.

Python Core started distributing the ensurepip module along with both Python 2.7 and 3.3, allowing any user with a recent Python installed to quickly bootstrap into a sensible Python development environment with a one-liner.

A New Renaissance

I won’t give you a full run-down of the state of the packaging art. There’s already a website for that. I will, however, give you a précis of how much easier it is to get started nowadays. Today, if you want to get a sensible, up-to-date python development environment, without administrative privileges, all you have to do is:

1
2
3
$ python -m ensurepip --user
$ python -m pip install --user --upgrade pip
$ python -m pip install --user --upgrade virtualenv

Then, for each project you want to do, make a new virtualenv:

1
2
3
$ python -m virtualenv lets-go
$ . ./lets-go/bin/activate
(lets-go) $ _

From here on out, now the world is your oyster; you can pip install to your heart’s content, and you probably won’t even need to compile any C for most packages. These instructions don’t depend on Python version, either: as long as it’s up-to-date, the same steps work on Python 2, Python 3, PyPy and even Jython. In fact, often the ensurepip step isn’t even necessary since pip comes preinstalled. Running it if it’s unnecessary is harmless, even!

Other, more advanced packaging operations are much simpler than they used to be, too.

  • Need a C compiler? OS vendors have been working with the open source community to make this easier across the board:
    1
    2
    3
    4
    5
    $ apt install build-essential python-dev # ubuntu
    $ xcode-select --install # macOS
    $ dnf install @development-tools python-devel # fedora
    C:\> REM windows
    C:\> start https://www.microsoft.com/en-us/download/details.aspx?id=44266
    

Okay that last one’s not as obvious as it ought to be but they did at least make it freely available!

  • Want to upload some stuff to PyPI? This should do it for almost any project:

    1
    2
    3
    $ pip install twine
    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*
    
  • Want to build wheels for the wild and wooly world of Linux? There’s an app4 for that.

Importantly, PyPI will almost certainly be online. Not only that, but a new, revamped site will be “launching” any day now3.

Again, this isn’t a comprehensive resource; I just want to give you an idea of what’s possible. But, as a deeply experienced Python expert I used to swear at these tools six times a day for years; the most serious Python packaging issue I’ve had this year to date was fixed by cleaning up my git repo to delete a cache file.

Work Still To Do

While the current situation is good, it’s still not great.

Here are just a few of my desiderata:

  • We still need better and more universally agreed-upon tooling for end-user deployments.
  • Pip should have a GUI frontend so that users can write Python stuff without learning as much command-line arcana.
  • There should be tools that help you write and update a setup.py. Or a setup.python.json or something, so you don’t actually need to write code just to ship some metadata.
  • The error messages that you get when you try to build something that needs a C compiler and it doesn’t work should be clearer and more actionable for users who don’t already know what they mean.
  • PyPI should automatically build wheels for all platforms by default when you upload sdists; this is a huge project, of course, but it would be super awesome default behavior.

I could go on. There are lots of ways that Python packaging could be better.

The Bottom Line

The real takeaway here though, is that although it’s still not perfect, other languages are no longer doing appreciably better. Go is still working through a number of different options regarding dependency management and vendoring, and, like Python extensions that require C dependencies, CGo is sometimes necessary and always a problem. Node has had its own well-publicized problems with their dependency management culture and package manager. Hackage is cool and all but everything takes a literal geological epoch to compile.

As always, I’m sure none of this applies to Rust and Cargo is basically perfect, but that doesn’t matter, because nobody reading this is actually using Rust.

My point is not that packaging in any of these languages is particularly bad. They’re all actually doing pretty well, especially compared to the state of the general programming ecosystem a few years ago; many of them are making regular progress towards user-facing improvements.

My point is that any commentary suggesting they’re meaningfully better than Python at this point is probably just out of date. Working with Python packaging is more or less fine right now. It could be better, but lots of people are working on improving it, and the structural problems that prevented those improvements from being adopted by the community in a timely manner have almost all been addressed.

Go! Make some virtualenvs! Hack some setup.pys! If it’s been a while and your last experience was really miserable, I promise, it’s better now.


Am I wrong? Did I screw up a detail of your favorite language? Did I forget to mention the one language environment that has a completely perfect, flawless packaging story? Do you feel the need to just yell at a stranger on the Internet about picayune details? Feel free to get in touch!


  1. released in October, 2000 

  2. say that five times fast. 

  3. although I’m not sure what it means to “launch” when the site is online, and running against the production data-store, and you can use it for pretty much everything... 

  4. “app” meaning of course “docker container”