The History
Before dataclasses were added to Python in version 3.7 — in June of 2018 — the `__init__` special method had an important use. If you had a class representing a data structure — for example a `2DCoordinate`, with `x` and `y` attributes — you would want to be able to construct it as `2DCoordinate(x=1, y=2)`, which would require you to add an `__init__` method with `x` and `y` parameters.
The other options available at the time all had pretty bad problems:

- You could remove `2DCoordinate` from your public API and instead expose a `make_2d_coordinate` function and make it non-importable, but then how would you document your return or parameter types?
- You could document the `x` and `y` attributes and make the user assign each one themselves, but then `2DCoordinate()` would return an invalid object.
- You could default your coordinates to 0 with class attributes, and while that would fix the problem with option 2, this would now require all `2DCoordinate` objects to be not just mutable, but mutated at every call site.
- You could fix the problems with option 1 by adding a new abstract class that you could expose in your public API, but this would explode the complexity of every new public class, no matter how simple. To make matters worse, `typing.Protocol` didn’t even arrive until Python 3.8, so, in the pre-3.7 world this would condemn you to using concrete inheritance and declaring multiple classes even for the most basic data structure imaginable.
Also, an `__init__` method that does nothing but assign a few attributes doesn’t have any significant problems of its own. Given all the problems I just described with the alternatives, it makes sense that defining your own `__init__` became the obvious default choice in most cases.
However, by accepting “define a custom `__init__`” as the default way to allow users to create your objects, we make a habit of beginning every class with a pile of arbitrary code that gets executed every time it is instantiated.

Wherever there is arbitrary code, there are arbitrary problems.
The Problems
Let’s consider a data structure more complex than one that simply holds a couple of attributes. We will create one that represents a reference to some I/O in the external world: a `FileReader`.

Of course Python has its own open-file object abstraction, but I will be ignoring that for the purposes of the example.
Let’s assume a world where we have the following functions, in an imaginary `fileio` module:

- `open(path: str) -> int`
- `read(fileno: int, length: int) -> bytes`
- `close(fileno: int) -> None`
Our hypothetical `fileio.open` returns an integer representing a file descriptor¹, `fileio.read` allows us to read `length` bytes from an open file descriptor, and `fileio.close` closes that file descriptor, invalidating it for future use.
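If it helps to have something concrete behind those signatures, a stand-in for the imaginary module might look like this (my own sketch, just thin wrappers around the `os` module):

```python
# A possible stand-in for the imaginary fileio module, sketched as thin
# wrappers around the os module so the later examples have something to
# run against.
import os


def open(path: str) -> int:
    """Open path read-only and return an integer file descriptor."""
    return os.open(path, os.O_RDONLY)


def read(fileno: int, length: int) -> bytes:
    """Read up to length bytes from an open file descriptor."""
    return os.read(fileno, length)


def close(fileno: int) -> None:
    """Close the file descriptor, invalidating it for future use."""
    os.close(fileno)
```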
With the habit that we have built from writing thousands of `__init__` methods, we might want to write our `FileReader` class like this:
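(A sketch, assuming the imaginary `fileio` module above; the exact body is illustrative.)

```python
import fileio  # the imaginary module described above


class FileReader:
    def __init__(self, path: str) -> None:
        # Constructing the object immediately performs I/O.
        self._fd = fileio.open(path)

    def read(self, length: int) -> bytes:
        return fileio.read(self._fd, length)

    def close(self) -> None:
        fileio.close(self._fd)
```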
For our initial use-case, this is fine. Client code creates a `FileReader` by doing something like `FileReader("./config.json")`, which always creates a `FileReader` that maintains its file descriptor `int` internally as private state. This is as it should be; we don’t want user code to see or mess with `_fd`, as that might violate `FileReader`’s invariants. All the necessary work to construct a valid `FileReader` — i.e. the call to `open` — is always taken care of for you by `FileReader.__init__`.
However, additional requirements will creep in, and as they do, `FileReader.__init__` becomes increasingly awkward. Initially we only care about `fileio.open`, but later we may have to deal with a library that has its own reasons for managing the call to `fileio.open` by itself, and wants to give us an `int` that we use as our `_fd`. Now we have to resort to weird workarounds like:
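(Again a sketch; bypassing `__init__` with `object.__new__` is just one of several equally unpleasant options.)

```python
def reader_from_fd(fd: int) -> FileReader:
    # Skip __init__ entirely so we don't call fileio.open a second time,
    # then reach in and set the "private" attribute from outside.
    reader = object.__new__(FileReader)
    reader._fd = fd
    return reader
```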
Now, all those nice properties that we got from trying to force object construction to give us a valid object are gone. `reader_from_fd`’s type signature, which takes a plain `int`, has no way of even suggesting to client code how to ensure that it has passed in the right kind of `int`.
Testing is much more of a hassle, because we have to patch in our own copy of `fileio.open` any time we want an instance of a `FileReader` in a test without doing any real-life file I/O, even if we could (for example) share a single file descriptor among many `FileReader`s for testing purposes.
All of this also assumes a `fileio.open` that is synchronous. Although for literal file I/O this is more of a hypothetical concern, there are many types of networked resource which are really only available via an asynchronous (and thus: potentially slow, potentially error-prone) API. If you’ve ever found yourself wanting to type `async def __init__(self): ...` then you have seen this limitation in practice.
Comprehensively describing all the possible problems with this approach would end up being a book-length treatise on a philosophy of object-oriented design, so I will sum up by saying that the cause of all these problems is the same: we are inextricably linking the act of creating a data structure with whatever side-effects are most often associated with that data structure. If they are “often” associated with it, then by definition they are not “always” associated with it, and all the cases where they aren’t associated become unwieldy and potentially broken.
Defining an `__init__` is an anti-pattern, and we need a replacement for it.
The Solutions
I believe this tripartite assemblage of design techniques will address the problems raised above:

- using `dataclass` to define attributes,
- replacing behavior that would previously have been in `__init__` with a new classmethod that does the same thing, and
- using precise types to describe what a valid instance looks like.
Using `dataclass` attributes to create an `__init__` for you
To begin, let’s refactor `FileReader` into a `dataclass`. This does get us an `__init__` method, but it won’t be an arbitrary one we define ourselves; it will get the useful constraint enforced on it that it will just assign attributes.
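(A sketch of the same class as a dataclass, with the attribute assignment generated for us.)

```python
from dataclasses import dataclass

import fileio  # the imaginary module from before


@dataclass
class FileReader:
    _fd: int

    def read(self, length: int) -> bytes:
        return fileio.read(self._fd, length)

    def close(self) -> None:
        fileio.close(self._fd)
```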
Except... oops. In fixing the problems that we created with our custom `__init__` that calls `fileio.open`, we have re-introduced several problems that it solved:

- We have removed all the convenience of `FileReader("path")`. Now the user needs to import the low-level `fileio.open` again, making the most common type of construction both more verbose and less discoverable; if we want users to know how to build a `FileReader` in a practical scenario, we will have to add something in our documentation to point at a separate module entirely.
- There’s no enforcement of the validity of `_fd` as a file descriptor; it’s just some integer, which the user could easily pass an incorrect instance of, with no error.
In isolation, `dataclass` can’t solve all our problems, so let’s add in the second technique.
Using `classmethod` factories to create objects
We don’t want to require any additional imports, or require users to go looking at any other modules — or indeed anything other than `FileReader` itself — to figure out how to create a `FileReader` for its intended usage.
Luckily we have a tool that can easily address all of these concerns at once: `@classmethod`. Let’s define a `FileReader.open` class method:
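(A sketch, layering the factory onto the dataclass from the previous step.)

```python
from __future__ import annotations

from dataclasses import dataclass

import fileio  # the imaginary module from before


@dataclass
class FileReader:
    _fd: int

    @classmethod
    def open(cls, path: str) -> FileReader:
        # All the I/O happens here; the generated __init__ still only
        # assigns attributes.
        return cls(fileio.open(path))

    def read(self, length: int) -> bytes:
        return fileio.read(self._fd, length)

    def close(self) -> None:
        fileio.close(self._fd)
```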
Now, your callers can replace `FileReader("path")` with `FileReader.open("path")`, and get all the same benefits.
Additionally, if we needed to `await fileio.open(...)`, and thus we needed its signature to be `@classmethod async def open`, we are freed from the constraint of `__init__` as a special method. There is nothing that would prevent a `@classmethod` from being `async`, or indeed, from having any other modification to its return value, such as returning a `tuple` of related values rather than just the object being constructed.
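For instance, nothing stops the factory from being a coroutine; a sketch, assuming a hypothetical awaitable variant of `fileio.open`:

```python
from __future__ import annotations

from dataclasses import dataclass

import fileio  # assumed, in this variant, to expose a coroutine open()


@dataclass
class FileReader:
    _fd: int

    @classmethod
    async def open(cls, path: str) -> FileReader:
        # An __init__ can never be awaited; a factory classmethod can.
        return cls(await fileio.open(path))
```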
Using `NewType` to address object validity
Next, let’s address the slightly trickier issue of enforcing object validity. Our type signature calls this thing an `int`, and indeed, that is unfortunately what the lower-level `fileio.open` gives us, and that’s beyond our control. But for our own purposes, we can be more precise in our definitions, using `NewType`:
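(The definition itself is only a couple of lines.)

```python
from typing import NewType

# A FileDescriptor is still an int at run time, but the type checker will
# now refuse a plain int wherever a FileDescriptor is required.
FileDescriptor = NewType("FileDescriptor", int)
```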
There are a few different ways to address the underlying library, but for the sake of brevity, and to illustrate that this can be done with zero run-time overhead, let’s just insist to Mypy that we have versions of `fileio.open`, `fileio.read`, and `fileio.close` which actually already deal in `FileDescriptor` integers rather than regular ones.
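One way to do that, and this shim is my own assumption (a stub file for `fileio` would do just as well), is to re-export the low-level functions under more precisely typed names with `cast`, which simply returns its argument at run time:

```python
from typing import Callable, NewType, cast

import fileio  # the imaginary module from before

FileDescriptor = NewType("FileDescriptor", int)

# Hypothetical helper names; cast() is a run-time no-op, so this costs
# essentially nothing.
open_file = cast(Callable[[str], FileDescriptor], fileio.open)
read_file = cast(Callable[[FileDescriptor, int], bytes], fileio.read)
close_file = cast(Callable[[FileDescriptor], None], fileio.close)
```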
We do of course have to slightly adjust `FileReader`, too, but the changes are very small. Putting it all together, we get:
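(A sketch of the combined result, reusing `FileDescriptor` and the helper names from the previous snippet; those names are mine.)

```python
from __future__ import annotations

from dataclasses import dataclass

# FileDescriptor, open_file, read_file and close_file are defined in the
# previous snippet.


@dataclass
class FileReader:
    _fd: FileDescriptor

    @classmethod
    def open(cls, path: str) -> FileReader:
        # Still the only place that performs the "open" side effect.
        return cls(open_file(path))

    def read(self, length: int) -> bytes:
        return read_file(self._fd, length)

    def close(self) -> None:
        close_file(self._fd)
```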
Note that the main technique here is not necessarily using `NewType` specifically, but rather aligning an instance’s property of “has all attributes set” as closely as possible with an instance’s property of “fully valid instance of its class”; `NewType` is just a handy tool to enforce any necessary constraints on the places where you need to use a primitive type like `int`, `str` or `bytes`.
In Summary - The New Best Practice
From now on, when you’re defining a new Python class:

- Make it a dataclass².
- Use its default `__init__` method³.
- Add `@classmethod`s to provide your users convenient and discoverable ways to build your objects.
- Require that all dependencies be satisfied by attributes, so you always start with a valid object.
- Use `typing.NewType` to enforce any constraints on primitive data types (like `int` and `str`) which might have magical external attributes, like needing to come from a particular library, needing to be random, and so on.
If you define all your classes this way, you will get all the benefits of a custom `__init__` method:
- All consumers of your data structures will receive valid objects, because an object with all its attributes populated correctly is inherently valid.
- Users of your library will be presented with convenient ways to create your objects that do as much work as is necessary to make them easy to use, and they can discover these just by looking at the methods on your class itself.
Along with some nice new benefits:
- You will be future-proofed against new requirements for different ways that users may need to construct your object.
- If there are already multiple ways to instantiate your class, you can now give each of them a meaningful name; no need to have monstrosities like `def __init__(self, maybe_a_filename: int | str | None = None):`
- Your test suite can always construct an object by satisfying all its dependencies; no need to monkey-patch anything when you can always call the type and never do any I/O or generate any side effects.
Before dataclasses, it was always a bit weird that such a basic feature of the Python language — giving data to a data structure to make it valid — required overriding a method with 4 underscores in its name. `__init__` stuck out like a sore thumb. Other such methods like `__add__` or even `__repr__` were inherently customizing esoteric attributes of classes.
For many years now, that historical language wart has been resolved. `@dataclass`, `@classmethod`, and `NewType` give you everything you need to build classes which are convenient, idiomatic, flexible, testable, and robust.
Acknowledgments
Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “but what is a ‘class’, really?”.
1. If you aren’t already familiar, a “file descriptor” is an integer which has meaning only within your program; you tell the operating system to open a file, it says “I have opened file 7 for you”, and then whenever you refer to “7” it is that file, until you `close(7)`.
2. Or an attrs class, if you’re nasty.
3. Unless you have a really good reason to, of course. Backwards compatibility, or compatibility with another library, might be good reasons to do that. Or certain types of data-consistency validation which cannot be expressed within the type system. The most common example of these would be a class that requires consistency between two different fields, such as a “range” object where `start` must always be less than `end`. There are always exceptions to these types of rules. Still, it’s pretty much never a good idea to do any I/O in `__init__`, and nearly all of the remaining stuff that may sometimes be a good idea in edge-cases can be achieved with a `__post_init__` rather than writing a literal `__init__`.