I am, as you might have guessed, a big fan of dynamic typing. Yet two
prominent systems I've designed, the Axiom object database and the
Asynchronous Messaging Protocol (AMP), have required systems for explicit
declarations of types: at a glance, static typing. Have I gone crazy? Am I
pining for my glory days as a Java programmer? What's wrong with me?
I believe the economics of in-memory and on-the-wire data structures are
very, very different. In-memory structures are cheap to create and
cheap to fix if you got their structure wrong. Static typing to ensure
their correctness is wasted effort. On the other hand, while
on-the-wire data structures (data structures which you exchange with other
programs) can be equally cheap to create, they can be exponentially more
expensive to maintain.
When you have an in-memory data structure, it's remarkably flexible.
It is, almost by definition, going to be thrown away, so you can afford to
change how it will be represented in subsequent runs of your program. So,
when your compiler complains at you for getting the static type
declarations wrong, it's just wasting your time. You have to write unit
tests anyway, and static typing makes unit testing harder. What if you
want a test that fakes just the method foo on an interface which also
requires baz, boz, and qux, so you can quickly test a caller of foo and
move on? A really good static type system will just figure that out for
you, but it probably needs to analyze your whole program to do it. Most
"statically typed" languages (such as the ones that actually exist) will
force you to write a huge mess of extra code which doesn't actually do
anything, just so all your round pegs can pretend to fit into square holes
well enough to get your job done.
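As a minimal sketch of that testing convenience in a dynamically typed
language: the names call_foo_twice and FakeFoo are made up for illustration,
and only foo, baz, boz, and qux come from the paragraph above. The fake
implements the one method the caller uses and nothing more.

```python
def call_foo_twice(thing):
    """Hypothetical caller under test: it only ever uses thing.foo()."""
    return [thing.foo(), thing.foo()]


class FakeFoo:
    """Test double that implements foo() and nothing else.

    The real interface would also require baz, boz, and qux, but the
    caller never touches them, so the fake does not need them.
    """

    def __init__(self):
        self.calls = 0

    def foo(self):
        # Count invocations so the test can check how the caller behaved.
        self.calls += 1
        return self.calls


fake = FakeFoo()
result = call_foo_twice(fake)
```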
But I don't have to convince you, dear reader. I'm sure the
audience of this blog is already deeply religious on this issue, and they've
got my religion. I'm just trying to make sure you understand I'm not
insane when I get to this next part.
The most important thing that I said about in-memory data structures, above,
is that you throw them away. It's important enough that I'll repeat it
a third time, for emphasis: you throw them away. As it so
happens, the inverse is the most important property of an on-the-wire data
structure. You can't throw it away. You have to live with it. Forever.
Oh, sure, you told your customers that they all have to upgrade to
protocol version 3.5, but they're still using 3.2. Unless you're
Blizzard Entertainment, you can't tell them to download the new
version every six weeks or go to hell. Even if you can do that
(and statistically speaking, you probably aren't Blizzard
Entertainment), you have to keep the old versions of the updater
protocol around so that when version 4.0 comes out all the laggards who
haven't even run your program since 3.0 can still manage to upgrade.
Here's the best part: your unit tests aren't going to help you, at least not
in the same way they would with your in-memory data. When you change
an in-memory data structure, you aren't supposed to have to change your unit
tests. If you want the behavior to stay the same, you don't change the
tests; if they start failing, you know something is wrong. With protocol
changes, though, you can have tests for the old protocol and tests
for the new protocol, but every time you make a protocol change you need
to add a new test for every version of the protocol which you still
support. Plus, you probably can't stop supporting older versions
of the protocol (see above).
If you've got a message X[3], and you're introducing X[4], you have to make
sure that X[4] can talk to X[3] and X[2] and X[1]. Each of those is
potentially a new test. Each one is more work. Even worse, it's
possible to introduce X[4] without realizing that you've done it! If
you have a new, optional argument, let's call it "y", to a dynamically-typed
protocol, your old tests (which didn't pass y) will pass. Your new
tests (which do pass y, to the newly-modified X[4]) also pass. But
there's a case which has now arisen which your tests did not detect: y could
be passed to a client which only supports X[3], and an error occurs.
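Here is a small sketch of that blind spot. Everything in it is hypothetical:
the handler names, the dispatch helper, and the arguments a and b are
invented, and a message is modeled as a plain dict of keyword arguments
rather than anything a real protocol library would hand you.

```python
def handle_x_v3(a, b):
    """Hypothetical receiver that only understands X[3]."""
    return a + b


def handle_x_v4(a, b, y=0):
    """Hypothetical receiver for X[4], which adds the optional argument y."""
    return a + b + y


def dispatch(handler, message):
    """Deliver a decoded message (a dict of arguments) to a handler."""
    return handler(**message)


old_message = {"a": 1, "b": 2}          # what the old tests send
new_message = {"a": 1, "b": 2, "y": 3}  # what the new tests send

# Both sets of tests run against the current code, X[4], and both pass.
old_test_result = dispatch(handle_x_v4, old_message)
new_test_result = dispatch(handle_x_v4, new_message)

# The untested case: a new sender talking to a receiver that is still X[3].
try:
    dispatch(handle_x_v3, new_message)
    untested_error = None
except TypeError as error:
    untested_error = error
```

The failing call only shows up because handle_x_v3 was kept around by hand;
in a real codebase that function was overwritten when X[4] was written.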
If this were an in-memory structure, that case would not exist.
There is no version of X currently in your code which cannot accept
y. Your tests ensure that. On the wire, you have to time-travel into the
past for your unit tests to discover the code which would cause them to
fail. You can't just do it once, either: maybe X[3] was designed to
ignore all optional parameters. You have to consider X[2] and
X[1]. You have to travel back to all points in time simultaneously.
This is why I said that the cost is exponential: you carry this cost forward
with each new supported version that gets released. Of course, there
are ways to reduce it. You can design your protocol such that
arguments which your implementation doesn't understand are ignored.
You can start adding version numbers to everything, or change the name of
every message every time some part of its schema changes. All of these
alternatives get tedious after a while.
So what does this have to do with static typing? Static type
declarations can save you a lot of this work. For one thing, it
becomes impossible to forget you're changing the protocol. Did you
change the data's types? If so, you need to add a compatibility
layer. These static type declarations give you key information: what
do the previous versions of the protocol look like? More importantly,
they give your code key information: is an automatic transformation
between these two versions of the data format possible? (If not, is
the manual transformation between these two versions correct?)
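One way this could look, as a rough sketch rather than how AMP or Axiom
actually does it: if each version's arguments are declared as data, code can
compare two versions and flag what needs attention. The SCHEMAS table, the
argument names, and the incompatibilities function are all assumptions made
up for this example.

```python
# Hypothetical declarations: each protocol version maps argument names to
# (type, required) pairs.
SCHEMAS = {
    3: {"a": (int, True), "b": (int, True)},
    4: {"a": (int, True), "b": (int, True), "y": (int, False)},
}


def incompatibilities(sender_version, receiver_version):
    """List the arguments that stop a message passing through unchanged."""
    sender = SCHEMAS[sender_version]
    receiver = SCHEMAS[receiver_version]
    problems = []
    for name, (kind, required) in receiver.items():
        if name not in sender:
            # A missing optional argument is fine; a required one is not.
            if required:
                problems.append(f"{name}: required by receiver, never sent")
        elif sender[name][0] is not kind:
            problems.append(f"{name}: type differs")
    for name in sender:
        if name not in receiver:
            problems.append(f"{name}: not understood by receiver")
    return problems


old_to_new = incompatibilities(3, 4)
new_to_old = incompatibilities(4, 3)
```

An empty list means the message can go through as-is; anything else is a
prompt to write (and test) a compatibility layer for that pair of versions.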
In a dynamically typed program, you can figure out what your in-memory types
are doing by running the debugger, inspecting the code that's calling them,
and simply reading the code. Sometimes this can be a bit spread out (in a
badly designed system, painfully spread out), but the key point is
that all the information you need is right in front of you, in the source
code. If you're working on code that is shipping data elsewhere
without an explicit schema, you have to have a full copy of the
revision history and some very fancy revision control tools telling
you what the protocol looked like in the past. (Or, perhaps, what the
protocol that some other piece of software has developed used to look
like.)
Your disk is another kind of wire. This one is particularly brutal,
because while you might be able to tell someone to download a new
client to be able to access a service, there is no way you are
ever going to get away with saying "just delete all your data and
start again; there's a new version of the format." When writing
objects to disk (or to a database), you might not be talking across a
network, but you're still talking to a different program: a later
version of the one you're writing now. So these constraints all apply
to Axiom just as they do to AMP; more so, actually, because in the case of
AMP all the translations can be very simple and ad hoc, whereas in Axiom the
translations between data types need to be specifically implemented as
upgraders.
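To give a feel for what an upgrader is, here is a plain-Python sketch. It
does not use Axiom's real API; the record fields, the UPGRADERS table, and
the load function are invented for illustration. Each upgrader converts
stored data from one schema version to the next, and loading applies them in
order until the data matches the current version.

```python
def upgrade_1_to_2(record):
    # Hypothetical change: version 2 split "name" into "first" and "last".
    first, _, last = record["name"].partition(" ")
    return {"first": first, "last": last}


def upgrade_2_to_3(record):
    # Hypothetical change: version 3 added "email", defaulting to empty.
    return {**record, "email": ""}


# Maps a stored version to the function that upgrades it by one step.
UPGRADERS = {1: upgrade_1_to_2, 2: upgrade_2_to_3}
CURRENT_VERSION = 3


def load(version, record):
    """Bring a record written by any older version up to the current schema."""
    while version < CURRENT_VERSION:
        record = UPGRADERS[version](record)
        version += 1
    return record


upgraded = load(1, {"name": "Ada Lovelace"})
```

Because each step only needs to know about its immediate predecessor, data
written by a very old version can still be read without a direct 1-to-3
conversion ever being written.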
With a network involved, you also have to worry about an additional issue of
security. One way to deal with this is by adding linguistic
support for the notion of untrusted code running "somewhere else", but
type declarations can provide some benefit as well. Let's say that you
have a function that can be invoked by some networked code:
@myprotocol.expose()
def biggger(number):
    return number * 1000
Seems simple, seems safe enough, right? 'number' is a number taken
from the network, and you return a result to the network that is 1000 times
bigger. But... what if 'number' were, instead, a list of 10,000
elements? Now you've just consumed a huge amount of memory and sent
the caller 1000 times as much traffic as they've sent you. Dynamic
typing allows the client side of the network connection to pass in whatever
it wants.
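To see the amplification concretely, here is the same function called
directly. The decorator is left off, since myprotocol is only a stand-in
name in this post, and the list is kept small so the demonstration is cheap.

```python
def biggger(number):
    return number * 1000


# The intended use: an integer comes in, a bigger integer goes out.
as_intended = biggger(7)

# The abuse: a list comes in, and "*" repeats it 1000 times.
payload = [0] * 10
amplified = biggger(payload)
```

Ten elements in, ten thousand out; with a 10,000-element list the reply
would hold ten million.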
Now, let's look at a slightly different implementation of that function:
@myprotocol.expose(number=int)
def biggger(number):
    return number * 1000
Now, your protocol code has a critical hint that it needs to make this code
secure. You might spell it differently ("arguments = [('number',
Integer())]" comes to mind), but the idea is that the protocol code now
knows: if 'number' is not an integer, don't bother to call this
function. You can, of course, add checks to make sure that all the
methods you want to call on your arguments are safe, but that can get ugly
quickly.
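A minimal sketch of what such protocol code might do with the declaration
follows. The expose decorator here is a hypothetical stand-in written for
this example, not the API of AMP or any real library, and it assumes that
arguments arrive as already-decoded keyword arguments.

```python
import functools


def expose(**declared):
    """Hypothetical decorator: check declared argument types before calling."""

    def decorator(function):
        @functools.wraps(function)
        def from_network(**arguments):
            # Reject messages with missing or extra arguments outright.
            if set(arguments) != set(declared):
                raise ValueError("unexpected or missing arguments")
            for name, kind in declared.items():
                # Exact type match, so subclasses such as bool are refused too.
                if type(arguments[name]) is not kind:
                    raise ValueError(f"{name} must be {kind.__name__}")
            return function(**arguments)

        return from_network

    return decorator


@expose(number=int)
def biggger(number):
    return number * 1000
```

With this in place, biggger(number=7) is called as usual, while
biggger(number=[0] * 10000) is refused before the function body ever runs.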
Let's break it down.
Static type declarations have a cost. You (probably) have to type a
bunch of additional names for your types, which makes it difficult to write
code quickly. Therefore it is preferable to avoid that cost.
All the information you need about the code at runtime is present when
you're looking at your codebase. Therefore — although you may find its
form more convenient — static type declarations don't provide any additional
information about the code as it's running. However, information about
the code on opposite ends of the wire may only be in your repository
history, or it may not be in your code at all (it could be in a different
codebase entirely). Therefore static typing provides additional
information for the wire but not in memory.
At runtime, you only have to deal with one version of an object at a
time. On the wire, you might need to deal with a few different
versions simultaneously in the same process. Static type declarations
provide your application with information it may need to interact with those
older versions.
At runtime (at least in today's languages) you aren't worried about security
inside your process. Enforcing type safety at compile time
doesn't really add any security, especially with popular VMs like the JVM
not bothering to enforce type constraints in the bytecode, only in the
compiler. However, static type declarations can help the protocol
implementation understand the expectations of the application code so that
it does not get invoked with confusing or potentially dangerous
values. Therefore static type declarations can add security on the
wire while they can't add security in memory. (It turns out that if
you care about security in memory, you need to do a bunch of other
stuff, unrelated to type safety. When the rest of the world
catches up to the E language I may need to
revisit my ideas of how type safety helps here.)
If you have data that's being sent to another program, you probably need
static type declarations for that data. Or you need a lot of
memory to store all those lists I'm about to multiply by 1000 on your
server.