One potato, two potato, three potato, four
Five potato, six potato, seven potato, more.
Traditional Children’s Counting Rhyme
Programmers waste enormous amounts of time thinking about, or worrying about,
the speed of noncritical parts of their programs, and these attempts at
efficiency actually have a strong negative impact when debugging and
maintenance are considered. We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root of all evil. Yet we
should not pass up our opportunities in that critical 3%.
Donald Knuth, “Structured Programming with go to statements”, Computing Surveys, Vol. 6, No. 4, December 1974, p. 268. (Emphasis mine.)
Knuth’s admonition about premature optimization is such a cliché among software
developers at this point that even the correction to include the full context
of the quote is itself a cliché.
Still, it’s a cliché for a reason: the speed at which software can be written
is in tension — if not necessarily in conflict — with the speed at which it
executes. As Nelson Elhage has explained, software can be qualitatively worse
when it is slow,
but spending time optimizing an algorithm before getting any feedback from
users or profiling the system as a whole can lead one down many blind alleys of
wasted effort.
In that same essay, Nelson further elaborates that performant foundations
simplify
architecture.
He then follows up with several bits of architectural advice that are highly
specific to parsing—compilers and type-checkers specifically—which, while good,
are hard to generalize beyond “optimizing performance early can also be good”.
So, here I will endeavor to generalize that advice. How does one provide a
performant architectural foundation without necessarily wasting a lot of time
on early micro-optimization?
Enter The Potato
Many years before Nelson wrote his excellent aforementioned essay, my
father coined a related term: “Potato
Programming”.
In modern vernacular, a
potato is
very slow hardware, and “potato programming” is the software equivalent of the
same.
The term comes from the rhyme that opened this essay, and is meant to evoke a
slow, childlike counting of individual elements as an algorithm operates upon
them. It is an unfortunately quite common software-architectural idiom
whereby interfaces are provided in terms of scalar values. In other words,
APIs that require you to use `for` loops or other forms of explicit,
individual, non-parallelized iteration. But this is all very abstract; an
example might help.
For a generic business-logic example, let’s consider the problem of monthly
recurring billing. Every month, we pull in the list of all subscriptions to
our service, and we bill them.
Since our hypothetical company has an account-management team that owns the UI
which updates subscriptions, and a billing backend team that writes code to
interface with 3rd-party payment providers, we’ll create 2 backends, here
represented by some `Protocol`s.
Finally, we’ll have an orchestration layer that puts them together to actually
run the billing. I will use `async` to indicate which things require a network
round trip:
```python
class SubscriptionService(Protocol):
    async def all_subscriptions(self) -> AsyncIterable[Subscription]:
        ...


class Subscription(Protocol):
    account_id: str
    to_charge_per_month: money


class BillingService(Protocol):
    async def bill_amount(self, account_id: str, amount: money) -> None:
        ...
```
To many readers, this may look like an entirely reasonable interface
specification; indeed, it looks like a lot of real, public-facing “REST”
APIs. An equally apparently-reasonable implementation of our orchestration
between them might look like this:
```python
async def billing(s: SubscriptionService, b: BillingService) -> None:
    async for sub in s.all_subscriptions():
        await b.bill_amount(sub.account_id, sub.to_charge_per_month)
```
This is, however, just about the slowest possible implementation of this
functionality. So, this is the bad version. Let’s talk about the good
version: no-tato programming, if you will. But first, some backstory.
Some Backstory
My father began his career as an
APL programmer, and
one of the key insights he took away from APL’s architecture is that, as he
puts it:
Computers like to do things over and over again. They like to do things on
arrays. They don’t want to do things on scalars. So, in fact, it’s not
possible to write a program that only does things on a scalar. [...] You
can’t have an ‘integer’ in APL, you can only have an ‘array of integers’.
There’s no `loop`s, there’s no `map`s.
APL, like Python, is typically executed via an interpreter, which means that,
like Python, execution of basic operations like calling functions can be quite slow.
However, unlike Python, its pervasive reliance upon arrays meant that almost
all of its operations could be safely parallelized, and would only get more and
more efficient as more and more parallel hardware was developed.
I said ‘unlike Python’ there, but in fact, my father first related this concept
to me regarding a part of the Python ecosystem which follows APL’s design
idiom: NumPy. NumPy cannot itself do anything to speed up Python’s fundamental
interpreted execution speed, but it can move the intensive numerical operations
that it implements into operations on arrays, rather than operations on
individual objects, whether numbers or not.
The performance difference involved in these two styles is not small. Consider
this case study which
shows a 5828% improvement when taking an algorithm from idiomatic pure
Python to NumPy.
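To make the two styles concrete, here is a minimal sketch of the same computation written both ways (the function names are my own, not from the case study; the exact speedup varies by machine, so none is claimed here):

```python
# Summing the squares of a million numbers: one potato at a time with a
# Python-level loop, versus a single vectorized array operation where
# the loop happens inside NumPy's compiled code.
import numpy as np


def sum_of_squares_scalar(xs: list[int]) -> int:
    # Scalar style: explicit iteration over individual elements.
    total = 0
    for x in xs:
        total += x * x
    return total


def sum_of_squares_array(xs: list[int]) -> int:
    # Array style: one expression over the whole array at once.
    arr = np.asarray(xs, dtype=np.int64)
    return int((arr * arr).sum())


values = list(range(1_000_000))
assert sum_of_squares_scalar(values) == sum_of_squares_array(values)
```

Both produce the same result; the difference is purely in how many times the interpreter itself has to run.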
This idiom is also more or less how GPU programming works. GPUs cannot operate
on individual values. You submit a program to the GPU, as well as a large
array of data, and the GPU executes the program on that data in parallel
across hundreds of tiny cores. Submitting individual values for the GPU to
work on would actually be much slower than just doing the work on the CPU
directly, due to the bus latency involved to transfer the data back and forth.
Back from the Backstory
This is all interesting for a class of numerical software — and indeed it
works very well there — but it may seem a bit abstract to web backend
developers just trying to glue together some internal microservice APIs, or
indeed most app developers who aren’t working in those specialized fields.
It’s not like Stripe is going to let you run their
payment service on your GPU.
However, the lesson generalizes quite well: anywhere you see an API defined in
terms of one-potato, two-potato iteration, ask yourself: “how can this be
turned into an array”? Let’s go back to our example.
The simplest change that we can make, as a consumer of these potato-shaped
APIs, is to submit them in parallel. So if we have to do the optimization in
the orchestration layer, we might get something more like this:
```python
from asyncio import AbstractEventLoop, Semaphore


async def one_bill(
    loop: AbstractEventLoop,
    sem: Semaphore,
    sub: Subscription,
    b: BillingService,
) -> None:
    await sem.acquire()

    async def work() -> None:
        try:
            await b.bill_amount(sub.account_id, sub.to_charge_per_month)
        finally:
            sem.release()

    loop.create_task(work())


async def billing(
    loop: AbstractEventLoop,
    s: SubscriptionService,
    b: BillingService,
    batch_size: int,
) -> None:
    sem = Semaphore(batch_size)
    async for sub in s.all_subscriptions():
        await one_bill(loop, sem, sub, b)
```
This is an improvement, but it’s a bit of a brute-force solution; a
multipotato, if you will. We’ve moved the work to the billing service faster,
but it still has to do just as much work. Maybe even more work, because now
it’s potentially got a lot more lock contention on its end. And we’re still
waiting for the `Subscription` objects to dribble out of the
`SubscriptionService`, potentially one request/response at a time.
In other words, we have used network concurrency as a hack to simulate a
performant design. But the back end that we have been given here is not
actually optimizable; we do not have a performant foundation. As you can
see, we have even had to change our local architecture a little bit here, to
include a `loop` parameter and a `batch_size` which we had not previously
contemplated.
A better-designed interface in the first place would look like this:
```python
class SubscriptionService(Protocol):
    async def all_subscriptions(
        self, batch_size: int,
    ) -> AsyncIterable[Sequence[Subscription]]:
        ...


class Subscription(Protocol):
    account_id: str
    to_charge_per_month: money


@dataclass
class BillingRequest:
    account_id: str
    amount: money


class BillingService(Protocol):
    async def submit_bills(
        self,
        bills: Sequence[BillingRequest],
    ) -> None:
        ...
```
Superficially, the implementation here looks slightly more awkward than our naive first attempt:
```python
async def billing(
    s: SubscriptionService, b: BillingService, batch_size: int
) -> None:
    async for sub_batch in s.all_subscriptions(batch_size):
        await b.submit_bills(
            [
                BillingRequest(sub.account_id, sub.to_charge_per_month)
                for sub in sub_batch
            ]
        )
```
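To see the batched shape of this interface end to end, here is a self-contained sketch with in-memory fakes standing in for both services (the fake classes, the sample data, and the `money = int` alias are my own illustrative assumptions; the post leaves the real implementations and the `money` type unspecified):

```python
# In-memory fakes for the batched SubscriptionService and BillingService,
# wired together by the batched billing() orchestrator.
import asyncio
from dataclasses import dataclass
from typing import AsyncIterable, Sequence

money = int  # stand-in type for this sketch


@dataclass
class Subscription:
    account_id: str
    to_charge_per_month: money


@dataclass
class BillingRequest:
    account_id: str
    amount: money


class InMemorySubscriptions:
    def __init__(self, subs: list[Subscription]) -> None:
        self._subs = subs

    async def all_subscriptions(
        self, batch_size: int
    ) -> AsyncIterable[Sequence[Subscription]]:
        # One "network round trip" yields a whole batch, not one record.
        for i in range(0, len(self._subs), batch_size):
            yield self._subs[i : i + batch_size]


class InMemoryBilling:
    def __init__(self) -> None:
        self.submitted: list[BillingRequest] = []

    async def submit_bills(self, bills: Sequence[BillingRequest]) -> None:
        self.submitted.extend(bills)


async def billing(s, b, batch_size: int) -> None:
    async for sub_batch in s.all_subscriptions(batch_size):
        await b.submit_bills(
            [
                BillingRequest(sub.account_id, sub.to_charge_per_month)
                for sub in sub_batch
            ]
        )


subs = [Subscription(f"acct-{n}", 10) for n in range(5)]
b = InMemoryBilling()
asyncio.run(billing(InMemorySubscriptions(subs), b, batch_size=2))
assert len(b.submitted) == 5  # 5 bills submitted across 3 round trips
```

Note that the orchestration layer needed no semaphore, no event-loop handle, and no task management; the batching lives entirely in the interface.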
However, while the implementation with batching in the backend is approximately
as performant as our parallel orchestration implementation, backend batching
has a number of advantages over parallel orchestration.
First, backend batching has less internal complexity; no need to have a
`Semaphore` in the orchestration layer, or to create tasks on an event loop.
There’s less surface area here for bugs.
Second, and more importantly: backend batching permits for future optimizations
within the backend services, which are much closer to the relevant data and
can achieve more substantial gains than we can as a client without knowledge of
their implementation.
There are many ways this might manifest, but consider that each of these
services has its own database, and has to submit queries and execute
transactions against that database.
In the subscription service, it’s faster to run a single `SELECT` statement
that returns a bunch of results than to select a single result at a time. On
the billing service’s end, it’s much faster to issue a single `INSERT` or
`UPDATE` and then `COMMIT` for N records at once than to concurrently issue a
ton of potentially related modifications in separate transactions.
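The database-side contrast can be sketched with the standard library’s `sqlite3` module (the `bills` table and its schema are invented for illustration):

```python
# One statement and one transaction per record, versus one executemany()
# and one COMMIT for all records at once.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (account_id TEXT, amount INTEGER)")

rows = [(f"acct-{n}", 10) for n in range(100)]

# Potato style: a round trip and a commit for every single row.
for row in rows[:50]:
    conn.execute("INSERT INTO bills VALUES (?, ?)", row)
    conn.commit()

# Batched style: one submission and one commit for the whole batch.
conn.executemany("INSERT INTO bills VALUES (?, ?)", rows[50:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM bills").fetchone()[0]
assert count == 100
```

With a real networked database the per-transaction overhead is far larger than in-memory SQLite, so the gap between the two styles only widens.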
Potato No Mo
The initial implementation within each of these backends can be as naive and
slow as necessary to achieve an MVP. You can do a `SELECT … LIMIT 1`
internally, if that’s easier, and performance is not important at first. There
can be a mountain of potatoes hidden behind the veil of that batched list. In
this way, you can avoid the potential trap of premature optimization. Maybe
this is a terrible factoring of services for your application in the first
place; best to have that prototype in place and functioning quickly so that you
can throw it out faster!
However, by initially designing an interface based on lists of things rather
than individual things, it’s much easier to hide irrelevant implementation
details from the client, and to achieve meaningful improvements when
optimizing.
Acknowledgements
This is the first post supported by my patrons, with a
topic suggested by a member of my Patreon!