One potato, two potato, three potato, four
Five potato, six potato, seven potato, more.

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Knuth, Donald
“Structured Programming with go to statements”
Computing Surveys, Vol. 6, No. 4, December 1974
(p. 268)
(Emphasis mine)

Knuth’s admonition about premature optimization is such a cliché among software developers at this point that even the correction to include the full context of the quote is itself a a cliché.

Still, it’s a cliché for a reason: the speed at which software can be written is in tension — if not necessarily in conflict — with the speed at which it executes. As Nelson Elhage has explained, software can be qualitatively worse when it is slow, but spending time optimizing an algorithm before getting any feedback from users or profiling the system as a whole can lead one down many blind alleys of wasted effort.

In that same essay, Nelson further elaborates that performant foundations simplify architecture¹. He then follows up with several bits of architectural advice that is highly specific to parsing—compilers and type-checkers specifically—which, while good, is hard to generalize beyond “optimizing performance early can also be good”.

So, here I will endeavor to generalize that advice. How does one provide a performant architectural foundation without necessarily wasting a lot of time on early micro-optimization?

Enter The Potato

Many years before Nelson wrote his excellent aforementioned essay, my father coined a related term: “Potato Programming”.

In modern vernacular, a potato is very slow hardware, and “potato programming” is the software equivalent of the same.

The term comes from the rhyme that opened this essay, and is meant to evoke a slow, childlike counting of individual elements as an algorithm operates upon them. it is an unfortunately quite common software-architectural idiom whereby interfaces are provided in terms of scalar values. In other words, APIs that require you to use for loops or other forms of explicit, individual, non-parallelized iteration. But this is all very abstract; an example might help.

For a generic business-logic example, let’s consider the problem of monthly recurring billing. Every month, we pull in the list of all of all subscriptions to our service, and we bill them.

Since our hypothetical company has an account-management team that owns the UI which updates subscriptions and a billing backend team that writes code to interface with 3rd-party payment providers, we’ll create 2 backends, here represented by some Protocols.

Finally, we’ll have an orchestration layer that puts them together to actually run the billing. I will use async to indicate which things require a network round trip:

class SubscriptionService(Protocol):
    async def all_subscriptions(self) -> AsyncIterable[Subscription]:
        ...

class Subscription(Protocol):
    account_id: str
    to_charge_per_month: money

class BillingService(Protocol):
    async def bill_amount(self, account_id: str, amount: money) -> None:
        ...

To many readers, this may look like an entirely reasonable interface specification; indeed, it looks like a lot of real, public-facing “REST” APIs. An equally apparently-reasonable implementation of our orchestration between them might look like this:

async def billing(s: SubscriptionService, b: BillingService) -> None:
    async for sub in s.all_subscriptions():
        await b.bill_amount(sub.account_id, sub.to_charge_per_month)

This is, however, just about the slowest implementation of this functionality that it’s possible to implement. So, this is the bad version. Let’s talk about the good version: no-tato programming, if you will. But first, some backstory.

Some Backstory

My father began his career as an APL programmer, and one of the key insights he took away from APL’s architecture is that, as he puts it:

Computers like to do things over and over again. They like to do things on arrays. They don’t want to do things on scalars. So, in fact, it’s not possible to write a program that only does things on a scalar. [...] You can’t have an ‘integer’ in APL, you can only have an ‘array of integers’. There’s no ‘loop’s, there’s no ‘map’s.

APL, like Python², is typically executed via an interpreter. Which means, like Python, execution of basic operations like calling functions can be quite slow. However, unlike Python, its pervasive reliance upon arrays meant that almost all of its operations could be safely parallelized, and would only get more and more efficient as more and more parallel hardware was developed.

I said ‘unlike Python’ there, but in fact, my father first related this concept to me regarding a part of the Python ecosystem which follows APL’s design idiom: NumPy. NumPy takes a similar approach: it cannot itself do anything to speed up Python’s fundamental interpreted execution speed³, but it can move the intensive numerical operations that it implements into operations on arrays, rather than operations on individual objects, whether numbers or not.

The performance difference involved in these two styles is not small. Consider this case study which shows a 5828% improvement⁴ when taking an algorithm from idiomatic pure Python to NumPy.

This idiom is also more or less how GPU programming works. GPUs cannot operate on individual values. You submit a program⁵ to the GPU, as well as a large array of data⁶, and the GPU executes the program on that data in parallel across hundreds of tiny cores. Submitting individual values for the GPU to work on would actually be much slower than just doing the work on the CPU directly, due to the bus latency involved to transfer the data back and forth.

Back from the Backstory

This is all interesting for a class of numerical software — and indeeed it works very well there — but it may seem a bit abstract to web backend developers just trying to glue together some internal microservice APIs, or indeed most app developers who aren’t working in those specialized fields. It’s not like Stripe is going to let you run their payment service on your GPU.

However, the lesson generalizes quite well: anywhere you see an API defined in terms of one-potato, two-potato iteration, ask yourself: “how can this be turned into an array”? Let’s go back to our example.

The simplest change that we can make, as a consumer of these potato-shaped APIs, is to submit them in parallel. So if we have to do the optimization in the orchestration layer, we might get something more like this:

from asyncio import Semaphore, AbstractEventLoop

async def one_bill(
    loop: AbstractEventLoop,
    sem: Semaphore,
    sub: Subscription,
    b: BillingService,
) -> None:
    await sem.acquire()
    async def work() -> None:
        try:
            await b.bill_amount(sub.account_id, sub.to_charge_per_month)
        finally:
            sem.release()
    loop.create_task(work)

async def billing(
    loop: AbstractEventLoop,
    s: SubscriptionService,
    b: BillingService,
    batch_size: int,
) -> None:
    sem = Semaphore(batch_size)
    async for sub in s.all_subscriptions():
        await one_bill(loop, sem, sub, b)

This is an improvement, but it’s a bit of a brute-force solution; a multipotato, if you will. We’ve moved the work to the billing service faster, but it still has to do just as much work. Maybe even more work, because now it’s potentially got a lot more lock-contention on its end. And we’re still waiting for the Subscription objects to dribble out of the SubscriptionService potentially one request/response at a time.

In other words, we have used network concurrency as a hack to simulate a performant design. But the back end that we have been given here is not actually optimizable; we do not have a performant foundation. As you can see, we have even had to change our local architecture a little bit here, to include a loop parameter and a batch_size which we had not previously contemplated.

A better-designed interface in the first place would look like this:

class SubscriptionService(Protocol):
    async def all_subscriptions(
        self, batch_size: int,
    ) -> AsyncIterable[Sequence[Subscription]]:
        ...

class Subscription(Protocol):
    account_id: str
    to_charge_per_month: money

@dataclass
class BillingRequest:
    account_id: str
    amount: money

class BillingService(Protocol):
    async def submit_bills(
        self,
        bills: Sequence[BillingRequest],
    ) -> None:
        ...

Superficially, the implementation here looks slightly more awkward than our naive first attempt:

async def billing(s: SubscriptionService, b: BillingService, batch_size: int) -> None:
    async for sub_batch in s.all_subscriptions(batch_size):
        await b.submit_bills(
            [
                BillingRequest(sub.account_id, sub.to_charge_per_month)
                for sub in sub_batch
            ]
        )

However, while the implementation with batching in the backend is approximately as performant as our parallel orchestration implementation, backend batching has a number of advantages over parallel orchestration.

First, backend batching has less internal complexity; no need to have a Semaphore in the orchestration layer, or to create tasks on an event loop. There’s less surface area here for bugs.

Second, and more importantly: backend batching permits for future optimizations within the backend services, which are much closer to the relevant data and can achieve more substantial gains than we can as a client without knowledge of their implementation.

There are many ways this might manifest, but consider that each of these services has their own database, and have got to submit queries and execute transactions on those databases.

In the subscription service, it’s faster to run a single SELECT statement that returns a bunch of results than to select a single result at a time. On the billing service’s end, it’s much faster to issue a single INSERT or UPDATE and then COMMIT for N records at once than to concurrently issue a ton of potentially related modifications in separate transactions.

Potato No Mo

The initial implementation within each of these backends can be as naive and slow as necessary to achieve an MVP. You can do a SELECT … LIMIT 1 internally, if that’s easier, and performance is not important at first. There can be a mountain of potatoes hidden behind the veil of that batched list. In this way, you can avoid the potential trap of premature optimization. Maybe this is a terrible factoring of services for your application in the first place; best to have that prototype in place and functioning quickly so that you can throw it out faster!

However, by initially designing an interface based on lists of things rather than individual things, it’s much easier to hide irrelevant implementation details from the client, and to achieve meaningful improvements when optimizing.

Acknowledgements

This is the first post supported by my patrons, with a topic suggested by a member of my Patreon!

It’s a really good essay, you should read it. ↩
Yes, I know it’s actually bytecode compiled and then run on a custom interpreting VM, but for the purposes of comparing these performance characteristics “interpreted” is a more accurate approximation. Don’t @ me. ↩
Although, thankfully, a lot of folks are now working very hard on that problem. ↩
No, not a typo, that’s a 4-digit improvement. ↩
Typically called a “shader” due to its origins in graphically shading polygons. ↩
The data may rerepresenting vertices in a 3-D mesh, pixels in texture data, or, in the case of general-purpose GPU programming, “just a bunch of floating-point numbers”. ↩