The Futzing Fraction

At least some of your time with genAI will be spent just kind of… futzing with it.

The most optimistic vision of generative AI1 is that it will relieve us of the tedious, repetitive elements of knowledge work so that we can get to work on the really interesting problems that such tedium stands in the way of. Even if you fully believe in this vision, it’s hard to deny that today, some tedium is associated with the process of using generative AI itself.

Generative AI also isn’t free, and so, as responsible consumers, we need to ask: is it worth it? What’s the ROI of genAI, and how can we tell? In this post, I’d like to explore a logical framework for evaluating genAI expenditures, to determine if your organization is getting its money’s worth.

Perpetually Proffering Permuted Prompts

I think most LLM users would agree with me that a typical workflow with an LLM rarely involves prompting it only one time and getting a perfectly useful answer that solves the whole problem.

Generative AI best practices, even from the most optimistic vendors all suggest that you should continuously evaluate everything. ChatGPT, which is really the only genAI product with significantly scaled adoption, still says at the bottom of every interaction:

ChatGPT can make mistakes. Check important info.

If we have to “check important info” on every interaction, it stands to reason that even if we think it’s useful, some of those checks will find an error. Again, if we think it’s useful, presumably the next thing to do is to perturb our prompt somehow, and issue it again, in the hopes that the next invocation will, by dint of either:

  1. better luck this time with the stochastic aspect of the inference process,
  2. enhanced application of our skill to engineer a better prompt based on the deficiencies of the current inference, or
  3. better performance of the model by populating additional context in subsequent chained prompts.

Unfortunately, given the relative lack of reliable methods to re-generate the prompt and receive a better answer2, checking the output and re-prompting the model can feel like just kinda futzing around with it. You try, you get a wrong answer, you try a few more times, eventually you get the right answer that you wanted in the first place. It’s a somewhat unsatisfying process, but if you get the right answer eventually, it does feel like progress, and you didn’t need to use up another human’s time.

In fact, the hottest buzzword of the last hype cycle is “agentic”. While I have my own feelings about this particular word3, its current practical definition is “a generative AI system which automates the process of re-prompting itself, by having a deterministic program evaluate its outputs for correctness”.

A better term for an “agentic” system would be a “self-futzing system”.

However, the ability to automate some level of checking and re-prompting does not mean that you can fully delegate tasks to an agentic tool, either. It is, plainly put, not safe. If you leave the AI on its own, you will get terrible results that will at best make for a funny story45 and at worst might end up causing serious damage67.

Taken together, this all means that for any consequential task that you want to accomplish with genAI, you need an expert human in the loop. The human must be capable of independently doing the job that the genAI system is being asked to accomplish.

When the genAI guesses correctly and produces usable output, some of the human’s time will be saved. When the genAI guesses wrong and produces hallucinatory gibberish or even “correct” output that nevertheless fails to account for some unstated but necessary property such as security or scale, some of the human’s time will be wasted evaluating it and re-trying it.

Income from Investment in Inference

Let’s evaluate an abstract, hypothetical genAI system that can automate some work for our organization. To avoid implicating any specific vendor, let’s call the system “Mallory”.

Is Mallory worth the money? How can we know?

Logically, there are only two outcomes that might result from using Mallory to do our work.

  1. We prompt Mallory to do some work; we check its work, it is correct, and some time is saved.
  2. We prompt Mallory to do some work; we check its work, it fails, and we futz around with the result; this time is wasted.

As a logical framework, this makes sense, but ROI is an arithmetical concept, not a logical one. So let’s translate this into some terms.

In order to evaluate Mallory, let’s define the Futzing Fraction, “ FF ”, in terms of the following variables:

H

the average amount of time a Human worker would take to do a task, unaided by Mallory

I

the amount of time that Mallory takes to run one Inference8

C

the amount of time that a human has to spend Checking Mallory’s output for each inference

P

the Probability that Mallory will produce a correct inference for each prompt

W

the average amount of time that it takes for a human to Write one prompt for Mallory

E

since we are normalizing everything to time, rather than money, we do also have to account for the dollar of Mallory as as a product, so we will include the Equivalent amount of human time we could purchase for the marginal cost of one9 inference.

As in last week’s example of simple ROI arithmetic, we will put our costs in the numerator, and our benefits in the denominator.

FF = W+I+C+E P H

The idea here is that for each prompt, the minimum amount of time-equivalent cost possible is W+I+C+E. The user must, at least once, write a prompt, wait for inference to run, then check the output; and, of course, pay any costs to Mallory’s vendor.

If the probability of a correct answer is P=13, then they will do this entire process 3 times10, so we put P in the denominator. Finally, we divide everything by H, because we are trying to determine if we are actually saving any time or money, versus just letting our existing human, who has to be driving this process anyway, do the whole thing.

If the Futzing Fraction evaluates to a number greater than 1, as previously discussed, you are a bozo; you’re spending more time futzing with Mallory than getting value out of it.

Figuring out the Fraction is Frustrating

In order to even evaluate the value of the Futzing Fraction though, you have to have a sound method to even get a vague sense of all the terms.

If you are a business leader, a lot of this is relatively easy to measure. You vaguely know what H is, because you know what your payroll costs, and similarly, you can figure out E with some pretty trivial arithmetic based on Mallory’s pricing table. There are endless YouTube channels, spec sheets and benchmarks to give you I. W is probably going to be so small compared to H that it hardly merits consideration11.

But, are you measuring C? If your employees are not checking the outputs of the AI, you’re on a path to catastrophe that no ROI calculation can capture, so it had better be greater than zero.

Are you measuring P? How often does the AI get it right on the first try?

Challenges to Computing Checking Costs

In the fraction defined above, the term C is going to be large. Larger than you think.

Measuring P and C with a high degree of precision is probably going to be very hard; possibly unreasonably so, or too expensive12 to bother with in practice. So you will undoubtedly need to work with estimates and proxy metrics. But you have to be aware that this is a problem domain where your normal method of estimating is going to be extremely vulnerable to inherent cognitive bias, and find ways to measure.

Margins, Money, and Metacognition

First let’s discuss cognitive and metacognitive bias.

My favorite cognitive bias is the availability heuristic and a close second is its cousin salience bias. Humans are empirically predisposed towards noticing and remembering things that are more striking, and to overestimate their frequency.

If you are estimating the variables above based on the vibe that you’re getting from the experience of using an LLM, you may be overestimating its utility.

Consider a slot machine.

If you put a dollar in to a slot machine, and you lose that dollar, this is an unremarkable event. Expected, even. It doesn’t seem interesting. You can repeat this over and over again, a thousand times, and each time it will seem equally unremarkable. If you do it a thousand times, you will probably get gradually more anxious as your sense of your dwindling bank account becomes slowly more salient, but losing one more dollar still seems unremarkable.

If you put a dollar in a slot machine and it gives you a thousand dollars, that will probably seem pretty cool. Interesting. Memorable. You might tell a story about this happening, but you definitely wouldn’t really remember any particular time you lost one dollar.

Luckily, when you arrive at a casino with slot machines, you probably know well enough to set a hard budget in the form of some amount of physical currency you will have available to you. The odds are against you, you’ll probably lose it all, but any responsible gambler will have an immediate, physical representation of their balance in front of them, so when they have lost it all, they can see that their hands are empty, and can try to resist the “just one more pull” temptation, after hitting that limit.

Now, consider Mallory.

If you put ten minutes into writing a prompt, and Mallory gives a completely off-the-rails, useless answer, and you lose ten minutes, well, that’s just what using a computer is like sometimes. Mallory malfunctioned, or hallucinated, but it does that sometimes, everybody knows that. You only wasted ten minutes. It’s fine. Not a big deal. Let’s try it a few more times. Just ten more minutes. It’ll probably work this time.

If you put ten minutes into writing a prompt, and it completes a task that would have otherwise taken you 4 hours, that feels amazing. Like the computer is magic! An absolute endorphin rush.

Very memorable. When it happens, it feels like P=1.

But... did you have a time budget before you started? Did you have a specified N such that “I will give up on Mallory as soon as I have spent N minutes attempting to solve this problem with it”? When the jackpot finally pays out that 4 hours, did you notice that you put 6 hours worth of 10-minute prompt coins into it in?

If you are attempting to use the same sort of heuristic intuition that probably works pretty well for other business leadership decisions, Mallory’s slot-machine chat-prompt user interface is practically designed to subvert those sensibilities. Most business activities do not have nearly such an emotionally variable, intermittent reward schedule. They’re not going to trick you with this sort of cognitive illusion.

Thus far we have been talking about cognitive bias, but there is a metacognitive bias at play too: while Dunning-Kruger, everybody’s favorite metacognitive bias does have some problems with it, the main underlying metacognitive bias is that we tend to believe our own thoughts and perceptions, and it requires active effort to distance ourselves from them, even if we know they might be wrong.

This means you must assume any intuitive estimate of C is going to be biased low; similarly P is going to be biased high. You will forget the time you spent checking, and you will underestimate the number of times you had to re-check.

To avoid this, you will need to decide on a Ulysses pact to provide some inputs to a calculation for these factors that you will not be able to able to fudge if they seem wrong to you.

Problematically Plausible Presentation

Another nasty little cognitive-bias landmine for you to watch out for is the authority bias, for two reasons:

  1. People will tend to see Mallory as an unbiased, external authority, and thereby see it as more of an authority than a similarly-situated human13.
  2. Being an LLM, Mallory will be overconfident in its answers14.

The nature of LLM training is also such that commonly co-occurring tokens in the training corpus produce higher likelihood of co-occurring in the output; they’re just going to be closer together in the vector-space of the weights; that’s, like, what training a model is, establishing those relationships.

If you’ve ever used an heuristic to informally evaluate someone’s credibility by listening for industry-specific shibboleths or ways of describing a particular issue, that skill is now useless. Having ingested every industry’s expert literature, commonly-occurring phrases will always be present in Mallory’s output. Mallory will usually sound like an expert, but then make mistakes at random.15.

While you might intuitively estimate C by thinking “well, if I asked a person, how could I check that they were correct, and how long would that take?” that estimate will be extremely optimistic, because the heuristic techniques you would use to quickly evaluate incorrect information from other humans will fail with Mallory. You need to go all the way back to primary sources and actually fully verify the output every time, or you will likely fall into one of these traps.

Mallory Mangling Mentorship

So far, I’ve been describing the effect Mallory will have in the context of an individual attempting to get some work done. If we are considering organization-wide adoption of Mallory, however, we must also consider the impact on team dynamics. There are a number of possible potential side effects that one might consider when looking at, but here I will focus on just one that I have observed.

I have a cohort of friends in the software industry, most of whom are individual contributors. I’m a programmer who likes programming, so are most of my friends, and we are also (sigh), charitably, pretty solidly middle-aged at this point, so we tend to have a lot of experience.

As such, we are often the folks that the team — or, in my case, the community — goes to when less-experienced folks need answers.

On its own, this is actually pretty great. Answering questions from more junior folks is one of the best parts of a software development job. It’s an opportunity to be helpful, mostly just by knowing a thing we already knew. And it’s an opportunity to help someone else improve their own agency by giving them knowledge that they can use in the future.

However, generative AI throws a bit of a wrench into the mix.

Let’s imagine a scenario where we have 2 developers: Alice, a staff engineer who has a good understanding of the system being built, and Bob, a relatively junior engineer who is still onboarding.

The traditional interaction between Alice and Bob, when Bob has a question, goes like this:

  1. Bob gets confused about something in the system being developed, because Bob’s understanding of the system is incorrect.
  2. Bob formulates a question based on this confusion.
  3. Bob asks Alice that question.
  4. Alice knows the system, so she gives an answer which accurately reflects the state of the system to Bob.
  5. Bob’s understanding of the system improves, and thus he will have fewer and better-informed questions going forward.

You can imagine how repeating this simple 5-step process will eventually transform Bob into a senior developer, and then he can start answering questions on his own. Making sufficient time for regularly iterating this loop is the heart of any good mentorship process.

Now, though, with Mallory in the mix, the process now has a new decision point, changing it from a linear sequence to a flow chart.

We begin the same way, with steps 1 and 2. Bob’s confused, Bob formulates a question, but then:

  1. Bob asks Mallory that question.

Here, our path then diverges into a “happy” path, a “meh” path, and a “sad” path.

The “happy” path proceeds like so:

  1. Mallory happens to formulate a correct answer.
  2. Bob’s understanding of the system improves, and thus he will have fewer and better-informed questions going forward.

Great. Problem solved. We just saved some of Alice’s time. But as we learned earlier,

Mallory can make mistakes. When that happens, we will need to check important info. So let’s get checking:

  1. Mallory happens to formulate an incorrect answer.
  2. Bob investigates this answer.
  3. Bob realizes that this answer is incorrect because it is inconsistent with some of his prior, correct knowledge of the system, or his investigation.
  4. Bob asks Alice the same question; GOTO traditional interaction step 4.

On this path, Bob spent a while futzing around with Mallory, to no particular benefit. This wastes some of Bob’s time, but then again, Bob could have ended up on the happy path, so perhaps it was worth the risk; at least Bob wasn’t wasting any of Alice’s much more valuable time in the process.16

Notice that beginning at the start of step 4, we must begin allocating all of Bob’s time to C, so C already starts getting a bit bigger than if it were just Bob checking Mallory’s output specifically on tasks that Bob is doing.

That brings us to the “sad” path.

  1. Mallory happens to formulate an incorrect answer.
  2. Bob investigates this answer.
  3. Bob does not realize that this answer is incorrect because he is unable to recognize any inconsistencies with his existing, incomplete knowledge of the system.
  4. Bob integrates Mallory’s incorrect information of the system into his mental model.
  5. Bob proceeds to make a larger and larger mess of his work, based on an incorrect mental model.
  6. Eventually, Bob asks Alice a new, worse question, based on this incorrect understanding.
  7. Sadly we cannot return to the happy path at this point, because now Alice must unravel the complex series of confusing misunderstandings that Mallory has unfortunately conveyed to Bob at this point. In the really sad case, Bob actually doesn’t believe Alice for a while, because Mallory seems unbiased17, and Alice has to waste even more time convincing Bob before she can simply explain to him.

Now, we have wasted some of Bob’s time, and some of Alice’s time. Everything from step 5-10 is C, and as soon as Alice gets involved, we are now adding to C at double real-time. If more team members are pulled in to the investigation, you are now multiplying C by the number of investigators, potentially running at triple or quadruple real time.

But That’s Not All

Here I’ve presented a brief selection reasons why C will be both large, and larger than you expect. To review:

  1. Gambling-style mechanics of the user interface will interfere with your own self-monitoring and developing a good estimate.
  2. You can’t use human heuristics for quickly spotting bad answers.
  3. Wrong answers given to junior people who can’t evaluate them will waste more time from your more senior employees.

But this is a small selection of ways that Mallory’s output can cost you money and time. It’s harder to simplistically model second-order effects like this, but there’s also a broad range of possibilities for ways that, rather than simply checking and catching errors, an error slips through and starts doing damage. Or ways in which the output isn’t exactly wrong, but still sub-optimal in ways which can be difficult to notice in the short term.

For example, you might successfully vibe-code your way to launch a series of applications, successfully “checking” the output along the way, but then discover that the resulting code is unmaintainable garbage that prevents future feature delivery, and needs to be re-written18. But this kind of intellectual debt isn’t even specific to technical debt while coding; it can even affect such apparently genAI-amenable fields as LinkedIn content marketing19.

Problems with the Prediction of P

C isn’t the only challenging term though. P, is just as, if not more important, and just as hard to measure.

LLM marketing materials love to phrase their accuracy in terms of a percentage. Accuracy claims for LLMs in general tend to hover around 70%20. But these scores vary per field, and when you aggregate them across multiple topic areas, they start to trend down. This is exactly why “agentic” approaches for more immediately-verifiable LLM outputs (with checks like “did the code work”) got popular in the first place: you need to try more than once.

Independently measured claims about accuracy tend to be quite a bit lower21. The field of AI benchmarks is exploding, but it probably goes without saying that LLM vendors game those benchmarks22, because of course every incentive would encourage them to do that. Regardless of what their arbitrary scoring on some benchmark might say, all that matters to your business is whether it is accurate for the problems you are solving, for the way that you use it. Which is not necessarily going to correspond to any benchmark. You will need to measure it for yourself.

With that goal in mind, our formulation of P must be a somewhat harsher standard than “accuracy”. It’s not merely “was the factual information contained in any generated output accurate”, but, “is the output good enough that some given real knowledge-work task is done and the human does not need to issue another prompt”?

Surprisingly Small Space for Slip-Ups

The problem with reporting these things as percentages at all, however, is that our actual definition for P is 1attempts, where attempts for any given attempt, at least, must be an integer greater than or equal to 1.

Taken in aggregate, if we succeed on the first prompt more often than not, we could end up with a P>12, but combined with the previous observation that you almost always have to prompt it more than once, the practical reality is that P will start at 50% and go down from there.

If we plug in some numbers, trying to be as extremely optimistic as we can, and say that we have a uniform stream of tasks, every one of which can be addressed by Mallory, every one of which:

  • we can measure perfectly, with no overhead
  • would take a human 45 minutes
  • takes Mallory only a single minute to generate a response
  • Mallory will require only 1 re-prompt, so “good enough” half the time
  • takes a human only 5 minutes to write a prompt for
  • takes a human only 5 minutes to check the result of
  • has a per-prompt cost of the equivalent of a single second of a human’s time

Thought experiments are a dicey basis for reasoning in the face of disagreements, so I have tried to formulate something here that is absolutely, comically, over-the-top stacked in favor of the AI optimist here.

Would that be a profitable? It sure seems like it, given that we are trading off 45 minutes of human time for 1 minute of Mallory-time and 10 minutes of human time. If we ask Python:

1
2
3
4
5
>>> def FF(H, I, C, P, W, E):
...     return (W + I + C + E) / (P * H)
... FF(H=45.0, I=1.0, C=5.0, P=1/2, W=5.0, E=0.01)
...
0.48933333333333334

We get a futzing fraction of about 0.4896. Not bad! Sounds like, at least under these conditions, it would indeed be cost-effective to deploy Mallory. But… realistically, do you reliably get useful, done-with-the-task quality output on the second prompt? Let’s bump up the denominator on P just a little bit there, and see how we fare:

1
2
>>> FF(H=45.0, I=1.0, C=5.0, P=1/3, W=5.0, E=0.01)
0.734

Oof. Still cost-effective at 0.734, but not quite as good. Where do we cap out, exactly?

1
2
3
4
5
6
7
8
9
>>> from itertools import count
... for A in count(start=4):
...     print(A, result := FF(H=45.0, I=1.0, C=5.0, P=1 / A, W=5.0, E=1/60.))
...     if result > 1:
...         break
...
4 0.9792592592592594
5 1.224074074074074
>>>

With this little test, we can see that at our next iteration we are already at 0.9792, and by 5 tries per prompt, even in this absolute fever-dream of an over-optimistic scenario, with a futzing fraction of 1.2240, Mallory is now a net detriment to our bottom line.

Harm to the Humans

We are treating H as functionally constant so far, an average around some hypothetical Gaussian distribution, but the distribution itself can also change over time.

Formally speaking, an increase to H would be good for our fraction. Maybe it would even be a good thing; it could mean we’re taking on harder and harder tasks due to the superpowers that Mallory has given us.

But an observed increase to H would probably not be good. An increase could also mean your humans are getting worse at solving problems, because using Mallory has atrophied their skills23 and sabotaged learning opportunities2425. It could also go up because your senior, experienced people now hate their jobs26.

For some more vulnerable folks, Mallory might just take a shortcut to all these complex interactions and drive them completely insane27 directly. Employees experiencing an intense psychotic episode are famously less productive than those who are not.

This could all be very bad, if our futzing fraction eventually does head north of 1 and you need to reconsider introducing human-only workflows, without Mallory.

Abridging the Artificial Arithmetic (Alliteratively)

To reiterate, I have proposed this fraction:

FF = W+I+C+E P H

which shows us positive ROI when FF is less than 1, and negative ROI when it is more than 1.

This model is heavily simplified. A comprehensive measurement program that tests the efficacy of any technology, let alone one as complex and rapidly changing as LLMs, is more complex than could be captured in a single blog post.

Real-world work might be insufficiently uniform to fit into a closed-form solution like this. Perhaps an iterated simulation with variables based on the range of values seem from your team’s metrics would give better results.

However, in this post, I want to illustrate that if you are going to try to evaluate an LLM-based tool, you need to at least include some representation of each of these terms somewhere. They are all fundamental to the way the technology works, and if you’re not measuring them somehow, then you are flying blind into the genAI storm.

I also hope to show that a lot of existing assumptions about how benefits might be demonstrated, for example with user surveys about general impressions, or by evaluating artificial benchmark scores, are deeply flawed.

Even making what I consider to be wildly, unrealistically optimistic assumptions about these measurements, I hope I’ve shown:

  1. in the numerator, C might be a lot higher than you expect,
  2. in the denominator, P might be a lot lower than you expect,
  3. repeated use of an LLM might make H go up, but despite the fact that it's in the denominator, that will ultimately be quite bad for your business.

Personally, I don’t have all that many concerns about E and I. E is still seeing significant loss-leader pricing, and I might not be coming down as fast as vendors would like us to believe, if the other numbers work out I don’t think they make a huge difference. However, there might still be surprises lurking in there, and if you want to rationally evaluate the effectiveness of a model, you need to be able to measure them and incorporate them as well.

In particular, I really want to stress the importance of the influence of LLMs on your team dynamic, as that can cause massive, hidden increases to C. LLMs present opportunities for junior employees to generate an endless stream of chaff that will simultaneously:

  • wreck your performance review process by making them look much more productive than they are,
  • increase stress and load on senior employees who need to clean up unforeseen messes created by their LLM output,
  • and ruin their own opportunities for career development by skipping over learning opportunities.

If you’ve already deployed LLM tooling without measuring these things and without updating your performance management processes to account for the strange distortions that these tools make possible, your Futzing Fraction may be much, much greater than 1, creating hidden costs and technical debt that your organization will not notice until a lot of damage has already been done.

If you got all the way here, particularly if you’re someone who is enthusiastic about these technologies, thank you for reading. I appreciate your attention and I am hopeful that if we can start paying attention to these details, perhaps we can all stop futzing around so much with this stuff and get back to doing real work.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!


  1. I do not share this optimism, but I want to try very hard in this particular piece to take it as a given that genAI is in fact helpful. 

  2. If we could have a better prompt on demand via some repeatable and automatable process, surely we would have used a prompt that got the answer we wanted in the first place. 

  3. The software idea of a “user agent” straightforwardly comes from the legal principle of an agent, which has deep roots in common law, jurisprudence, philosophy, and math. When we think of an agent (some software) acting on behalf of a principal (a human user), this historical baggage imputes some important ethical obligations to the developer of the agent software. genAI vendors have been as eager as any software vendor to dodge responsibility for faithfully representing the user’s interests even as there are some indications that at least some courts are not persuaded by this dodge, at least by the consumers of genAI attempting to pass on the responsibility all the way to end users. Perhaps it goes without saying, but I’ll say it anyway: I don’t like this newer interpretation of “agent”. 

  4. “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents”, Axel Backlund, Lukas Petersson, Feb 20, 2025 

  5. “random thing are happening, maxed out usage on api keys”, @leojr94 on Twitter, Mar 17, 2025 

  6. “New study sheds light on ChatGPT’s alarming interactions with teens” 

  7. “Lawyers submitted bogus case law created by ChatGPT. A judge fined them $5,000”, by Larry Neumeister for the Associated Press, June 22, 2023 

  8. During which a human will be busy-waiting on an answer. 

  9. Given the fluctuating pricing of these products, and fixed subscription overhead, this will obviously need to be amortized; including all the additional terms to actually convert this from your inputs is left as an exercise for the reader. 

  10. I feel like I should emphasize explicitly here that everything is an average over repeated interactions. For example, you might observe that a particular LLM has a low probability of outputting acceptable work on the first prompt, but higher probability on subsequent prompts in the same context, such that it usually takes 4 prompts. For the purposes of this extremely simple closed-form model, we’d still consider that a P of 25%, even though a more sophisticated model, or a monte carlo simulation that sets progressive bounds on the probability, might produce more accurate values. 

  11. No it isn’t, actually, but for the sake of argument let’s grant that it is. 

  12. It’s worth noting that all this expensive measuring itself must be included in C until you have a solid grounding for all your metrics, but let’s optimistically leave all of that out for the sake of simplicity. 

  13. “AI Company Poll Finds 45% of Workers Trust the Tech More Than Their Peers”, by Suzanne Blake for Newsweek, Aug 13, 2025 

  14. AI Chatbots Remain Overconfident — Even When They’re Wrong by Jason Bittel for the Dietrich College of Humanities and Social Sciences at Carnegie Mellon University, July 22, 2025 

  15. AI Mistakes Are Very Different From Human Mistakes by Bruce Schneier and Nathan E. Sanders for IEEE Spectrum, Jan 13, 2025 

  16. Foreshadowing is a narrative device in which a storyteller gives an advance hint of an upcoming event later in the story. 

  17. “People are worried about the misuse of AI, but they trust it more than humans” 

  18. “Why I stopped using AI (as a Senior Software Engineer)”, theSeniorDev YouTube channel, Jun 17, 2025 

  19. “I was an AI evangelist. Now I’m an AI vegan. Here’s why.”, Joe McKay for the greatchatlinkedin YouTube channel, Aug 8, 2025 

  20. “What LLM is The Most Accurate?” 

  21. “Study Finds That 52 Percent Of ChatGPT Answers to Programming Questions are Wrong”, by Sharon Adarlo for Futurism, May 23, 2024 

  22. “Off the Mark: The Pitfalls of Metrics Gaming in AI Progress Races”, by Tabrez Syed on BoxCars AI, Dec 14, 2023 

  23. “I tried coding with AI, I became lazy and stupid”, by Thomasorus, Aug 8, 2025 

  24. “How AI Changes Student Thinking: The Hidden Cognitive Risks” by Timothy Cook for Psychology Today, May 10, 2025 

  25. “Increased AI use linked to eroding critical thinking skills” by Justin Jackson for Phys.org, Jan 13, 2025 

  26. “AI could end my job — Just not the way I expected” by Manuel Artero Anguita on dev.to, Jan 27, 2025 

  27. “The Emerging Problem of “AI Psychosis”” by Gary Drevitch for Psychology Today, July 21, 2025. 

I Think I’m Done Thinking About genAI For Now

The conversation isn’t over, but I don’t think I have much to add to it.

The Problem

Like many other self-styled thinky programmer guys, I like to imagine myself as a sort of Holmesian genius, making trenchant observations, collecting them, and then synergizing them into brilliant deductions with the keen application of my powerful mind.

However, several years ago, I had an epiphany in my self-concept. I finally understood that, to the extent that I am usefully clever, it is less in a Holmesian idiom, and more, shall we say, Monkesque.

For those unfamiliar with either of the respective franchises:

  • Holmes is a towering intellect honed by years of training, who catalogues intentional, systematic observations and deduces logical, factual conclusions from those observations.
  • Monk, on the other hand, while also a reasonably intelligent guy, is highly neurotic, wracked by unresolved trauma and profound grief. As both a consulting job and a coping mechanism, he makes a habit of erratically wandering into crime scenes, and, driven by a carefully managed jenga tower of mental illnesses, leverages his dual inabilities to solve crimes. First, he is unable to filter out apparently inconsequential details, building up a mental rat’s nest of trivia about the problem; second, he is unable to let go of any minor incongruity, obsessively ruminating on the collection of facts until they all make sense in a consistent timeline.

Perhaps surprisingly, this tendency serves both this fictional wretch of a detective, and myself, reasonably well. I find annoying incongruities in abstractions and I fidget and fiddle with them until I end up building something that a lot of people like, or perhaps something that a smaller number of people get really excited about. At worst, at least I eventually understand what’s going on. This is a self-soothing activity but it turns out that, managed properly, it can very effectively soothe others as well.

All that brings us to today’s topic, which is an incongruity I cannot smooth out or fit into a logical framework to make sense. I am, somewhat reluctantly, a genAI skeptic. However, I am, even more reluctantly, exposed to genAI Discourse every damn minute of every damn day. It is relentless, inescapable, and exhausting.

This preamble about personality should hopefully help you, dear reader, to understand how I usually address problematical ideas by thinking and thinking and fidgeting with them until I manage to write some words — or perhaps a new open source package — that logically orders the ideas around it in a way which allows my brain to calm down and let it go, and how that process is important to me.

In this particular instance, however, genAI has defeated me. I cannot make it make sense, but I need to stop thinking about it anyway. It is too much and I need to give up.

My goal with this post is not to convince anyone of anything in particular — and we’ll get to why that is a bit later — but rather:

  1. to set out my current understanding in one place, including all the various negative feelings which are still bothering me, so I can stop repeating it elsewhere,
  2. to explain why I cannot build a case that I think should be particularly convincing to anyone else, particularly to someone who actively disagrees with me,
  3. in so doing, to illustrate why I think the discourse is so fractious and unresolvable, and finally
  4. to give myself, and hopefully by proxy to give others in the same situation, permission to just peace out of this nightmare quagmire corner of the noosphere.

But first, just because I can’t prove that my interlocutors are Wrong On The Internet, doesn’t mean I won’t explain why I feel like they are wrong.

The Anti-Antis

Most recently, at time of writing, there have been a spate of “the genAI discourse is bad” articles, almost exclusively written from the perspective of, not boosters exactly, but pragmatically minded (albeit concerned) genAI users, wishing for the skeptics to be more pointed and accurate in our critiques. This is anti-anti-genAI content.

I am not going to link to any of these, because, as part of their self-fulfilling prophecy about the “genAI discourse”, they’re also all bad.

Mostly, however, they had very little worthwhile to respond to because they were straw-manning their erstwhile interlocutors. They are all getting annoyed at “bad genAI criticism” while failing to engage with — and often failing to even mention — most of the actual substance of any serious genAI criticism. At least, any of the criticism that I’ve personally read.

I understand wanting to avoid a callout or Gish-gallop culture and just express your own ideas. So, I understand that they didn’t link directly to particular sources or go point-by-point on anyone else’s writing. Obviously I get it, since that’s exactly what this post is doing too.

But if you’re going to talk about how bad the genAI conversation is, without even mentioning huge categories of problem like “climate impact” or “disinformation”1 even once, I honestly don’t know what conversation you’re even talking about. This is peak “make up a guy to get mad at” behavior, which is especially confusing in this circumstance, because there’s an absolutely huge crowd of actual people that you could already be mad at.

The people writing these pieces have historically seemed very thoughtful to me. Some of them I know personally. It is worrying to me that their critical thinking skills appear to have substantially degraded specifically after spending a bunch of time intensely using this technology which I believe has a scary risk of degrading one’s critical thinking skills. Correlation is not causation or whatever, and sure, from a rhetorical perspective this is “post hoc ergo propter hoc” and maybe a little “ad hominem” for good measure, but correlation can still be concerning.

Yet, I cannot effectively respond to these folks, because they are making a practical argument that I cannot, despite my best efforts, find compelling evidence to refute categorically. My experiences of genAI are all extremely bad, but that is barely even anecdata. Their experiences are neutral-to-positive. Little scientific data exists. How to resolve this?2

The Aesthetics

As I begin to state my own position, let me lead with this: my factual analysis of genAI is hopelessly negatively biased. I find the vast majority of the aesthetic properties of genAI to be intensely unpleasant.

I have been trying very hard to correct for this bias, to try to pay attention to the facts and to have a clear-eyed view of these systems’ capabilities. But the feelings are visceral, and the effort to compensate is tiring. It is, in fact, the desire to stop making this particular kind of effort that has me writing up this piece and trying to take an intentional break from the subject, despite its intense relevance.

When I say its “aesthetic qualities” are unpleasant, I don’t just mean the aesthetic elements of output of genAIs themselves. The aesthetic quality of genAI writing, visual design, animation and so on, while mostly atrocious, is also highly variable. There are cherry-picked examples which look… fine. Maybe even good. For years now, there have been, famously, literally award-winning aesthetic outputs of genAI3.

While I am ideologically predisposed to see any “good” genAI art as accruing the benefits of either a survivorship bias from thousands of terrible outputs or simple plagiarism rather than its own inherent quality, I cannot deny that in many cases it is “good”.

However, I am not just talking about the product, but the process; the aesthetic experience of interfacing with the genAI system itself, rather than the aesthetic experience of the outputs of that system.

I am not a visual artist and I am not really a writer4, particularly not a writer of fiction or anything else whose experience is primarily aesthetic. So I will speak directly to the experience of software development.

I have seen very few successful examples of using genAI to produce whole, working systems. There are no shortage of highly public miserable failures, particularly from the vendors of these systems themselves, where the outputs are confused, self-contradictory, full of subtle errors and generally unusable. While few studies exist, it sure looks like this is an automated way of producing a Net Negative Productivity Programmer, throwing out chaff to slow down the rest of the team.5

Juxtapose this with my aforementioned psychological motivations, to wit, I want to have everything in the computer be orderly and make sense, I’m sure most of you would have no trouble imagining that sitting through this sort of practice would make me extremely unhappy.

Despite this plethora of negative experiences, executives are aggressively mandating the use of AI6. It looks like without such mandates, most people will not bother to use such tools, so the executives will need muscular policies to enforce its use.7

Being forced to sit and argue with a robot while it struggles and fails to produce a working output, while you have to rewrite the code at the end anyway, is incredibly demoralizing. This is the kind of activity that activates every single major cause of burnout at once.

But, at least in that scenario, the thing ultimately doesn’t work, so there’s a hope that after a very stressful six month pilot program, you can go to management with a pile of meticulously collected evidence, and shut the whole thing down.

I am inclined to believe that, in fact, it doesn’t work well enough to be used this way, and that we are going to see a big crash. But that is not the most aesthetically distressing thing. The most distressing thing is that maybe it does work; if not well enough to actually do the work, at least ambiguously enough to fool the executives long-term.

This project, in particular, stood out to me as an example. Its author, a self-professed “AI skeptic” who “thought LLMs were glorified Markov chain generators that didn’t actually understand code and couldn’t produce anything novel”, did a green-field project to test this hypothesis.

Now, this particular project is not totally inconsistent with a world in which LLMs cannot produce anything novel. One could imagine that, out in the world of open source, perhaps there is enough “OAuth provider written in TypeScript” blended up into the slurry of “borrowed8” training data that the minor constraint of “make it work on Cloudflare Workers” is a small tweak9. It is not fully dispositive of the question of the viability of “genAI coding”.

But it is a data point related to that question, and thus it did make me contend with what might happen if it were actually a fully demonstrative example. I reviewed the commit history, as the author suggested. For the sake of argument, I tried to ask myself if I would like working this way. Just for clarity on this question, I wanted to suspend judgement about everything else; assuming:

  • the model could be created with ethically, legally, voluntarily sourced training data
  • its usage involved consent from labor rather than authoritarian mandates
  • sensible levels of energy expenditure, with minimal CO2 impact
  • it is substantially more efficient to work this way than to just write the code yourself

and so on, and so on… would I like to use this magic robot that could mostly just emit working code for me? Would I use it if it were free, in all senses of the word?

No. I absolutely would not.

I found the experience of reading this commit history and imagining myself using such a tool — without exaggeration — nauseating.

Unlike many programmers, I love code review. I find that it is one of the best parts of the process of programming. I can help people learn, and develop their skills, and learn from them, and appreciate the decisions they made, develop an impression of a fellow programmer’s style. It’s a great way to build a mutual theory of mind.

Of course, it can still be really annoying; people make mistakes, often can’t see things I find obvious, and in particular when you’re reviewing a lot of code from a lot of different people, you often end up having to repeat explanations of the same mistakes. So I can see why many programmers, particularly those more introverted than I am, hate it.

But, ultimately, when I review their code and work hard to provide clear and actionable feedback, people learn and grow and it’s worth that investment in inconvenience.

The process of coding with an “agentic” LLM appears to be the process of carefully distilling all the worst parts of code review, and removing and discarding all of its benefits.

The lazy, dumb, lying robot asshole keeps making the same mistakes over and over again, never improving, never genuinely reacting, always obsequiously pretending to take your feedback on board.

Even when it “does” actually “understand” and manages to load your instructions into its context window, 200K tokens later it will slide cleanly out of its memory and you will have to say it again.

All the while, it is attempting to trick you. It gets most things right, but it consistently makes mistakes in the places that you are least likely to notice. In places where a person wouldn’t make a mistake. Your brain keeps trying to develop a theory of mind to predict its behavior but there’s no mind there, so it always behaves infuriatingly randomly.

I don’t think I am the only one who feels this way.

The Affordances

Whatever our environments afford, we tend to do more of. Whatever they resist, we tend to do less of. So in a world where we were all writing all of our code and emails and blog posts and texts to each other with LLMs, what do they afford that existing tools do not?

As a weirdo who enjoys code review, I also enjoy process engineering. The central question of almost all process engineering is to continuously ask: how shall we shape our tools, to better shape ourselves?

LLMs are an affordance for producing more text, faster. How is that going to shape us?

Again arguing in the alternative here, assuming the text is free from errors and hallucinations and whatever, it’s all correct and fit for purpose, that means it reduces the pain of circumstances where you have to repeat yourself. Less pain! Sounds great; I don’t like pain.

Every codebase has places where you need boilerplate. Every organization has defects in its information architecture that require repetition of certain information rather than a link back to the authoritative source of truth. Often, these problems persist for a very long time, because it is difficult to overcome the institutional inertia required to make real progress rather than going along with the status quo. But this is often where the highest-value projects can be found. Where there’s muck, there’s brass.

The process-engineering function of an LLM, therefore, is to prevent fundamental problems from ever getting fixed, to reward the rapid-fire overwhelm of infrastructure teams with an immediate, catastrophic cascade of legacy code that is now much harder to delete than it is to write.


There is a scene in Game of Thrones where Khal Drogo kills himself. He does so by replacing a stinging, burning, therapeutic antiseptic wound dressing with some cool, soothing mud. The mud felt nice, addressed the immediate pain, removed the discomfort of the antiseptic, and immediately gave him a lethal infection.

The pleasing feeling of immediate progress when one prompts an LLM to solve some problem feels like cool mud on my brain.

The Economics

We are in the middle of a mania around this technology. As I have written about before, I believe the mania will end. There will then be a crash, and a “winter”. But, as I may not have stressed sufficiently, this crash will be the biggest of its kind — so big, that it is arguably not of a kind at all. The level of investment in these technologies is bananas and the possibility that the investors will recoup their investment seems close to zero. Meanwhile, that cost keeps going up, and up, and up.

Others have reported on this in detail10, and I will not reiterate that all here, but in addition to being a looming and scary industry-wide (if we are lucky; more likely it’s probably “world-wide”) economic threat, it is also going to drive some panicked behavior from management.

Panicky behavior from management stressed that their idea is not panning out is, famously, the cause of much human misery. I expect that even in the “good” scenario, where some profit is ultimately achieved, will still involve mass layoffs rocking the industry, panicked re-hiring, destruction of large amounts of wealth.

It feels bad to think about this.

The Energy Usage

For a long time I believed that the energy impact was overstated. I am even on record, about a year ago, saying I didn’t think the energy usage was a big deal. I think I was wrong about that.

It initially seemed like it was letting regular old data centers off the hook. But recently I have learned that, while the numbers are incomplete because the vendors aren’t sharing information, they’re also extremely bad.11

I think there’s probably a version of this technology that isn’t a climate emergency nightmare, but that’s not the version that the general public has access to today.

The Educational Impact

LLMs are making academic cheating incredibly rampant.12

Not only is it so common as to be nearly universal, it’s also extremely harmful to learning.13

For learning, genAI is a forklift at the gym.

To some extent, LLMs are simply revealing a structural rot within education and academia that has been building for decades if not centuries. But it was within those inefficiencies and the inconveniences of the academic experience that real learning was, against all odds, still happening in schools.

LLMs produce a frictionless, streamlined process where students can effortlessly glide through the entire credential, learning nothing. Once again, they dull the pain without regard to its cause.

This is not good.

The Invasion of Privacy

This is obviously only a problem with the big cloud models, but then, the big cloud models are the only ones that people actually use. If you are having conversations about anything private with ChatGPT, you are sending all of that private information directly to Sam Altman, to do with as he wishes.

Even if you don’t think he is a particularly bad guy, maybe he won’t even create the privacy nightmare on purpose. Maybe he will be forced to do so as a result of some bizarre kafkaesque accident.14

Imagine the scenario, for example, where a woman is tracking her cycle and uploading the logs to ChatGPT so she can chat with it about a health concern. Except, surprise, you don’t have to imagine, you can just search for it, as I have personally, organically, seen three separate women on YouTube, at least one of whom lives in Texas, not only do this on camera but recommend doing this to their audiences.

Citation links withheld on this particular claim for hopefully obvious reasons.

I assure you that I am neither particularly interested in menstrual products nor genAI content, and if I am seeing this more than once, it is probably a distressingly large trend.

The Stealing

The training data for LLMs is stolen. I don’t mean like “pirated” in the sense where someone illicitly shares a copy they obtained legitimately; I mean their scrapers are ignoring both norms15 and laws16 to obtain copies under false pretenses, destroying other people’s infrastructure17 in the process.

The Fatigue

I have provided references to numerous articles outlining rhetorical and sometimes data-driven cases for the existence of certain properties and consequences of genAI tools. But I can’t prove any of these properties, either at a point in time or as a durable ongoing problem.

The LLMs themselves are simply too large to model with the usual kind of heuristics one would use to think about software. I’d sooner be able to predict the physics of dice in a casino than a 2 trillion parameter neural network. They resist scientific understanding, not just because of their size and complexity, but because unlike a natural phenomenon (which could of course be considerably larger and more complex) they resist experimentation.

The first form of genAI resistance to experiment is that every discussion is a motte-and-bailey. If I use a free model and get a bad result I’m told it’s because I should have used the paid model. If I get a bad result with ChatGPT I should have used Claude. If I get a bad result with a chatbot I need to start using an agentic tool. If an agentic tool deletes my hard drive by putting os.system(“rm -rf ~/”) into sitecustomize.py then I guess I should have built my own MCP integration with a completely novel heretofore never even considered security sandbox or something?

What configuration, exactly, would let me make a categorical claim about these things? What specific methodological approach should I stick to, to get reliably adequate prompts?

For the record though, if the idea of the free models is that they are going to be provocative demonstrations of the impressive capabilities of the commercial models, and the results are consistently dogshit, I am finding it increasingly hard to care how much better the paid ones are supposed to be, especially since the “better”-ness cannot really be quantified in any meaningful way.

The motte-and-bailey doesn’t stop there though. It’s a war on all fronts. Concerned about energy usage? That’s OK, you can use a local model. Concerned about infringement? That’s okay, somewhere, somebody, maybe, has figured out how to train models consensually18. Worried about the politics of enriching the richest monsters in the world? Don’t worry, you can always download an “open source” model from Hugging Face. It doesn’t matter that many of these properties are mutually exclusive and attempting to fix one breaks two others; there’s always an answer, the field is so abuzz with so many people trying to pull in so many directions at once that it is legitimately difficult to understand what’s going on.

Even here though, I can see that characterizing everything this way is unfair to a hypothetical sort of person. If there is someone working at one of these thousands of AI companies that have been springing up like toadstools after a rain, and they really are solving one of these extremely difficult problems, how can I handwave that away? We need people working on problems, that’s like, the whole point of having an economy. And I really don’t like shitting on other people’s earnest efforts, so I try not to dismiss whole fields. Given how AI has gotten into everything, in a way that e.g. cryptocurrency never did, painting with that broad a brush inevitably ends up tarring a bunch of stuff that isn’t even really AI at all.

The second form of genAI resistance to experiment is the inherent obfuscation of productization. The models themselves are already complicated enough, but the products that are built around the models are evolving extremely rapidly. ChatGPT is not just a “model”, and with the rapid19 deployment of Model Context Protocol tools, the edges of all these things will blur even further. Every LLM is now just an enormous unbounded soup of arbitrary software doing arbitrary whatever. How could I possibly get my arms around that to understand it?

The Challenge

I have woefully little experience with these tools.

I’ve tried them out a little bit, and almost every single time the result has been a disaster that has not made me curious to push further. Yet, I keep hearing from all over the industry that I should.

To some extent, I feel like the motte-and-bailey characterization above is fair; if the technology itself can really do real software development, it ought to be able to do it in multiple modalities, and there’s nothing anyone can articulate to me about GPT-4o which puts it in a fundamentally different class than GPT-3.5.

But, also, I consistently hear that the subjective experience of using the premium versions of the tools is actually good, and the free ones are actually bad.

I keep struggling to find ways to try them “the right way”, the way that people I know and otherwise respect claim to be using them, but I haven’t managed to do so in any meaningful way yet.

I do not want to be using the cloud versions of these models with their potentially hideous energy demands; I’d like to use a local model. But there is obviously not a nicely composed way to use local models like this.

Since there are apparently zero models with ethically-sourced training data, and litigation is ongoing20 to determine the legal relationships of training data and outputs, even if I can be comfortable with some level of plagiarism on a project, I don’t feel that I can introduce the existential legal risk into other people’s infrastructure, so I would need to make a new project.

Others have differing opinions of course, including some within my dependency chain, which does worry me, but I still don’t feel like I can freely contribute further to the problem; it’s going to be bad enough to unwind any impact upstream. Even just for my own sake, I don’t want to make it worse.

This especially presents a problem because I have way too much stuff going on already. A new project is not practical.

Finally, even if I did manage to satisfy all of my quirky21 constraints, would this experiment really be worth anything? The models and tools that people are raving about are the big, expensive, harmful ones. If I proved to myself yet again that a small model with bad tools was unpleasant to use, I wouldn’t really be addressing my opponents’ views.

I’m stuck.

The Surrender

I am writing this piece to make my peace with giving up on this topic, at least for a while. While I do idly hope that some folks might find bits of it convincing, and perhaps find ways to be more mindful with their own usage of genAI tools, and consider the harm they may be causing, that’s not actually the goal. And that is not the goal because it is just so much goddamn work to prove.

Here, I must return to my philosophical hobbyhorse of sprachspiel. In this case, specifically to use it as an analytical tool, not just to understand what I am trying to say, but what the purpose for my speech is.

The concept of sprachspiel is most frequently deployed to describe the goal of the language game being played, but in game theory, that’s only half the story. Speech — particularly rigorously justified speech — has a cost, as well as a benefit. I can make shit up pretty easily, but if I want to do anything remotely like scientific or academic rigor, that cost can be astronomical. In the case of developing an abstract understanding of LLMs, the cost is just too high.

So what is my goal, then? To be king Canute, standing astride the shore of “tech”, whatever that is, commanding the LLM tide not to rise? This is a multi-trillion dollar juggernaut.

Even the rump, loser, also-ran fragment of it has the power to literally suffocate us in our homes22 if they so choose, completely insulated from any consequence. If the power curve starts there, imagine what the winners in this industry are going to be capable of, irrespective of the technology they’re building - just with the resources they have to hand. Am I going to write a blog post that can rival their propaganda apparatus? Doubtful.

Instead, I will just have to concede that maybe I’m wrong. I don’t have the skill, or the knowledge, or the energy, to demonstrate with any level of rigor that LLMs are generally, in fact, hot garbage. Intellectually, I will have to acknowledge that maybe the boosters are right. Maybe it’ll be OK.

Maybe the carbon emissions aren’t so bad. Maybe everybody is keeping them secret in ways that they don’t for other types of datacenter for perfectly legitimate reasons. Maybe the tools really can write novel and correct code, and with a little more tweaking, it won’t be so difficult to get them to do it. Maybe by the time they become a mandatory condition of access to developer tools, they won’t be miserable.

Sure, I even sincerely agree, intellectual property really has been a pretty bad idea from the beginning. Maybe it’s OK that we’ve made an exception to those rules. The rules were stupid anyway, so what does it matter if we let a few billionaires break them? Really, everybody should be able to break them (although of course, regular people can’t, because we can’t afford the lawyers to fight off the MPAA and RIAA, but that’s a problem with the legal system, not tech).

I come not to praise “AI skepticism”, but to bury it.

Maybe it really is all going to be fine. Perhaps I am simply catastrophizing; I have been known to do that from time to time. I can even sort of believe it, in my head. Still, even after writing all this out, I can’t quite manage to believe it in the pit of my stomach.

Unfortunately, that feeling is not something that you, or I, can argue with.


Acknowledgments

Thank you to my patrons. Normally, I would say, “who are supporting my writing on this blog”, but in the case of this piece, I feel more like I should apologize to them for this than to thank them; these thoughts have been preventing me from thinking more productive, useful things that I actually have relevant skill and expertise in; this felt more like a creative blockage that I just needed to expel than a deliberately written article. If you like what you’ve read here and you’d like to read more of it, well, too bad; I am sincerely determined to stop writing about this topic. But, if you’d like to read more stuff like other things I have written, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!


  1. And yes, disinformation is still an issue even if you’re “just” using it for coding. Even sidestepping the practical matter that technology is inherently political, validation and propagation of poor technique is a form of disinformation

  2. I can’t resolve it, that’s the whole tragedy here, but I guess we have to pretend I will to maintain narrative momentum here. 

  3. The story in Creative Bloq, or the NYT, if you must 

  4. although it’s not for lack of trying, Jesus, look at the word count on this 

  5. These are sometimes referred to as “10x” programmers, because they make everyone around them 10x slower. 

  6. Douglas B. Laney at Forbes, Viral Shopify CEO Manifesto Says AI Now Mandatory For All Employees 

  7. The National CIO Review, AI Mandates, Minimal Use: Closing the Workplace Readiness Gap 

  8. Matt O’Brien at the AP, Reddit sues AI company Anthropic for allegedly ‘scraping’ user comments to train chatbot Claude 

  9. Using the usual tricks to find plagiarism like searching for literal transcriptions of snippets of training data did not pull up anything when I tried, but then, that’s not how LLMs work these days, is it? If it didn’t obfuscate the plagiarism it wouldn’t be a very good plagiarism-obfuscator. 

  10. David Gerard at Pivot to AI, “Microsoft and AI: spending billions to make millions”, Edward Zitron at Where’s Your Ed At, “The Era Of The Business Idiot”, both sobering reads 

  11. James O’Donnell and Casey Crownhart at the MIT Technology Review, We did the math on AI’s energy footprint. Here’s the story you haven’t heard. 

  12. Lucas Ropek at Gizmodo, AI Cheating Is So Out of Hand In America’s Schools That the Blue Books Are Coming Back 

  13. James D. Walsh at the New York Magazine Intelligencer, Everyone Is Cheating Their Way Through College 

  14. Ashley Belanger at Ars Technica, OpenAI slams court order to save all ChatGPT logs, including deleted chats 

  15. Ashley Belanger at Ars Technica, AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt 

  16. Blake Brittain at Reuters, Judge in Meta case warns AI could ‘obliterate’ market for original works 

  17. Xkeeper, TCRF has been getting DDoSed 

  18. Kate Knibbs at Wired, Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content 

  19. and, I should note, extremely irresponsible 

  20. Porter Anderson at Publishing Perspectives, Meta AI Lawsuit: US Publishers File Amicus Brief 

  21. It feels bizarre to characterize what feel like baseline ethical concerns this way, but the fact remains that within the “genAI community”, this places me into a tiny and obscure minority. 

  22. Ariel Wittenberg for Politico, ‘How come I can’t breathe?’: Musk’s data company draws a backlash in Memphis 

A Bigger Database

We do what we can, because we must.

A Database File

When I was 10 years old, and going through a fairly difficult time, I was lucky enough to come into the possession of a piece of software called Claris FileMaker Pro™.

FileMaker allowed its users to construct arbitrary databases, and to associate their tables with a customized visual presentation. FileMaker also had a rudimentary scripting language, which would allow users to imbue these databases with behavior.

As a mentally ill pre-teen, lacking a sense of control over anything or anyone in my own life, including myself, I began building a personalized database to catalogue the various objects and people in my immediate vicinity. If one were inclined to be generous, one might assess this behavior and say I was systematically taxonomizing the objects in my life and recording schematized information about them.

As I saw it at the time, if I collected the information, I could always use it later, to answer questions that I might have. If I didn’t collect it, then what if I needed it? Surely I would regret it! Thus I developed a categorical imperative to spend as much of my time as possible collecting and entering data about everything that I could reasonably arrange into a common schema.

Having thus summoned this specter of regret for all lost data-entry opportunities, it was hard to dismiss. We might label it “Claris’s Basilisk”, for obvious reasons.

Therefore, a less-generous (or more clinically-minded) observer might have replaced the word “systematically” with “obsessively” in the assessment above.

I also began writing what scripts were within my marginal programming abilities at the time, just because I could: things like computing the sum of every street number of every person in my address book. Why was this useful? Wrong question: the right question is “was it possible” to which my answer was “yes”.

If I was obliged to collect all the information which I could observe — in case it later became interesting — I was similarly obliged to write and run every program I could. It might, after all, emit some other interesting information.

I was an avid reader of science fiction as well.

I had this vague sense that computers could kind of think. This resulted in a chain of reasoning that went something like this:

  1. human brains are kinda like computers,
  2. the software running in the human brain is very complex,
  3. I could only write simple computer programs, but,
  4. when you really think about it, a “complex” program is just a collection of simpler programs

Therefore: if I just kept collecting data, collecting smaller programs that could solve specific problems, and connecting them all together in one big file, eventually the database as a whole would become self-aware and could solve whatever problem I wanted. I just needed to be patient; to “keep grinding” as the kids would put it today.

I still feel like this is an understandable way to think — if you are a highly depressed and anxious 10-year-old in 1990.

Anyway.


35 Years Later

OpenAI is a company that produces transformer architecture machine learning generative AI models; their current generation was trained on about 10 trillion words, obtained in a variety of different ways from a large variety of different, unrelated sources.

A few days ago, on March 26, 2025 at 8:41 AM Pacific Time, Sam Altman took to “X™, The Everything App™,” and described the trajectory of his career of the last decade at OpenAI as, and I quote, a “grind for a decade trying to help make super-intelligence to cure cancer or whatever” (emphasis mine).

I really, really don’t want to become a full-time AI skeptic, and I am not an expert here, but I feel like I can identify a logically flawed premise when I see one.

This is not a system-design strategy. It is a trauma response.

You can’t cure cancer “or whatever”. If you want to build a computer system that does some thing, you actually need to hire experts in that thing, and have them work to both design and validate that the system is fit for the purpose of that thing.


Aside: But... are they, though?

I am not an oncologist; I do not particularly want to be writing about the specifics here, but, if I am going to make a claim like “you can’t cure cancer this way” I need to back it up.

My first argument — and possibly my strongest — is that cancer is not cured.

QED.

But I guess, to Sam’s credit, there is at least one other company partnering with OpenAI to do things that are specifically related to cancer. However, that company is still in a self-described “initial phase” and it’s not entirely clear that it is going to work out very well.

Almost everything I can find about it online was from a PR push in the middle of last year, so it all reads like a press release. I can’t easily find any independently-verified information.

A lot of AI hype is like this. A promising demo is delivered; claims are made that surely if the technology can solve this small part of the problem now, within 5 years surely it will be able to solve everything else as well!

But even the light-on-content puff-pieces tend to hedge quite a lot. For example, as the Wall Street Journal quoted one of the users initially testing it (emphasis mine):

The most promising use of AI in healthcare right now is automating “mundane” tasks like paperwork and physician note-taking, he said. The tendency for AI models to “hallucinate” and contain bias presents serious risks for using AI to replace doctors. Both Color’s Laraki and OpenAI’s Lightcap are adamant that doctors be involved in any clinical decisions.

I would probably not personally characterize “‘mundane’ tasks like paperwork and … note-taking” as “curing cancer”. Maybe an oncologist could use some code I developed too; even if it helped them, I wouldn’t be stealing valor from them on the curing-cancer part of their job.

Even fully giving it the benefit of the doubt that it works great, and improves patient outcomes significantly, this is medical back-office software. It is not super-intelligence.

It would not even matter if it were “super-intelligence”, whatever that means, because “intelligence” is not how you do medical care or medical research. It’s called “lab work” not “lab think”.

To put a fine point on it: biomedical research fundamentally cannot be done entirely by reading papers or processing existing information. It cannot even be done by testing drugs in computer simulations.

Biological systems are enormously complex, and medical research on new therapies inherently requires careful, repeated empirical testing to validate the correspondence of existing research with reality. Not “an experiment”, but a series of coordinated experiments that all test the same theoretical model. The data (which, in an LLM context, is “training data”) might just be wrong; it may not reflect reality, and the only way to tell is to continuously verify it against reality.

Previous observations can be tainted by methodological errors, by data fraud, and by operational mistakes by practitioners. If there were a way to do verifiable development of new disease therapies without the extremely expensive ladder going from cell cultures to animal models to human trials, we would already be doing it, and “AI” would just be an improvement to efficiency of that process. But there is no way to do that and nothing about the technologies involved in LLMs is going to change that fact.


Knowing Things

The practice of science — indeed any practice of the collection of meaningful information — must be done by intentionally and carefully selecting inclusion criteria, methodically and repeatedly curating our data, building a model that operates according to rules we understand and can verify, and verifying the data itself with repeated tests against nature. We cannot just hoover up whatever information happens to be conveniently available with no human intervention and hope it resolves to a correct model of reality by accident. We need to look where the keys are, not where the light is.

Piling up more and more information in a haphazard and increasingly precarious pile will not allow us to climb to the top of that pile, all the way to heaven, so that we can attack and dethrone God.

Eventually, we’ll just run out of disk space, and then lose the database file when the family gets a new computer anyway.


Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! Special thanks also to Itamar Turner-Trauring and Thomas Grainger for pre-publication feedback on this article; any errors of course remain my own.

A Grand Unified Theory of the AI Hype Cycle

I’m sorry, but as an AI language model, I cannot repeat history exactly. However, I can rhyme with it.

The Cycle

The history of AI goes in cycles, each of which looks at least a little bit like this:

  1. Scientists do some basic research and develop a promising novel mechanism, N. One important detail is that N has a specific name; it may or may not be carried out under the general umbrella of “AI research” but it is not itself “AI”. N always has a few properties, but the most common and salient one is that it initially tends to require about 3x the specifications of the average computer available to the market at the time; i.e., it requires three times as much RAM, CPU, and secondary storage as is shipped in the average computer.
  2. Research and development efforts begin to get funded on the hypothetical potential of N. Because N is so resource intensive, this funding is used to purchase more computing capacity (RAM, CPU, storage) for the researchers, which leads to immediate results, as the technology was previously resource constrained.
  3. Initial successes in the refinement of N hint at truly revolutionary possibilities for its deployment. These revolutionary possibilities include a dimension of cognition that has not previously been machine-automated.
  4. Leaders in the field of this new development — specifically leaders, like lab administrators, corporate executives, and so on, as opposed to practitioners like engineers and scientists — recognize the sales potential of referring to this newly-“thinking” machine as “Artificial Intelligence”, often speculating about science-fictional levels of societal upheaval (specifically in a period of 5-20 years), now that the “hard problem” of machine cognition has been solved by N.
  5. Other technology leaders, in related fields, also recognize the sales potential and begin adopting elements of the novel mechanism to combine with their own areas of interest, also referring to their projects as “AI” in order to access the pool of cash that has become available to that label. In the course of doing so, they incorporate N in increasingly unreasonable ways.
  6. The scope of “AI” balloons to include pretty much all of computing technology. Some things that do not even include N start getting labeled this way.
  7. There’s a massive economic boom within the field of “AI”, where “the field of AI” means any software development that is plausibly adjacent to N in any pitch deck or grant proposal.
  8. Roughly 3 years pass, while those who control the flow of money gradually become skeptical of the overblown claims that recede into the indeterminate future, where N precipitates a robot apocalypse somewhere between 5 and 20 years away. Crucially, because of the aforementioned resource-intensiveness, the gold owners skepticism grows slowly over this period, because their own personal computers or the ones they have access to do not have the requisite resources to actually run the technology in question and it is challenging for them to observe its performance directly. Public critics begin to appear.
  9. Competent practitioners — not leaders — who have been successfully using N in research or industry quietly stop calling their tools “AI”, or at least stop emphasizing the “artificial intelligence” aspect of them, and start getting funding under other auspices. Whatever N does that isn’t “thinking” starts getting applied more seriously as its limitations are better understood. Users begin using more specific terms to describe the things they want, rather than calling everything “AI”.
  10. Thanks to the relentless march of Moore’s law, the specs of the average computer improve. The CPU, RAM, and disk resources required to actually run the software locally come down in price, and everyone upgrades to a new computer that can actually run the new stuff.
  11. The investors and grant funders update their personal computers, and they start personally running the software they’ve been investing in. Products with long development cycles are finally released to customers as well, but they are disappointing. The investors quietly get mad. They’re not going to publicly trash their own investments, but they stop loudly boosting them and they stop writing checks. They pivot to biotech for a while.
  12. The field of “AI” becomes increasingly desperate, as it becomes the label applied to uses of N which are not productive, since the productive uses are marketed under their application rather than their mechanism. Funders lose their patience, the polarity of the “AI” money magnet rapidly reverses. Here, the AI winter is finally upon us.
  13. The remaining AI researchers who still have funding via mechanisms less vulnerable to hype, who are genuinely thinking about automating aspects of cognition rather than simply N, quietly move on to the next impediment to a truly thinking machine, and in the course of doing so, they discover a new novel mechanism, M. Go to step 1, with M as the new N, and our current N as a thing that is now “not AI”, called by its own, more precise name.

The History

A non-exhaustive list of previous values of N have been:

  • Neural networks and symbolic reasoning in the 1950s.
  • Theorem provers in the 1960s.
  • Expert systems in the 1980s.
  • Fuzzy logic and hidden Markov models in the 1990s.
  • Deep learning in the 2010s.

Each of these cycles has been larger and lasted longer than the last, and I want to be clear: each cycle has produced genuinely useful technology. It’s just that each follows the progress of a sigmoid curve that everyone mistakes for an exponential one. There is an initial burst of rapid improvement, followed by gradual improvement, followed by a plateau. Initial promises imply or even state outright “if we pour more {compute, RAM, training data, money} into this, we’ll get improvements forever!” The reality is always that these strategies inevitably have a limit, usually one that does not take too long to find.

Where Are We Now?

So where are we in the current hype cycle?

Some Qualifications

History does not repeat itself, but it does rhyme. This hype cycle is unlike any that have come before in various ways. There is more money involved now. It’s much more commercial; I had to phrase things above in very general ways because many previous hype waves have been based on research funding, some really being exclusively a phenomenon at one department in DARPA, and not, like, the entire economy.

I cannot tell you when the current mania will end and this bubble will burst. If I could, you’d be reading this in my $100,000 per month subscribers-only trading strategy newsletter and not a public blog. What I can tell you is that computers cannot think, and that the problems of the current instantation of the nebulously defined field of “AI” will not all be solved within “5 to 20 years”.


Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. Special thanks also to Ben Chatterton for a brief pre-publication review; any errors remain my own. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor! I am also available for consulting work if you think your organization could benefit from expertise on topics like “what are we doing that history will condemn us for”. Or, you know, Python programming.