Why Are Large Language Models So Terrible at Video Games?


I tried to beat a video game with a large language model once. Not metaphorically. Not in some abstract “AI plays chess” sense. I mean I sat there, controller in hand, screen glowing, and fed instructions into a system that supposedly understands language, logic, strategy, and—depending on who you ask—the trajectory of civilization itself.

It could explain the entire plot of the game in flawless prose. It could outline optimal strategies like a smug prima donna of Wikipedia entries. It could even tell me which boss I’d struggle with and why.

And then, when it came time to actually play?

It moved like a drunk ghost trapped in a Roomba.

That was the moment it hit me: large language models—these towering monuments of modern computation—are spectacularly bad at video games. Not just a little clumsy. Not “learning curve” bad. I’m talking walk-into-a-wall-for-thirty-seconds-while-explaining-the-wall’s-historical-significance bad.

So naturally, I had to ask the question: why?


The Illusion of Intelligence (aka: The TED Talk vs. The Controller)

Here’s the first thing I had to come to terms with: large language models don’t actually “think.” They perform something much weirder—and far less useful for gaming.

They predict the next word.

That’s it.

Strip away the hype, the venture capital, the keynote speeches, and the existential dread, and what you have is a system that is extraordinarily good at saying, “Given everything that’s been said so far, what comes next?”

Which is incredible… until you hand it a controller.

Because video games are not about predicting the next word. They’re about predicting the next moment. And those are wildly different beasts.
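If you want to see how simple that loop really is, here's a toy version of next-word prediction — bigram counts standing in for a trillion-parameter model. It's a deliberately crude sketch (real LLMs use transformers over subword tokens), but the core move is identical: look at what came before, emit the most probable continuation.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny
# corpus, then always emit the most frequent follower. The corpus and
# everything else here is made up for illustration.
corpus = "the boss swings the sword the boss swings again".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word: str) -> str:
    # "Given everything that's been said so far, what comes next?"
    return followers[word].most_common(1)[0][0]

print(predict_next("boss"))  # -> "swings"
```

Note what's missing from that loop: no screen, no clock, no consequences. Just text in, text out.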

When I’m playing a game, I’m not thinking:

“Ah yes, statistically speaking, the next most probable action is to press jump.”

I’m thinking:

“OH GOD THE FLOOR IS LAVA WHY IS THERE A DRAGON—”

And somehow, that chaotic, reactive, deeply human panic translates into better gameplay than a model trained on half the internet.


Time Exists (And LLMs Are Not Invited)

Video games are brutal in one specific way: they happen in real time.

You cannot pause a boss mid-swing and say:

“Hold on, I need to generate a response based on my training data.”

The boss does not care about your token limits. The boss does not respect your latency.

The boss will hit you.

Large language models, on the other hand, operate in a very polite universe where time is… optional. They wait for input. They process. They respond. Everything is neat, sequential, and suspiciously civilized.

Gaming is not civilized.

Gaming is:

  • Frame-perfect timing
  • Split-second reactions
  • Constant sensory overload
  • The emotional stability of a caffeinated squirrel

So when you plug a system built for calm, turn-based text prediction into a medium that demands twitch reflexes, you get something that feels like watching a philosopher try to swat a fly with a hardcover copy of Kant.

Technically impressive.

Practically useless.
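The arithmetic of that uselessness is brutal. A sketch, with hypothetical numbers — real model latency varies wildly, but even an optimistic figure loses you dozens of frames:

```python
# Why latency kills: the game advances every frame whether or not
# the "brain" has answered yet. Both numbers below are illustrative
# assumptions, not benchmarks.
FRAME_MS = 16            # ~60 fps: one frame every 16 ms
MODEL_LATENCY_MS = 900   # hypothetical time to generate one response

def frames_missed(latency_ms: int, frame_ms: int = FRAME_MS) -> int:
    # How many frames elapse while the model is still "thinking".
    return latency_ms // frame_ms

print(frames_missed(MODEL_LATENCY_MS))  # -> 56 frames of getting hit
```

For comparison, a human reaction time of roughly 200 ms costs about 12 frames. Slow by game standards, but the boss only swings once in that window, not four times.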


The Body Problem (Spoiler: There Is No Body)

Here’s something I didn’t appreciate until I watched an LLM try to “play” a game: it has no body.

And that matters more than you think.

When I play a game, I feel it:

  • The weight of a jump
  • The rhythm of movement
  • The subtle delay between input and response
  • The way a character accelerates or drifts

These are not things I consciously calculate. They’re things I internalize. My brain builds a model of the world through repetition, failure, and the occasional rage quit.

An LLM?

It has none of that.

No muscle memory.
No spatial intuition.
No sense of momentum.

You can tell it:

“Press jump when the platform reaches its peak.”

And it will nod, metaphorically speaking, and say:

“Yes, that is correct.”

And then it will jump three seconds late, directly into oblivion, while explaining why jumping is generally advantageous in platforming scenarios.

It’s like trying to teach someone to ride a bike using only a textbook and a PowerPoint presentation. At some point, gravity has to get involved.


The Map Is Not the Territory (And the LLM Lives in the Map)

Large language models are masters of description.

They can describe a game world beautifully:

  • The layout of a dungeon
  • The mechanics of a puzzle
  • The behavior of enemies

But here’s the problem: they live in the description of the world, not the world itself.

A human player navigates the territory. We see, react, adapt. We don’t need a paragraph explaining that a spike trap is dangerous—we just learned that the hard way five seconds ago.

The LLM, meanwhile, is stuck narrating:

“The player should avoid the spikes, as they cause damage.”

Yes. Thank you. I discovered that when my character turned into a kebab.

This disconnect becomes painfully obvious in dynamic situations. Games are not static. They change. Enemies move. Physics does weird things. The unexpected is the default setting.

And the LLM?

It’s trying to apply a static understanding to a dynamic system.

Which is like trying to win a boxing match by reciting the rules of boxing.


The Confidence Problem (It’s Wrong, But It’s So Sure About It)

One of my favorite (and by favorite, I mean deeply frustrating) aspects of large language models is their confidence.

They will tell you, with absolute certainty, how to beat a level.

They will outline strategies, highlight weaknesses, and deliver it all with the tone of someone who has already completed the game, unlocked the secret ending, and written a memoir about it.

And then you follow the advice.

And it fails.

Spectacularly.

Because the model isn’t actually testing its ideas. It’s generating them based on patterns in data. It doesn’t know if the strategy works—it knows that the strategy sounds like something that would work.

In gaming, this is a death sentence.

Because games are unforgiving. They don’t care how convincing your explanation is. They care if you pressed the right button at the right time.

And when the LLM confidently suggests a move that leads directly to your demise, you start to realize that confidence and competence are not the same thing.

Which, to be fair, explains a lot about the world beyond gaming as well.


The Feedback Loop (Or Lack Thereof)

When I play a game, I’m in a constant loop:

  1. Try something
  2. Fail
  3. Adjust
  4. Try again

This loop happens fast. Sometimes in seconds. Sometimes in milliseconds.

It’s messy. It’s inefficient. It’s deeply human.
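That four-step loop is almost embarrassingly simple to write down. Here's a sketch of it, with a made-up boss whose swing timing the player only discovers by dying:

```python
# The human feedback loop: try, fail, adjust, try again.
# The boss's timing is hypothetical; the player doesn't know it and
# learns only through repeated failure.
BOSS_SWING_FRAME = 30  # jumping before this frame gets you hit

def attempt(jump_frame: int) -> bool:
    return jump_frame >= BOSS_SWING_FRAME

jump_frame = 10  # first naive guess
deaths = 0
while not attempt(jump_frame):  # 1. try something
    deaths += 1                 # 2. fail
    jump_frame += 5             # 3. adjust, out of pure spite
                                # 4. try again (loop repeats)
print(deaths, jump_frame)  # -> 4 deaths, then the timing sticks
```

Four deaths, one working timing, zero retraining runs. The adjustment lives inside the player. For an LLM, that update has to come from somewhere outside the model.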

Large language models don’t live in that loop.

They don’t experience failure. They don’t feel the sting of losing progress. They don’t develop that quiet, simmering determination that comes from dying to the same boss seventeen times in a row.

They generate.

That’s it.

There’s no intrinsic drive to improve. No internal “I messed that up, let me try differently.” Any improvement has to be engineered externally—through training updates, reinforcement systems, or elaborate scaffolding.

Meanwhile, I’ve already learned the boss’s attack pattern out of pure spite.


The Interface Problem (Translation: Good Luck With That)

Even if you somehow solved all the above problems—timing, embodiment, learning—you still have one massive hurdle:

How do you even connect a language model to a game?

Games don’t speak English.

They speak:

  • Controller inputs
  • Frame data
  • Physics engines
  • Pixel states

So now you need a translator:

  • Something that converts game state into text
  • Something that converts text back into actions

And every layer of translation introduces delay, ambiguity, and error.
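To make the telephone game concrete, here's a sketch of those two translation layers. Every name and rule in it is hypothetical; the point is how much gets lost flattening a game state into a sentence and parsing free-form advice back into a button press:

```python
# Two translation layers: game state -> text, and text -> action.
# All field names, keywords, and button codes are illustrative.
def state_to_text(state: dict) -> str:
    # Layer 1: flatten a rich, continuous game state into a sentence.
    return f"Enemy at x={state['enemy_x']}, player at x={state['player_x']}."

def text_to_action(advice: str) -> str:
    # Layer 2: parse the model's free-form advice into a button press.
    advice = advice.lower()
    if "jump" in advice:
        return "PRESS_A"
    if "left" in advice:
        return "PRESS_LEFT"
    return "WALK_FORWARD"  # default: walk forward into certain doom

prompt = state_to_text({"enemy_x": 5, "player_x": 3})
# Suppose the model answers with something plausible but vague:
print(text_to_action("You should probably evade soon."))  # -> "WALK_FORWARD"
```

Notice that vague-but-reasonable advice falls straight through to the default. The model said something sensible; the character still walked into the enemy.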

It’s like playing a game of telephone where the final instruction is:

“Jump left to avoid the enemy,”

and what actually happens is:

“Walk forward into certain doom.”

At some point, you realize you’ve built an incredibly sophisticated system… whose primary achievement is making simple actions unnecessarily complicated.


The Overthinking Paradox

Here’s the part that really gets me.

Humans are not perfect gamers. We make mistakes constantly. We panic. We misjudge. We press the wrong button at the worst possible time.

And yet, we are still better at most games than large language models.

Why?

Because we don’t overthink in the same way.

We act.

We rely on instinct, pattern recognition, and a kind of embodied intuition that doesn’t require us to articulate every step.

The LLM, on the other hand, is trapped in a world where everything must be processed, framed, and generated.

It’s like the difference between:

  • Dancing to music
  • Writing a dissertation about dancing while the music plays

One of these will keep you alive in a rhythm-based game.

The other will get you eliminated before the intro finishes.


But Wait—They’re Supposed to Be Smart

This is where the hype machine starts to creak.

We’ve been told—repeatedly—that these models are intelligent. That they understand. That they can reason, plan, and solve complex problems.

And in many domains, that feels true.

Ask them to explain quantum physics? Impressive.
Ask them to summarize a legal document? Useful.
Ask them to write a 3000-word blog post with a mildly unhinged tone? Apparently, also very doable.

But ask them to play a video game?

Suddenly, the illusion cracks.

Because games expose something fundamental: intelligence is not just about language. It’s about interaction. Adaptation. Embodiment. Timing.

It’s about being in the world, not just describing it.

And right now, large language models are exceptional narrators of reality—but terrible participants.


The Inevitable Future (Because Of Course This Won’t Last)

Now, before we all get too comfortable laughing at our digital philosopher tripping over a Goomba, let’s be honest:

This won’t stay true forever.

People are already working on:

  • Reinforcement learning systems that learn through gameplay
  • Multimodal models that process visuals, actions, and feedback
  • Agents that operate continuously in real-time environments

Eventually, we’ll get systems that:

  • See the game
  • Understand the game
  • Act in the game

And when that happens?

They’ll probably be terrifyingly good.

Not just “beats you at chess” good.
More like “optimizes the game into something unrecognizable” good.

The same system that currently walks into walls will one day:

  • Discover strategies no human considered
  • Execute them flawlessly
  • And then explain, in perfect English, why you never stood a chance

Which is both exciting and mildly unsettling.


Final Thoughts (From Someone Who Has Watched an AI Miss a Jump 47 Times)

So why are large language models so terrible at video games?

Because they are:

  • Disembodied
  • Non-reactive
  • Overly reliant on description
  • Detached from real-time feedback
  • And fundamentally built for a different kind of problem

They are brilliant in the abstract and awkward in the immediate.

They can tell you how to win, but they can’t quite do the winning.

At least, not yet.

And honestly?

There’s something comforting about that.

Because in a world where machines are increasingly capable, it’s nice to know there’s still a domain where human intuition, reflexes, and sheer stubborn persistence reign supreme.

Where success isn’t about generating the most plausible answer—but about surviving the next ten seconds.

Where the difference between victory and failure is not a well-constructed sentence, but a perfectly timed jump.

And if you’ll excuse me, I have a boss fight to retry.

For the eighteenth time.

Without a TED Talk.

Just vibes.
