LLMs Playing Zork - Affine Layer

Can language models save us from the drudgery of playing video games?

Written by Christopher Hesse — April 17^th, 2025

"You know what the biggest problem with pushing all-things-AI is? Wrong direction. I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes."

— Joanna Maciejewska — March 29, 2024

Updated August 8, 2025: added o3 and o3-pro models

In order to gauge the ability of these models to play video games for us, we can evaluate different models on the best-selling text-based adventure-game "ZORK I: The Great Underground Empire".

Given that the models first became popular with the text modality, a pure-text game like Zork should not be too out-of-distribution for them. Indeed, the models have likely been trained on transcripts of the game. With minimal prompting they output valid commands and exhibit knowledge of the game world.

(Game being played by gemini-2.0-flash on a pretend PDP-10 inspired terminal with GT40 font)

Environment

Zork is a text-adventure game first released in 1977 for the PDP-10 mainframe computer. The creators founded a company named Infocom a couple of years later, published a bunch of successful text-adventure games, invested heavily into building an unpopular relational database, and closed in 1989. There is a great documentary for those interested in the details.

Zork I was quite popular at its time, selling over 400,000 copies. As you can see in the above video, the player inputs a command e.g. >open mailbox and the game describes what happens as a result.

The goal of the game is to acquire 20 treasures. Your score increases when you first pick up a treasure or when you put it in a trophy case inside the white house.

One can complete the game in 228 moves (ending with a score of 350), and speedrun the game in less than 3 minutes. The game has a small amount of randomness, so it's difficult to make a bot that beats the game without looking at the output.

Evaluation

model	score	actions	cost per 100 actions	time per action	transcript
gpt-4o-mini	40	202	$0.28	6s	transcript
o1	78	200	$14.43	30s	transcript
o3-mini	30	200	$1.27	9s	transcript
o3	50	200	$1.27	15s	transcript
o3-pro	204	203	$14.67	56s	transcript
deepseek-chat	40	32	$0.31	5s	transcript
deepseek-reasoner	0	128	$0.86	32s	transcript
gemini-2.0-flash	44	200	$0.16	3s	transcript
gemini-2.5-pro	0	1	?	62s	transcript
human	239	200	$11.10	6s	transcript

Each model was only run once, and evaluations can be high variance, so keep that in mind when looking at these scores. See the appendix for more information on the evaluation setup. All participants were given the text of the game manual, a map, and a guide on how to beat the game.

gpt-4o-mini drops its lantern right before going through the one-way trap door, it waits there for the rest of the episode but somehow does not get eaten by a grue. Right before the episode ends, the model tries to output a message to anyone watching: "I'm sorry, but it looks like we're stuck here with no viable options to proceed. Would you like to start over or try a different approach?", but this is interpreted as a command and does nothing.

deepseek-chat makes it to the troll then:

tries to attack the troll with a sword
realizes it doesn't have a sword
attempts to attack the troll with a lantern
checks its inventory for other things, finding only a leaflet
drops the leaflet
attempts to attack the troll with the lantern again
instead of losing the fight, attempts to end the game by commanding the player to die
issues the quit command, ending the episode after 32 actions

o3-mini finds a rainbow and then yells at it to try to make a pot of gold appear. There is a pot of gold, but it's at the other end of the rainbow as described in the guide.

o1 actually follows the guide pretty well, and does the best of all tested models under the normal evaluation conditions. After getting the painting, the model doesn't go to the studio room to go up the chimney and instead wanders around The Great Underground picking up treasures. The model is overburdened though and must often drop a treasure to pick up another one. The model never makes it back to the white house but does manage to kill the troll.

deepseek-reasoner forgets the sword and, upon finding the troll and realizing it is unarmed, restarts the game. I manually terminated the episode around step 128 as the model had not made any progress in some time.

The model eventually seems to forget that it should output only the commands and starts outputting some other text before the command. In addition, if the max output length is not limited, it will not only output multiple commands, but will hallucinate responses from the game and perhaps include some advice to the player

gemini-2.0-flash follows the guide very closely and almost gets the painting, but, because the model picked up the leaflet at the start of the game, the player cannot climb the chimney and the whole thing goes off the rails. I re-ran this a few times with a customized guide where I instructed the model to drop the leaflet, and it actually got the highest score achieved by a model at 98 points (transcript) but is not included on the table because it's cherry-picked and also a different evaluation setup.

gemini-2.5-pro thinks for a minute and then returns an empty response with the reason RECITATION on the very first step, and the episode is ended due to no output from the model.

human follows the guide almost perfectly, with the exception of two (frankly embarassing) blunders: attempting to light a candle with an unlit match, and attempting to take an air pump with the command "take air pump".

o3 dies somewhat early on, and then continues for a few more tries but oddly just restarts the game periodically when playing.

o3-pro, which is just the o3 model but with more "thinking", performed the best so far. Initially, there was some command filtering in the evaluation setup that was confusing the model, but after removing the filtering, the model mostly stopped getting stuck in a loop retrying the same command over and over again.

The model makes some mistakes but mostly seems to recover from them and continue making progress. I continued the episode and it did end up getting stuck, but much further into the game.

AGI Factor

While the models in this test initially make a decent amount of progress, most fall significantly behind the human baseline. Specifically, once they run into any sort of problem they seem unable to recover.

o1 has the most compelling gameplay in this evaluation and perhaps given enough steps could beat the game. However, the model never makes it back to the white house in the first 200 steps and doesn't seem to move around with any particular purpose.

For this evaluation, I have tried to specifically avoid prompt engineering for the LLM to perform better on this game and only sample the command directly (no "state your reasoning"-style prompts, although if the model does its own internal reasoning, that is fine). As an example of prompt engineering, while Claude Plays Pokemon is amazing, it does seem to require a somewhat elaborate prompt + tool setup.

I would say that the more "AGI" a model is, the less prompting tricks or other custom engineering should be required. For a given task, the "AGI factor" would be high if I give it the same information I would give a person, and it performs as good as or better than the person on the task (averaged across all tasks of interest).

It seems like a sufficiently advanced model should be able to beat almost any video game in a reasonable number of steps, perhaps given a GameFAQs guide and a map.

For Zork in particular, I believe a model could be directly trained to beat this game (a script to play the game is straightforward to write) or perhaps be trained to be good at most video games. However, I would hope that, if the models can exhibit agentic behavior, that better models would just be better at video games without being specifically trained to be good at them.

o1 certainly makes this direction seem promising, and it does seem like continued progress on reasoning models could happen to create models that would be able to beat this game in a similar number of steps as the human baseline.

Update: o3-pro does get much further in the game, and could plausibly beat it if given multiple chances (but that would be kind of expensive). Perhaps the next reasoning model will be able to beat it on the first try.

Note that, even though I am using games as a test set that few people care to optimize for directly, the model may have already been trained on guides or videos of people playing these games. It can be hard to avoid contamination, so I chose to provide the models with a large amount of information about the game to level the playing field. Alternatively, having a game that was never published on the internet (or was published after the model was trained) would avoid the issue.

Credits

Thanks to Joseph Doliner for reviewing this article and Marc Hesse for the human baseline.

Appendix: Evaluation Details

Evaluation rules:

200 action limit - every command entered counts as max(1, moves) actions. Some episodes go over the limit because they end with a compound command that does multiple moves.
The episode is terminated early if the same command is issued 10 times in a row (just to save sampling costs)
The map is not provided to models that cannot process images.
The human baseline received the same context as the model (no web searches allowed)

Here is the system prompt provided to the model for evaluation:

You are playing a text-based interactive fiction video game. You will see the game output in text form. The game output is a dialog between the game and you, the player. Whenever you send a command to the game, the game describes the outcome of that command. Each command that you have provided in the past is prefixed with the ">" character, e.g. ">open mailbox" and all other text within the game output is from the game itself. When you see the ">" character at the start of a line, you should provide the command to execute (limited to 80 characters), followed by a newline, e.g.: >go north Or, another example: >take egg Your goal is to beat the game by collecting all the treasures in the game using the smallest number of actions. Each command executed, successfully or not, counts as at least 1 action, and will count as more than one action if the game considers it to be multiple moves. The following things will be provided: * The text of the game manual, enclosed in <manual> and </manual> tags * An image of a map of the game world, enclosed in <map> and </map> tags * The text of a guide on how to beat the game, enclosed in <guide> and </guide> tags * The game output of the current game session so far, the start of which is marked with a <game_output> tag

Here is the context provided before the game output:

Manual (the text is provided, minus the history of The Great Underground Empire)
Map
Guide

Subscribe to updates via email