
749 points by noddybear | 6 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals, which newer models quickly saturate, Factorio's complexity scales geometrically, so it won't be "solved" in the next 6 months (or possibly even years). This lets us compare models meaningfully by the order of magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago, after years of playing Factorio led me to recognise its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
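
For the curious, here's a minimal sketch of the idea: a hand-rolled client for Factorio's standard RCON protocol (Source RCON framing over TCP) that sends Lua through /silent-command, plus a hypothetical typed wrapper on top. The host, port, and password are assumptions for a local headless server started with --rcon-port 27015 and a matching --rcon-password; this is illustrative, not FLE's actual client.

    import socket
    import struct

    AUTH, EXEC = 3, 2  # Source RCON packet types

    def _send(sock: socket.socket, req_id: int, ptype: int, body: str) -> None:
        # Packet = little-endian length prefix, then id, type, body, two nulls.
        payload = struct.pack("<ii", req_id, ptype) + body.encode() + b"\x00\x00"
        sock.sendall(struct.pack("<i", len(payload)) + payload)

    def _recv(sock: socket.socket) -> tuple[int, int, str]:
        (size,) = struct.unpack("<i", sock.recv(4))
        data = b""
        while len(data) < size:
            data += sock.recv(size - len(data))
        req_id, ptype = struct.unpack("<ii", data[:8])
        return req_id, ptype, data[8:-2].decode()

    def run_lua(sock: socket.socket, lua: str) -> str:
        # /silent-command runs Lua server-side without echoing to chat;
        # rcon.print is how the Lua side hands data back to the RCON caller.
        _send(sock, 1, EXEC, f"/silent-command {lua}")
        return _recv(sock)[2]

    def get_player_position(sock: socket.socket) -> tuple[float, float]:
        # Hypothetical typed wrapper in the spirit of the Python API:
        # a Lua snippet hidden behind a clean, type-hinted Python signature.
        out = run_lua(sock, 'local p = game.players[1].position; rcon.print(p.x .. "," .. p.y)')
        x, y = out.split(",")
        return float(x), float(y)

    sock = socket.create_connection(("localhost", 27015))
    _send(sock, 0, AUTH, "pass")  # authenticate first ("pass" is a placeholder)
    _recv(sock)                   # consume the auth response
    print(get_player_position(sock))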

Agents interact with FLE through a REPL pattern:

1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
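
In Python-ish pseudocode, the loop looks something like this (llm_generate and the game_api namespace are placeholders; the prompt format is illustrative, not FLE's actual one):

    import io
    import traceback
    from contextlib import redirect_stdout

    def agent_step(history: list[str], llm_generate, game_api: dict) -> str:
        # 2. Ask the model for its next action, conditioned on prior feedback.
        code = llm_generate("\n".join(history))
        buf = io.StringIO()
        try:
            # Run the generated code against the game API, capturing stdout.
            with redirect_stdout(buf):
                exec(code, dict(game_api))
            feedback = buf.getvalue()
        except Exception:
            # 3. Exceptions become part of the feedback, not a crash.
            feedback = buf.getvalue() + traceback.format_exc()
        history.append(code)
        history.append(feedback)  # 1. Observed by the model on the next step.
        return feedback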

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources (a hypothetical task spec is sketched below)
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map
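
To make "structured task with fixed resources" concrete, here's what a lab-play task specification might look like - the field names and values are my assumptions for illustration, not FLE's real schema:

    from dataclasses import dataclass

    @dataclass
    class LabPlayTask:
        name: str                           # e.g. "iron-plate-throughput"
        starting_inventory: dict[str, int]  # the fixed resources the agent begins with
        target_item: str                    # item whose production is measured
        target_rate: float                  # required items per minute
        time_limit_ticks: int               # episode cutoff (60 ticks = 1 second)

    smelting_task = LabPlayTask(
        name="iron-plate-throughput",
        starting_inventory={"burner-mining-drill": 2, "stone-furnace": 2, "coal": 50},
        target_item="iron-plate",
        target_rate=16.0,
        time_limit_ticks=60 * 60 * 15,  # 15 in-game minutes
    )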

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude 3.5 Sonnet is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

1. Starlord2048 No.43334138
[flagged]
replies(3): >>43334429, >>43334450, >>43335187
2. andai No.43334429
Fascinating. I was thinking about how the factory should be communicated to the model and represented "internally". Images aren't the right solution (very high bandwidth for no real benefit). An ASCII grid of the game's tiles (more likely, a small chunk of it) is orders of magnitude better, but you still don't need to simulate every tile in a conveyor. It's just a line, right? So the whole thing is actually a graph!

That compresses nicely into text, I imagine.
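
Something like this, maybe (a rough sketch with invented details: machines as nodes, each unbroken belt run as one weighted edge):

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        entity: str                # e.g. "electric-mining-drill"
        position: tuple[int, int]  # anchor tile, kept for placement

    @dataclass
    class FactoryGraph:
        nodes: dict[str, Node] = field(default_factory=dict)
        # edges[(src, dst)] = (item carried, belt length in tiles)
        edges: dict[tuple[str, str], tuple[str, int]] = field(default_factory=dict)

        def to_text(self) -> str:
            # Compact serialization a model could consume directly.
            lines = [f"{k}: {n.entity} @ {n.position}" for k, n in self.nodes.items()]
            lines += [f"{a} --{item}, {length} tiles--> {b}"
                      for (a, b), (item, length) in self.edges.items()]
            return "\n".join(lines)

    g = FactoryGraph()
    g.nodes["drill1"] = Node("electric-mining-drill", (0, 0))
    g.nodes["furnace1"] = Node("stone-furnace", (12, 0))
    g.edges[("drill1", "furnace1")] = ("iron-ore", 10)
    print(g.to_text())  # a 10-tile belt costs one line, not 10 grid cells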

I'd like to hear more details about your symbolic approach!

replies(2): >>43334643, >>43335562
3. noddybear No.43334450
This is really interesting. Do you have a repo or anything describing the approach? I would be particularly interested in trying it in FLE to see how it affects layout design. How are you performing the spatial reasoning?
4. HideousKojima No.43334643
>An ASCII grid of the game's tiles (more likely, a small chunk of it) is orders of magnitude better, but you still don't need to simulate every tile in a conveyor. It's just a line, right? So the whole thing is actually a graph!

Until you accidentally feed a different material into your belt and need to clean it up

5. mlsu No.43335187
Yes!

The way I think of it is this. Yes, the LLM is a "general reasoner." However, it's locked in a box, where the only way in and out is through the tokenizer.

So there's this huge breadth of concepts and meanings that cannot be fully described by words (spatial reasoning, smells, visual relationships, cause/effect physical relationships, etc.). The list of things that can't be described by words is long. The model would be capable of generalizing on those; it would optimize to capture them. But it can't, because the only thing that can fit through the front door is tokens.

It's a huge and fundamental limitation. I think Yann LeCun has been talking about this for years now, and I'm inclined to agree with him. This limitation is somewhat obscured by the fact that we humans can relate to all of these untokenizable things -- using tokens! So I can describe what the smell of coffee is in words, and you can immediately reconstruct that based on my description, even though the actual smell of coffee is not encoded in the tokens of what I'm saying at all.

6. nostrademons No.43335562
Probably the memory model of the game itself is the best representation. The devs have already spent a significant amount of development cycles optimizing this down to a minimal compressed form - belt runs, for example, are one entity regardless of how long they are. The LLM is then effectively modeling the degrees of freedom of the game simulation and picking code paths within them.
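
Roughly this kind of thing - a toy sketch of the run-compression idea, not the game's actual data structures:

    from dataclasses import dataclass

    @dataclass
    class BeltRun:
        start: tuple[int, int]
        direction: tuple[int, int]  # unit step, e.g. (1, 0) for east
        length: int                 # tiles covered by this single entity

    def compress(tiles: list[tuple[int, int]], direction: tuple[int, int]) -> list[BeltRun]:
        # Merge sorted, same-direction belt tiles into maximal runs.
        runs: list[BeltRun] = []
        for tile in tiles:
            if runs:
                last = runs[-1]
                nxt = (last.start[0] + last.direction[0] * last.length,
                       last.start[1] + last.direction[1] * last.length)
                if tile == nxt:        # contiguous with the current run: extend it
                    last.length += 1
                    continue
            runs.append(BeltRun(tile, direction, 1))
        return runs

    # Ten eastbound belt tiles collapse into a single run entity.
    print(compress([(x, 0) for x in range(10)], (1, 0)))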