
749 points noddybear | 2 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
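For anyone curious what "piping Lua over TCP" looks like in practice: the Factorio headless server speaks the Source RCON protocol, so you can drive it from the standard library alone. Below is a minimal sketch of that idea, not FLE's actual client - the host, port, and password are placeholders, and packet framing and auth checking are simplified.

    import socket
    import struct

    def _packet(req_id: int, ptype: int, body: str) -> bytes:
        # Source RCON framing: size prefix, request id, packet type, body, two null bytes
        payload = struct.pack("<ii", req_id, ptype) + body.encode() + b"\x00\x00"
        return struct.pack("<i", len(payload)) + payload

    def run_lua(host: str, port: int, password: str, lua: str) -> str:
        """Send one Lua snippet to a Factorio server over RCON and return its output."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(_packet(1, 3, password))        # SERVERDATA_AUTH
            sock.recv(4096)                              # auth response (checking omitted)
            sock.sendall(_packet(2, 2, "/sc " + lua))    # /sc = /silent-command
            resp = sock.recv(4096)                       # single recv; real code should reassemble packets
            return resp[12:-2].decode(errors="replace")  # strip 12-byte header and trailing nulls

    # Example: count the iron plates held by player 1
    # print(run_lua("localhost", 27015, "factorio",
    #               "rcon.print(game.players[1].get_item_count('iron-plate'))"))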

Agents interact with FLE through a REPL pattern:

1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
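
As a rough illustration of that loop (the names below are placeholders, not the actual FLE interface):

    from typing import Callable

    def repl_loop(generate: Callable[[str], str],
                  execute: Callable[[str], str],
                  max_steps: int = 16) -> list[tuple[str, str]]:
        """Sketch of the observe -> act -> feedback cycle described above.

        `generate` maps a prompt to a Python program (the LLM call);
        `execute` runs that program against the game and returns its stdout,
        or raises on error. Both are illustrative stand-ins.
        """
        history: list[tuple[str, str]] = []
        observation = "Initial game state"
        for _ in range(max_steps):
            # 1. Observe: the prompt includes the output of the last action
            prompt = f"Observation:\n{observation}\n\nWrite Python for your next action."
            # 2. Act: the model emits a Python program
            program = generate(prompt)
            # 3. Feedback: stdout or the exception becomes the next observation
            try:
                observation = execute(program)
            except Exception as exc:
                observation = f"Error: {exc!r}"
            history.append((program, observation))
        return history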

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

kevmo314 No.43332331
Does it provide screenshots of the game state? I, too, would struggle to play the game effectively if I could not see it visually.
replies(2): >>43332496 #>>43332607 #
martbakler No.43332607
Just to jump in here as one of the authors.

We designed the API to be as spatially descriptive as possible (including x-y coordinates and neighbors in game state descriptions), and the agents have tools to help them carry out actions that would benefit from vision (e.g. finding buildable areas of different sizes on the map, placing entities next to other entities, etc.).
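
To make that concrete, a textual observation might look roughly like this (the schema below is illustrative, not FLE's actual output format):

    from dataclasses import dataclass

    @dataclass
    class EntityObservation:
        # Hypothetical shape of a spatially descriptive entity record
        name: str
        x: float
        y: float
        direction: str
        neighbors: list[str]

        def describe(self) -> str:
            return (f"{self.name} at ({self.x}, {self.y}) facing {self.direction}; "
                    f"adjacent to {', '.join(self.neighbors) or 'nothing'}")

    drill = EntityObservation("burner-mining-drill", 12.5, -3.5, "north",
                              ["stone-furnace at (12.5, -1.5)"])
    print(drill.describe())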

As Jack said, we completed most of the lab tasks manually ourselves, and while it took us a lot longer than it would have with vision, the tasks were still doable, and human performance is significantly higher than that of current agents. We are thinking of supporting vision for future evals, but in the small number of tests we ran, current models got even more confused, since the number of entities on the map grows quite quickly. This is likely because VLMs are notoriously bad at visual reasoning on images with lots of detail, and in a game where one misplaced entity in a large factory breaks everything, the errors start to compound.

replies(1): >>43332849 #
Hammershaft ◴[] No.43332849[source]
Someone below mentioned the ASCII interface of Dwarf Fortress as being ideal for this, and I wonder if that kind of representation with a legend might produce better spatial results. The drawback I see is that elements in Factorio can be layered on a tile, or have properties that are not visually obvious in ASCII, so the LLM would need to be able to introspect on the map.
replies(1): >>43332896 #
noddybear No.43332896
I think your intuition is correct about the amount of information that needs to be encoded into an ASCII char. You could potentially use Unicode to pack more into each char, e.g. direction, type, status etc. Or make each representation available on demand, e.g. 'show me the direction of all inserters in a 10-tile radius'.
replies(1): >>43333241 #
vessenes No.43333241
Well, we learned last month on HN that you can encode arbitrary data into Unicode; anecdotally, o3-mini-high at least could decode it if given instructions.

I wonder what a quick way to calculate how many Unicode characters you'd need is... I guess every entity + four orientations. Underground belts and pipes seem tough, but I guess you could just add an encoding showing whether the square has an underground pipe or belt.
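
A back-of-the-envelope count, using assumed round numbers rather than actual game data, suggests the character budget is small:

    # Assumptions, not game data: on the order of 200 placeable entity types,
    # each with up to 4 orientations and a few status flags.
    entity_types = 200      # assumption
    orientations = 4
    status_flags = 3        # e.g. ok / no-power / blocked (assumption)
    print(entity_types * orientations * status_flags)   # 2400
    # Comfortably within Unicode's Private Use Area (U+E000..U+F8FF, 6400 code points)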

I propose this would work. I think I'll give it a try today... I'd love Dwarf Fortress Factorio. That said, the encode/decode phase seems like a lot of tokens for a model that's not trained to understand the Unicode 'map'. Seems like you'd want to fine-tune something at least. Maybe a layout model.

replies(1): >>43333393 #
noddybear No.43333393
Check out the 'ObserveAll' tool in the repo - it's deprecated now, but it pipes all the raw entities on the map back to the agent. You could procedurally convert it to a Unicode format given a pre-defined codebook (which you give to the agent) before letting the agent observe and reason over it.
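
A rough sketch of that conversion, with a made-up codebook and entity-list format (not the real ObserveAll output), might look like this:

    # Map (entity, direction) pairs to single characters and render them onto a grid.
    CODEBOOK = {
        ("inserter", "north"): "\u2191",          # up arrow
        ("inserter", "east"): "\u2192",           # right arrow
        ("transport-belt", "north"): "\u25b2",    # filled triangle
        ("burner-mining-drill", "north"): "D",
    }

    def render(entities: list[tuple[str, str, int, int]],
               width: int = 10, height: int = 10) -> str:
        grid = [["." for _ in range(width)] for _ in range(height)]
        for name, direction, x, y in entities:
            grid[y][x] = CODEBOOK.get((name, direction), "?")
        return "\n".join("".join(row) for row in grid)

    print(render([("inserter", "north", 2, 3),
                  ("burner-mining-drill", "north", 4, 4)]))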