
749 points noddybear | 2 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.
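To make the order-of-magnitude comparison concrete, here is a small sketch. The function name and the scores are made up for illustration and are not FLE results; the only assumption is that each model ends a run with a single positive production figure.

```python
import math


def order_of_magnitude(production: float) -> int:
    """Return floor(log10(production)), the power of ten a score falls in."""
    if production <= 0:
        raise ValueError("production must be positive")
    return math.floor(math.log10(production))


# Hypothetical scores, invented purely to show the comparison.
scores = {"model_a": 950.0, "model_b": 12_000.0}
magnitudes = {name: order_of_magnitude(value) for name, value in scores.items()}

# The gap is measured in powers of ten rather than raw units.
gap = magnitudes["model_b"] - magnitudes["model_a"]
```

Comparing the exponents rather than the raw totals keeps the ranking meaningful even as absolute production grows geometrically.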

The project began 18 months ago, after years of playing Factorio convinced me of its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
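A minimal sketch of both ideas, assuming the TCP channel is Factorio's RCON console, which speaks the Source RCON packet format. The helper names (`encode_rcon_packet`, `decode_rcon_packet`, `lua_command`, `get_tick`) are hypothetical and are not FLE's actual API; the `send` callable stands in for a real socket round trip, so nothing here touches the network.

```python
import struct
from typing import Callable

SERVERDATA_EXECCOMMAND = 2  # Source RCON packet type for "run this command"


def encode_rcon_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame a command: little-endian size, id, type, body, two NUL bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    return struct.pack("<i", len(payload)) + payload


def decode_rcon_packet(data: bytes) -> tuple[int, int, str]:
    """Inverse of encode_rcon_packet for a single complete packet."""
    (size,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    body = data[12 : 4 + size - 2].decode("utf-8")  # drop the two trailing NULs
    return request_id, packet_type, body


def lua_command(lua_source: str) -> str:
    """Wrap Lua so the console executes it silently (/sc = /silent-command)."""
    return "/sc " + lua_source


def get_tick(send: Callable[[str], str]) -> int:
    """Hypothetical type-hinted wrapper: run Lua, parse the reply into an int."""
    return int(send(lua_command("rcon.print(game.tick)")))
```

The point of the second layer is that the agent only ever sees `get_tick() -> int` style signatures, while the Lua string and packet framing stay hidden behind them.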

Agents interact with FLE through a REPL pattern:

1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
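The observe / generate / feedback loop above can be sketched in a few lines. This is an illustration under assumptions, not FLE's implementation: `policy` is a hypothetical stand-in for the LLM call, and the generated code runs in a plain Python namespace rather than against the game.

```python
import contextlib
import io
from typing import Callable


def run_step(code: str, namespace: dict) -> str:
    """Execute one program; return its stdout plus any exception message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # Errors are returned as feedback instead of crashing the loop.
        buffer.write(f"{type(exc).__name__}: {exc}\n")
    return buffer.getvalue()


def repl_loop(policy: Callable[[str], str], steps: int) -> list[str]:
    """Alternate between asking the policy for code and running that code."""
    namespace: dict = {}  # persists across steps, so variables survive
    observation = ""      # the first observation is empty
    history = []
    for _ in range(steps):
        code = policy(observation)
        observation = run_step(code, namespace)
        history.append(observation)
    return history


# Scripted stand-in for a model: returns a fixed sequence of programs.
programs = iter(["x = 2\nprint(x)", "print(x * 3)", "1 / 0"])
history = repl_loop(lambda observation: next(programs), steps=3)
```

Because the namespace persists between steps, the second program can reuse `x`, and the third step's `ZeroDivisionError` comes back as text the policy can react to on its next turn.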

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

1. myrmidon ◴[] No.43332121[source]
Fascinating. Would have loved to see more pictures of the bigger factories-- or is the zig-zag belt into plastic production currently the best result?

I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.

I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.

Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?

2. noddybear ◴[] No.43332317[source]
I have some pictures of bigger factories - but they tend to be filled with artefacts and general nonsense. I'll dig them out and add them to the appendix. The zig-zag into plastic production was the best 'lab' result, as it's pretty clear what the agent is doing.

Yes, the agents can consistently produce economic growth in-game - but we don't really see a take-off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc. to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.
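As a rough picture of what an "imperative blueprint" might look like, here is a sketch of a reusable layout function. Everything in it is hypothetical: the `Placement` type, the entity names, and the abstract grid geometry are invented for illustration and are not FLE's API or verified against in-game footprints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    entity: str  # hypothetical entity label, not a real game prototype name
    x: int       # top-left corner on an abstract tile grid
    y: int


def mining_row(origin_x: int, origin_y: int, drills: int, size: int = 3) -> list[Placement]:
    """Plan a row of square drills with a belt running along their bottom edge."""
    plan = []
    for i in range(drills):
        x = origin_x + i * size
        plan.append(Placement("drill", x, origin_y))
        # One belt tile under each column of the drill's footprint.
        for dx in range(size):
            plan.append(Placement("belt", x + dx, origin_y + size))
    return plan


plan = mining_row(0, 0, drills=2)
```

An agent that defined a helper like this once could then call it with different origins to stamp out repeated structures, instead of regenerating every placement by hand.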

Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.

All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.