
749 points noddybear | 8 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
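For the curious: the Factorio headless server exposes its console over the Source RCON protocol, which is presumably the TCP channel described above. A rough sketch of how a Lua command gets framed on the wire (packet builder only, no networking; this is not the FLE API, just an illustration of the protocol):

```python
import struct

SERVERDATA_EXECCOMMAND = 2  # Source RCON "run command" packet type

def build_rcon_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame a Source RCON packet: int32 size, int32 id, int32 type,
    null-terminated body, trailing null byte. The size field counts
    everything after itself."""
    payload = struct.pack("<ii", request_id, packet_type) + body.encode() + b"\x00\x00"
    return struct.pack("<i", len(payload)) + payload

# A Lua snippet as it might be piped to the console; "/sc" is Factorio's
# shorthand for /silent-command, which runs Lua without chat echo.
lua = "/sc rcon.print(game.tick)"
packet = build_rcon_packet(1, SERVERDATA_EXECCOMMAND, lua)
```

Sending `packet` to the server's RCON port (and parsing the reply with the same framing) is all it takes to run Lua remotely.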

Agents interact with FLE through a REPL pattern:

1. They observe the world (seeing the output of their last action)
2. They generate Python code to perform their next action
3. They receive detailed feedback (including exceptions and stdout)
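The loop above can be sketched in miniature. The core is an executor that runs the agent's snippet and returns whatever the agent should observe next: captured stdout, or the traceback on failure. (A toy sketch; `execute_step` and the namespace contents are invented for illustration, not FLE's actual interface.)

```python
import io
import traceback
from contextlib import redirect_stdout

def execute_step(code: str, namespace: dict) -> str:
    """Run one agent-generated Python snippet and return the observation:
    captured stdout, with the exception traceback appended on failure."""
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(code, namespace)  # namespace would expose the game API
    except Exception:
        return buf.getvalue() + traceback.format_exc()
    return buf.getvalue()

# One hypothetical REPL turn: the observation from this action is what
# gets fed back to the model before it emits its next snippet.
ns = {}
obs = execute_step("x = 6 * 7\nprint('placed', x, 'belts')", ns)
```

Because state persists in `namespace` across turns, the agent can build on variables and helper functions it defined in earlier actions.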

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

1. scottmsul ◴[] No.43334576[source]
There was an HN post here not too long ago about a team that used reinforcement learning to train an agent to beat Pokémon Red. They mentioned that they had to tweak the reward function to give small rewards for exploring and big rewards for completing "essential tasks" like beating gyms.

I wonder if the same approach could be used here in Factorio. By the Pokémon Red analogy, the main "essential tasks" in Factorio are setting up automation for new items and new science packs. A good reward function could give small rewards for the production rate of each item per second, medium rewards for setting up automation for a new item, and big rewards for automating each new science pack.

Telling a Factorio agent to just "make a big factory" is like telling a Pokémon Red agent to just "beat the game": it has to be broken down into smaller steps with a very carefully tuned reward function.
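To make the idea concrete, the tiered scheme above could be sketched as a shaped reward like the following (all weights and names are invented for illustration; nothing like this ships with FLE):

```python
# Hypothetical shaped reward, following the tiers proposed above:
# per-item throughput (small), first automation of a new item (medium),
# first automated science pack (large). All weights are made up.
ITEM_RATE_WEIGHT = 0.01
NEW_AUTOMATION_BONUS = 10.0
SCIENCE_PACK_BONUS = 100.0

def shaped_reward(production_rates: dict[str, float],
                  newly_automated: set[str],
                  new_science_packs: set[str]) -> float:
    """Combine continuous throughput (items/sec) with one-time
    exploration bonuses into a single scalar reward."""
    reward = ITEM_RATE_WEIGHT * sum(production_rates.values())
    reward += NEW_AUTOMATION_BONUS * len(newly_automated)
    reward += SCIENCE_PACK_BONUS * len(new_science_packs)
    return reward
```

The one-time bonuses would be paid only on the first automation of each item, much like the "essential task" rewards in the Pokémon Red run.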

Thinking about this is really making me want to jump into this project!

replies(5): >>43334684 #>>43334703 #>>43336513 #>>43337120 #>>43343134 #
2. scottmsul ◴[] No.43334684[source]
Also, I should add: as a Factorio veteran with 2-3k hours in this game, I think the goal of making the "largest possible factory" is too vague and not the right metric. When Factorio players build large megabases, they don't go for "size" per se, but for science research per minute. The metric you should give the agents is SPM, not "largest" base!
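SPM is also easy to score mechanically. A minimal sketch, assuming the environment can report science packs consumed and elapsed game ticks (Factorio runs at 60 ticks per second, so 3600 ticks per minute):

```python
TICKS_PER_MINUTE = 60 * 60  # Factorio runs at 60 ticks/second

def science_per_minute(packs_consumed: int, ticks_elapsed: int) -> float:
    """Average science packs consumed per in-game minute."""
    return packs_consumed / (ticks_elapsed / TICKS_PER_MINUTE)
```

This rewards throughput directly, with no credit for sprawl.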
replies(2): >>43337182 #>>43337228 #
3. noddybear ◴[] No.43334703[source]
In FLE, you have access to milestones representing the first time a new entity was created, but coming up with a stratification of rewards for different degrees of automation would be really interesting. Join us!
4. mclau156 ◴[] No.43336513[source]
The same approach could be used in life
5. martbakler ◴[] No.43337120[source]
This is interesting. One of our findings was that Claude was capable of essential tasks and simple automation (e.g. an iron gear wheel factory in lab-play) but didn't even try them during the "build the biggest factory" episodes. So the models can do these essential tasks, but when given a general goal like "complete the game", they don't have the long-term planning to even attempt them. Often they just built uncoordinated small-scale constructs without trying to scale up existing factories.

That was also one of our goals: to find out how the models act when given a very vague and general objective.

6. soulbadguy ◴[] No.43337182[source]
Ahhh, another Factorio addict :) Curious, how long was your first playthrough? (Assuming v1.x, launching the first rocket.)
7. csense ◴[] No.43337228[source]
Agree, "largest" base has some pathologies.

Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.

This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.
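A quick illustration of the pathology, scoring "size" as bounding-box area (illustrative only; FLE's actual open-play scoring is production-based):

```python
def bounding_box_area(positions: list[tuple[int, int]]) -> int:
    """Area of the axis-aligned bounding box around all placed machines."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# Two lone machines a thousand tiles apart...
sparse = bounding_box_area([(0, 0), (1000, 1)])
# ...outscore a dense 30x30 block of 900 machines.
dense = bounding_box_area([(x, y) for x in range(30) for y in range(30)])
```

Any area-style metric is gameable this way, which is one more argument for a throughput metric like SPM.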

8. Gasp0de ◴[] No.43343134[source]
Did you read the page? Because they did give rewards per item produced, and more complex items gave higher rewards.