
749 points | noddybear | 3 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
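To make the Python-wraps-Lua idea concrete, here is a minimal illustrative sketch, not FLE's actual API: a type-hinted Python method that compiles down to a Lua console command, with a stub client that collects commands instead of sending them over TCP. The names (`FactorioClient`, `place_entity`) and the exact Lua snippet are assumptions for illustration.

```python
# Hypothetical sketch of the Python-to-Lua bridge described above.
# The method names and Lua strings are illustrative, not FLE's real interface.

def lua_place_entity(name: str, x: float, y: float) -> str:
    """Render a Lua console command that places an entity at (x, y)."""
    return (
        f'/c game.surfaces[1].create_entity{{'
        f'name="{name}", position={{x={x}, y={y}}}, force="player"}}'
    )

class FactorioClient:
    """Stand-in for a TCP console client; records commands instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def send(self, command: str) -> None:
        # A real client would write `command` to the game's TCP console here.
        self.sent.append(command)

    def place_entity(self, name: str, x: float, y: float) -> None:
        """Type-hinted Python call that lowers to a Lua console command."""
        self.send(lua_place_entity(name, x, y))

client = FactorioClient()
client.place_entity("burner-mining-drill", 10, 12)
print(client.sent[0])
```

The point of the wrapper is that the agent never sees Lua: it works against familiar, type-hinted Python calls, while each call is translated into a console command string underneath.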

Agents interact with FLE through a REPL pattern:

1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
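The observe/generate/feedback loop above can be sketched with stand-ins for the environment and the model. Everything here (`MockEnv`, `mock_agent`, the action strings) is hypothetical scaffolding, not FLE's real classes:

```python
# Illustrative sketch of the observe -> generate -> feedback REPL loop.
# MockEnv and mock_agent are stand-ins for FLE's environment and an LLM.

def mock_agent(observation: str) -> str:
    """Pretend LLM: returns the next action as a Python code string."""
    if "no drill" in observation:
        return "place_drill()"
    return "inspect_factory()"

class MockEnv:
    def __init__(self) -> None:
        self.has_drill = False

    def observe(self) -> str:
        # 1. the agent sees the current world state
        return "factory has a drill" if self.has_drill else "no drill present"

    def execute(self, code: str) -> str:
        """Run the agent's code and return stdout/exception text as feedback."""
        if code == "place_drill()":
            self.has_drill = True
            return "OK: drill placed"
        return "OK: inspected"

env = MockEnv()
feedback = ""
for step in range(2):
    obs = env.observe() + ("\n" + feedback if feedback else "")
    code = mock_agent(obs)        # 2. generate Python for the next action
    feedback = env.execute(code)  # 3. receive detailed feedback
    print(f"step {step}: {code} -> {feedback}")
```

The key property is that feedback from each execution (including errors) is folded into the next observation, so the agent can debug its own actions turn by turn.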

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude 3.5 Sonnet is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

devit | No.43334839
Seems like it might be more effective to use the LLMs to write a program that plays Factorio rather than having them pick the next action given a game state.

Also, in general I think the issue with Factorio is that you can find an "optimal" factory design and build order and just follow it every time; perhaps starting with a suboptimal building layout already present, with restrictions like being unable to change it or build others of the same type, could help.

replies(1): >>43334956 #
1. noddybear | No.43334956
This is exactly how FLE works: the agent writes a program that executes its policy.

I think you bring up a good point: we could create tasks where the goal is to optimise a static factory, starting from a kernel of functionality like a 'steam engine power supply' etc.

replies(1): >>43337137 #
2. devit | No.43337137
But it seems to be used to generate short snippets that, in the examples, amount to command lists, as opposed to generating a full program that actually plays the whole game by itself.

The model could then be fed back the results of running the program and iteratively revise it as needed.

I.e. prompt first with "Write a program that can play Factorio automatically given an interface <INTERFACE SPECIFICATION> and a set of goals in <GOAL FORMAT>, and produces text output that can help determine whether the program is working correctly and whether tasks are performed efficiently and goals are reached as fast as possible"

And then with "the program was run and produced this text output: <TEXT OUTPUT> Determine any possible bugs, avenues of improvements or missing output information and modify the program accordingly, printing the new version".

And iterate until there doesn't seem to be an improvement anymore.
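The propose/run/revise loop being suggested can be sketched with toy stand-ins. Here `run_program` fakes "run the program and score its text output", and the candidate step fakes the LLM's revision; both are fictional, the real loop would feed the program's text output back into a prompt:

```python
# Toy sketch of the generate -> run -> revise loop proposed above.
# run_program and the revision step are stand-ins for executing the generated
# player program and for an LLM rewriting it.

def run_program(param: int) -> int:
    """Pretend to run the generated program; derive a score from its output."""
    return -abs(param - 7)  # fictional objective: best performance at param == 7

param = 0
best = run_program(param)
while True:
    candidate = param + 1            # "modify the program accordingly"
    score = run_program(candidate)   # "the program was run and produced this output"
    if score > best:                 # keep the revision only if it improved
        param, best = candidate, score
    else:
        break                        # iterate until there's no improvement
print(f"converged at param={param}, score={best}")
```

The stopping rule matches the proposal: keep revising while each new version measurably improves the program's output, and halt once a revision no longer helps.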

replies(1): >>43343146 #
3. noddybear | No.43343146
If I understand you correctly, this approach is sort of supported in FLE: agents can create functions that encapsulate more complex logic. However, interaction is still synchronous/turn-based. To do what you propose, you would need to create event listeners that can trigger the agent's program whenever appropriate.
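A minimal sketch of that event-listener idea: callbacks registered per event type and fired when a (simulated) game event occurs, so the agent's program is triggered by events rather than polled each turn. The event names and payloads are invented for illustration; nothing here is FLE's API:

```python
# Hypothetical event-listener sketch: the agent subscribes to game events
# instead of acting only on its synchronous turn.

from collections import defaultdict
from typing import Callable

listeners: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def on(event: str, callback: Callable[[dict], None]) -> None:
    """Register a callback for a named event type."""
    listeners[event].append(callback)

def emit(event: str, payload: dict) -> None:
    """Fire all callbacks registered for this event (a real game would emit these)."""
    for cb in listeners[event]:
        cb(payload)

log: list[str] = []
# The agent's program reacts whenever the event fires, not on a fixed turn schedule.
on("resource_depleted", lambda e: log.append(f"replan mining at {e['position']}"))
emit("resource_depleted", {"position": (4, 2)})
print(log[0])
```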