
749 points noddybear | 1 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago when, after years of playing Factorio, I recognised its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
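As a rough illustration of the first point, here is a minimal sketch of how a Lua snippet could be framed for the wire. It assumes the TCP channel is Factorio's RCON port speaking Source RCON framing and that "/c" is the console prefix for running Lua; the post itself only says "over TCP", so treat the packet constants, the function names, and the example Lua string as assumptions rather than FLE's actual code.

```python
import struct

# Packet types from the Source RCON protocol (assumed here; the post only
# says Lua is piped into the console "over TCP").
SERVERDATA_AUTH = 3
SERVERDATA_EXECCOMMAND = 2


def encode_packet(packet_id: int, packet_type: int, body: str) -> bytes:
    """Frame one packet: little-endian size, id, type, body, two NUL bytes."""
    payload = (
        struct.pack("<ii", packet_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    # The size field counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Inverse of encode_packet for a single complete packet."""
    (size,) = struct.unpack_from("<i", data, 0)
    packet_id, packet_type = struct.unpack_from("<ii", data, 4)
    # Body sits between the 12-byte header and the two trailing NUL bytes.
    body = data[12 : 4 + size - 2].decode("utf-8")
    return packet_id, packet_type, body


def lua_command(lua_source: str) -> str:
    """Wrap Lua source as a console command ("/c" prefix is an assumption)."""
    return "/c " + lua_source


packet = encode_packet(7, SERVERDATA_EXECCOMMAND, lua_command("rcon.print(1 + 1)"))
```

In a real client the bytes in `packet` would be written to a socket after an auth packet; that part is left out so the sketch stays runnable without a game server.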

Agents interact with FLE through a REPL pattern:
1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
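The loop above can be sketched in a few lines. This is not FLE's implementation, just a minimal stand-in: `run_agent_step` is a hypothetical name, and the "world" is an ordinary Python namespace rather than a running game.

```python
import contextlib
import io


def run_agent_step(code: str, namespace: dict) -> str:
    """Execute one agent-written program and return the text it observes next."""
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            # The namespace persists between steps, like a REPL session.
            exec(code, namespace)
    except Exception as exc:
        # Report the error as feedback instead of crashing the loop.
        return stdout.getvalue() + f"{type(exc).__name__}: {exc}\n"
    return stdout.getvalue()


session: dict = {}
first = run_agent_step("x = 2\nprint(x + 1)", session)
second = run_agent_step("print('before')\nraise ValueError('no iron')", session)
```

Here `first` holds the captured stdout of the first program, and `second` holds both the partial output and the exception text, which is the kind of feedback an agent would use to decide its next action.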

We provide two main evaluation settings:
- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:
- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

WJW ◴[] No.43332084[source]
Very cool and also pretty expected results tbh. Some thoughts:

Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.

It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM (science per minute).

I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.
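The "work backwards" step can be sketched as a recursive expansion of a target rate into raw inputs. This covers only the demand calculation, not the layout or optimisation part, and the recipe table below is a small illustrative subset written from memory, so treat the numbers as assumptions.

```python
from collections import defaultdict
from fractions import Fraction

# item -> (units produced per craft, {ingredient: units consumed per craft}).
# Illustrative table; anything missing from it is treated as a raw input.
RECIPES = {
    "automation-science-pack": (1, {"copper-plate": 1, "iron-gear-wheel": 1}),
    "iron-gear-wheel": (1, {"iron-plate": 2}),
    "electronic-circuit": (1, {"iron-plate": 1, "copper-cable": 3}),
    "copper-cable": (2, {"copper-plate": 1}),
}


def raw_demand(item: str, rate, totals=None) -> dict:
    """Recursively expand a target rate of `item` into rates of raw inputs."""
    rate = Fraction(rate)
    if totals is None:
        totals = defaultdict(Fraction)
    if item not in RECIPES:
        totals[item] += rate  # raw input: accumulate and stop recursing
        return totals
    produced, ingredients = RECIPES[item]
    crafts = rate / produced  # crafts per unit time needed to hit the rate
    for ingredient, amount in ingredients.items():
        raw_demand(ingredient, crafts * amount, totals)
    return totals


science = raw_demand("automation-science-pack", 60)
circuits = raw_demand("electronic-circuit", 10)
```

With this table, 60 science packs per minute expand to 60 copper plates and 120 iron plates per minute. Turning those rates into machine counts and a physical layout is where the OR-style optimisation would come in.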

eterm ◴[] No.43332409[source]
> Building a main bus vs spaghetti belts is one of the obvious examples here

I'm an anti-bus extremist (I've even considered registering BanTheBus.com and doing an over-the-top static anti-bus website), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )

Busses are the opposite of good factorio factories. They undo a lot of the benefits of healthy just-in-time (JIT) manufacturing by encouraging massive amounts of buffer (belt-buffer).

They also encourage people to anti-learn fundamental principles. You often see people do "starter-busses" with 4 lanes of iron plates, but only fed by one actual belt worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.

They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.

Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you don't have enough input, then build more of that input. Then repeat as necessary.

There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overproduction" is stopped at source, because assemblers don't produce if they have nowhere to output into.
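To make the back-pressure argument concrete, here is a toy model (not Factorio's real mechanics; the function name, tick rates, and capacities are made up for illustration). A producer crafts one item per tick only when the belt has room, and a slower consumer drains it.

```python
def simulate(ticks: int, belt_capacity: int, consume_every: int) -> tuple[int, int, int]:
    """Return (produced, consumed, buffered) after `ticks` steps.

    The producer tries to craft one item per tick but stalls when the belt
    is full; the consumer removes one item every `consume_every` ticks.
    """
    produced = consumed = buffered = 0
    for tick in range(1, ticks + 1):
        if buffered < belt_capacity:  # room downstream, so craft
            produced += 1
            buffered += 1
        if tick % consume_every == 0 and buffered > 0:
            consumed += 1
            buffered -= 1
    return produced, consumed, buffered


short_belt = simulate(100, belt_capacity=1, consume_every=4)
long_belt = simulate(100, belt_capacity=50, consume_every=4)
```

Both runs consume the same number of items, since the consumer is the bottleneck. The only thing the longer belt adds is a larger pile of items sitting in the buffer, which is the point being made about busses.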

The biggest classic beginner mistakes in factorio are:

- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )

- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.

- Looking for guides and discovering "The Main Bus".

That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.
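The belt-related mistakes above (four bus lanes fed by one belt of smelting, and swapping an unsaturated yellow belt for red) both come down to a min(). A minimal sketch, assuming 15 items/s for yellow belt and 30 items/s for red belt; `delivered` is a hypothetical helper, not anything from the game or FLE.

```python
# Belt throughput in items per second (assumed values for yellow and red belt).
BELT_CAPACITY = {"yellow": 15.0, "red": 30.0}


def delivered(supply_per_second: float, belt: str, belts_in_parallel: int = 1) -> float:
    """Items per second reaching the far end: capped by supply and by capacity."""
    return min(supply_per_second, BELT_CAPACITY[belt] * belts_in_parallel)


unsaturated_yellow = delivered(10.0, "yellow")
unsaturated_red = delivered(10.0, "red")
four_lane_bus = delivered(15.0, "yellow", belts_in_parallel=4)
```

With 10 items/s of supply, the red belt delivers exactly what the yellow belt did, and four lanes fed by one belt of smelting still carry one belt's worth of plates.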

Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.

1. deterministic ◴[] No.43377310[source]
I have tried different approaches and ended up with a single small bus of raw materials (coal, copper, iron, stone) with everything else hanging off it. It scales amazingly well and avoids spaghetti layouts.

Oil stuff is done separately and fed into the structure where needed.