
749 points | noddybear

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals, which newer models quickly saturate, Factorio scales geometrically in complexity, so it won't be "solved" in the next 6 months (or possibly even years). This lets us meaningfully compare models by the order of magnitude of resources they can produce, creating a benchmark with longevity.

The project began 18 months ago, after years of playing Factorio and recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
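
To give a flavour of the second part, here is a minimal sketch of the wrapping pattern (names and signatures are illustrative, not the actual FLE API): a type-hinted Python method renders a Lua console command and ships it to the game over the TCP connection.

    from dataclasses import dataclass

    @dataclass
    class Position:
        x: float
        y: float

    class FactorioConsole:
        """Thin wrapper around an RCON-style TCP connection (hypothetical)."""

        def __init__(self, rcon):
            self.rcon = rcon  # any object exposing send_command(str) -> str

        def place_entity(self, name: str, position: Position) -> str:
            # Render a console command that runs Lua on the game server.
            lua = (
                f"/c game.surfaces[1].create_entity{{name='{name}', "
                f"position={{x={position.x}, y={position.y}}}, force='player'}}"
            )
            return self.rcon.send_command(lua)

The real API returns richer, typed results rather than raw strings, but the shape is the same: Python in, Lua over TCP, observations back.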

Agents interact with FLE through a REPL pattern:

1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
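
In code, the loop is roughly this shape (a sketch with hypothetical interfaces, not the real FLE ones):

    def run_episode(env, agent, max_steps: int = 64):
        # Hypothetical interfaces, for illustration only.
        observation = env.reset()  # initial task description / world state
        for _ in range(max_steps):
            # The agent sees the output of its previous action and writes
            # a Python program against the game API.
            program = agent.generate(observation)
            # Executing the program returns stdout, return values and any
            # exceptions, which become the next observation.
            observation = env.execute(program)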

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude 3.5 Sonnet is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

infogulch:
Interesting to see only a handful of complex scenarios. I've always suspected ML game agents need hundreds of tiny puzzles with hundreds of variations each to learn game mechanics properly. Like:

    The factory is not powered, place the missing power pole(s)
    The factory is missing items, place the missing belt(s)
    Craft and place these 200 assembly machines
    The assembly machine is not running for some reason, fix it
    The factory production is too low, double it
    Get to this other point in the factory as fast as possible
    Fix the brownout
    All of the above with and without bots
Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used.

I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity, where more complex scenarios are presented only after the agent scores sufficiently high on lower-complexity ones.
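
A sketch of what that bank-sampling and gating could look like (illustrative names, untested):

    import random

    def draw_eval_batch(bank, unlocked_tier, n=12):
        """Sample n scenarios at or below the agent's unlocked complexity tier."""
        eligible = [s for s in bank if s["complexity"] <= unlocked_tier]
        return random.sample(eligible, min(n, len(eligible)))

    def maybe_unlock(unlocked_tier, scores, threshold=0.8):
        # Move to the next tier only once the agent scores well enough
        # on the current one.
        if sum(scores) / len(scores) >= threshold:
            return unlocked_tier + 1
        return unlocked_tier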

noddybear:
I think generating the scenarios as you suggest (in text) is easy, but creating correct factory game states to start from is a lot harder. AFAIK it reduces to the same manual task of designing an init state and a task to complete.

infogulch:
Yes, each scenario will need someone to design it, but you can get a lot of mileage out of each. E.g. consider the "place the missing power pole" scenario: manually build a factory with a few dozen machines connected to a couple of steam engines through 20 power poles, then you can generate 400 playable puzzles/scenarios by deleting 1-2 power poles from the working starting point. Humans would find all of these equivalent, but I think agents need the explicit variation to learn the lesson properly.
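
Something like this untested sketch (entity fields made up) could enumerate the variants from one hand-built factory:

    from itertools import combinations

    def power_pole_variants(entities, max_removed=2):
        """Yield broken copies of a working factory by deleting 1-2 power poles."""
        pole_ids = [e["id"] for e in entities if e["name"] == "small-electric-pole"]
        for k in range(1, max_removed + 1):
            for removed in combinations(pole_ids, k):
                yield {
                    "entities": [e for e in entities if e["id"] not in removed],
                    "missing": list(removed),
                }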

noddybear:
Oh, super interesting! Create 10 scenarios containing working factories, and ‘drop out’ entities to break the factory in different ways. Great idea.

infogulch:
Yes exactly! This approach can generate hundreds of "fix the problem"-type tests very easily. With some creative thinking I suspect you can use variations to stack multipliers on other types of tests as well.