
749 points | noddybear | 1 comment

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order of magnitude of resources they can produce, creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
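To make the Lua-over-TCP idea concrete, here is a minimal sketch of the wrapping pattern: a Python function builds a Factorio console command (`/silent-command` and `game.surfaces[1].create_entity` are real Factorio console/Lua API constructs, but the function names and the raw-socket transport here are illustrative, not FLE's actual implementation):

```python
import socket

def lua_place_entity(name: str, x: int, y: int) -> str:
    # Build a Lua snippet for the Factorio console; /silent-command
    # suppresses the chat echo of the command.
    return (f"/silent-command game.surfaces[1].create_entity"
            f"{{name='{name}', position={{x={x}, y={y}}}}}")

def send_command(host: str, port: int, lua: str) -> None:
    # Hypothetical transport: FLE's real client manages the connection
    # and reads structured responses back for the agent.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(lua.encode() + b"\n")
```

The point is the layering: agents never write Lua themselves; they call type-hinted Python functions, and each function compiles down to a console command.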

Agents interact with FLE through a REPL pattern:

1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
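The core of such a REPL step can be sketched in a few lines (this is a generic illustration of the observe/act/feedback loop, not FLE's actual executor):

```python
import io
import traceback
from contextlib import redirect_stdout

def run_step(agent_code: str, namespace: dict) -> str:
    """Execute one agent action and return the feedback the agent
    observes on its next turn: captured stdout, plus any exception
    traceback instead of crashing the loop."""
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(agent_code, namespace)
    except Exception:
        buf.write(traceback.format_exc())
    return buf.getvalue()
```

Feeding exceptions back as observations (rather than terminating) is what lets the agent debug its own code across turns.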

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: an unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude 3.5 Sonnet is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

gglon No.43335216
I was thinking: to build a large, efficient factory autonomously, one could use an LLM as a high-level agent that drives specialized tools. The overall strategy would perhaps look like the following:

1. create an (intermediate) goal for resource production

2. create a factory graph with the calculated number of machines and the quantities of resources to transport between them. This can be done with linear programming (as in a Factorio calculator)

3. somehow map the resulting graph to a hardware description language, such that each entity maps to a unique logic component and each transport lane maps to a unique wire (the most difficult step)

4. compile to a 2D FPGA layout using the standard VLSI algorithms such as partitioning and routing (an HDL compiler)

5. map the resulting plan back to a concrete Factorio design
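Step 2 above is essentially a production-ratio solve: propagate a target output rate down the recipe graph and count machines per recipe. A minimal sketch (the recipe rates and inputs below are made-up illustrative numbers, not Factorio's real recipes, and a real calculator would use an LP solver rather than this simple recursion):

```python
# Illustrative recipe table: outputs per second for one machine, and
# inputs consumed per output item.
RECIPES = {
    "electronic-circuit": {"rate": 0.75,
                           "inputs": [("iron-plate", 1.0), ("copper-cable", 3.0)]},
    "copper-cable": {"rate": 2.0, "inputs": [("copper-plate", 0.5)]},
}

def machines_needed(item, target_per_sec, plan=None):
    """Propagate a production target down the recipe graph and
    accumulate (fractional) machine counts per recipe."""
    plan = {} if plan is None else plan
    recipe = RECIPES.get(item)
    if recipe is None:          # raw resource: mined, not crafted here
        return plan
    plan[item] = plan.get(item, 0.0) + target_per_sec / recipe["rate"]
    for input_item, amount in recipe["inputs"]:
        machines_needed(input_item, target_per_sec * amount, plan)
    return plan
```

The output of such a solve is exactly the weighted graph that steps 3-4 would then place and route.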

jkhdigital No.43339247
This is exactly what I've been thinking as I see LLMs being applied to all these complex problem domains. Humans did not conquer the world because our intelligence can solve every problem; we did it by using our intelligence to (1) break down complex problems into small, manageable pieces and (2) design tools and machines that were exceptionally good at efficiently solving those subproblems.

The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)

The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk), which is horrendously inefficient.

Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).
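The trivia example above makes the handoff concrete: once the LLM has framed the question as "find a capital whose letters are an anagram of a greeting," the anagram check itself is a one-line subproblem for a machine, rather than something to shuffle letter by letter in a chain of thought. A sketch:

```python
from collections import Counter

def is_anagram(a: str, b: str) -> bool:
    # The "subcontractor": an exact, cheap check the LLM can delegate
    # instead of simulating it in self-talk.
    return Counter(a.lower()) == Counter(b.lower())
```

The model's job reduces to generating candidate (capital, greeting) pairs and calling the checker.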