749 points noddybear | 34 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
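
To give a rough idea of what "piping Lua into the console over TCP" can look like on the wire, here is a minimal sketch. It assumes the game's RCON port uses the Source RCON packet layout and that Lua is sent as a console command with an `/sc` prefix. The helper names (`build_packet`, `parse_packet`, `lua_command`) are made up for illustration and are not FLE's actual code.

```python
import struct

# Packet types from the Source RCON protocol (assumed to apply to Factorio's RCON port).
SERVERDATA_AUTH = 3
SERVERDATA_EXECCOMMAND = 2


def build_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Encode one RCON packet: size, id, type, body, then two NUL terminators."""
    payload = body.encode("utf-8")
    # The size field counts everything after itself: id (4) + type (4) + body + 2 NULs.
    size = 4 + 4 + len(payload) + 2
    return struct.pack("<iii", size, request_id, packet_type) + payload + b"\x00\x00"


def parse_packet(data: bytes) -> tuple[int, int, str]:
    """Decode one complete RCON packet into (request_id, packet_type, body)."""
    size, request_id, packet_type = struct.unpack_from("<iii", data, 0)
    body = data[12 : 4 + size - 2].decode("utf-8")
    return request_id, packet_type, body


def lua_command(lua_source: str) -> str:
    """Wrap Lua source as a console command (hypothetical helper; '/sc' prefix assumed)."""
    return "/sc " + lua_source


packet = build_packet(1, SERVERDATA_EXECCOMMAND, lua_command("rcon.print(game.tick)"))
```

In a real client the packet would be written to a TCP socket after an authentication packet, and the reply decoded with something like `parse_packet`. The sketch stops short of any networking.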

Agents interact with FLE through a REPL pattern:
1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
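
The observe / act / feedback cycle above can be sketched in a few lines. This is only an illustration of the general REPL pattern, not FLE's implementation: `run_step` is a hypothetical name, and a real environment would sandbox the code and expose game tools in the namespace.

```python
import contextlib
import io
import traceback


def run_step(code: str, namespace: dict) -> str:
    """Execute one agent-written program and return what the agent would observe next."""
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # the namespace persists between steps, like a REPL
    except Exception:
        # Exceptions are returned as feedback instead of crashing the loop.
        return stdout.getvalue() + traceback.format_exc()
    return stdout.getvalue()


namespace: dict = {}
first = run_step("x = 2\nprint(x * 21)", namespace)
second = run_step("print(x)\nprint(undefined_name)", namespace)
```

Here `first` holds the captured stdout, while `second` holds the partial stdout followed by the traceback of the `NameError`, which is what the agent would see before writing its next program.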

We provide two main evaluation settings:
- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:
- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

1. WJW ◴[] No.43332084[source]
Very cool and also pretty expected results tbh. Some thoughts:

Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.

It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.

I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.

replies(9): >>43332165 #>>43332202 #>>43332340 #>>43332409 #>>43332816 #>>43333224 #>>43333259 #>>43333347 #>>43333353 #
2. accurrent ◴[] No.43332165[source]
What's very interesting is whether we could use LLMs to generate GOFAI methods. It's often not at all obvious how to do so. That being said, it's still hard to express goals and resources to LLMs in natural language. I've been trying different things, and none of them works well enough for me to call it a step improvement. It's also hard to come up with a dataset for these use cases.
replies(1): >>43332217 #
3. noddybear ◴[] No.43332202[source]
I definitely agree that planning is essential to perform well in Factorio - my hope is that we can create agents in FLE that can better front-load the planning part, as well as create utility functions for future use - such that as the agent progresses, it can do more and more in each program / step. For example, it could create a function called 'resolve_resource_dependencies', which would enable it to backfill missing resources in order to proceed.

LLMs tend to build themselves into corners here quite often. Basically, if they break the topology (e.g enclose their factory in pipes) they struggle to reason over it and correct it. My basic view on this is that there exists some set of functions/data-structures that they can design in FLE, which will give them a better view over their factory to enable scaling (if the models take a step back to consider it).

We currently do track SPM, but decided against making that our main metric, as it zeroes out in the early stages. We use 'production score' instead, which is a more generalised metric that just captures total production (multiplied by an item-price).
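
A minimal sketch of a metric of this shape, assuming a flat per-item price table. The prices and the `production_score` name are made up for illustration and are not FLE's actual values or code.

```python
# Hypothetical item prices; the real values used by FLE are not given here.
PRICES = {"iron-plate": 1.0, "copper-cable": 0.5, "electronic-circuit": 4.0}


def production_score(produced: dict[str, int], prices: dict[str, float]) -> float:
    """Sum of (quantity produced * item price) over all items."""
    return sum(count * prices[item] for item, count in produced.items())


early_game = {"iron-plate": 200, "copper-cable": 100}
score = production_score(early_game, PRICES)
```

The early-game factory above has produced no science packs, so its SPM would be zero, but it still gets a non-zero production score.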

There was a cool paper that came out a few years ago using meta-heuristics to do this, (https://arxiv.org/abs/2102.04871), but I reckon the combinatorial complexity of large factories makes it challenging to solve beyond trivial factories.

It's worth noting that agents in FLE can write their own libraries etc, so a dominant strategy could be for an LLM agent to implement a solver in Python to do the heavy lifting. This is quite far from current capabilities though.

replies(1): >>43332326 #
4. noddybear ◴[] No.43332217[source]
FLE agents technically can implement their own Python libraries to leverage GOFAI to do the heavy lifting. None has actually attempted this yet though. It would be interesting to see if this can be achieved just by modifying the manual given to the agents to bias in favour of this approach.
replies(1): >>43332302 #
5. accurrent ◴[] No.43332302{3}[source]
That does sound interesting. I might attempt it. Thanks for this benchmark, I could totally use it for my PhD (I started with GOFAI, but have hit a dead end. My advisor is suggesting pivoting into using LLMs to call my GOFAI framework).
replies(1): >>43332414 #
6. WJW ◴[] No.43332326[source]
An agent writing its own library to interface with a good solver like Z3 (or even writing some basic planning algorithms itself) seems like the epitome of a "costly long term investment that does nothing for the short term". The only thing I can see overcoming such problems are deep search trees, but AFAIK that is not how LLMs work at all.
replies(1): >>43332757 #
7. lupusreal ◴[] No.43332340[source]
I'm not convinced that factorio requires planning ahead for computer players. For human players it certainly does, because tearing up your factory and rebuilding to fix shortsighted designs has a steep time/labor cost. Even for human players though, this cost becomes mostly a psychological obstacle once you get construction bots.
replies(2): >>43332416 #>>43332880 #
8. eterm ◴[] No.43332409[source]
> Building a main bus vs spaghetti belts is one of the obvious examples here

I'm an anti-bus extremist. ( I've even considered registering BanTheBus.com and doing an over-the-top static anti-bus website. ), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )

Busses are the opposite of good factorio factories. They undo a lot of the benefits of a healthy just-in-time (JIT) manufacturing, by encouraging massive amounts of buffer (belt-buffer).

They also encourage people to anti-learn fundamental principles. You often see people do "starter-busses" with 4 lanes of iron plates, but only fed by one actual belt worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.

They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.

Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you don't have enough input, then build more of that input. Then repeat as necessary.

There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overproduction" is stopped at source, because assemblers don't produce if they have nowhere to output into.

The biggest classic beginner mistakes in factorio are:

- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )

- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.

- Looking for guides and discovering "The Main Bus".

That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.

Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.

replies(9): >>43332549 #>>43332651 #>>43332833 #>>43333131 #>>43333490 #>>43333504 #>>43333743 #>>43335367 #>>43377310 #
9. noddybear ◴[] No.43332414{4}[source]
Feel free to create an issue in the repo - am totally happy to help however I am able! I think that the only change you'll have to make is to expose your GoFAI framework in the 'Namespace' object which the agents have access to (for them to call it directly). Alternatively you could design a new tool which takes in game objects and generates a solution / typed object output.
10. malfist ◴[] No.43332416[source]
The biggest cost in factorio by far is the human time cost of setting up logistics, building factories and mining outposts.

To an LLM those probably aren't even costs

replies(1): >>43332487 #
11. asah ◴[] No.43332487{3}[source]
seems like we just need to add that 'cost' to the agents...
replies(1): >>43332569 #
12. noddybear ◴[] No.43332549[source]
While I agree that busses incentivise over-production, my only argument in their favour stems from SWE design best-practices of maximising cohesion and minimising coupling. The benefit of a bus is that it makes it easier to create factory districts that can be scaled independently of everything else. I suppose the question is how independent/integrated should these districts be? Should they only create ore? Or create utility-science-packs?
13. noddybear ◴[] No.43332569{4}[source]
We do actually track this. Each action an agent takes consumes 'ticks', which is how long it would take a human player to do it. For example, moving 10 tiles to the right takes something like 300 ticks (5 seconds).
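
For the arithmetic, a tiny sketch assuming the default rate of 60 ticks per second. The per-action costs in the list are made up, and the helper names are hypothetical rather than FLE's.

```python
TICKS_PER_SECOND = 60  # Factorio's default simulation rate


def ticks_to_seconds(ticks: int) -> float:
    """Convert game ticks to seconds of equivalent human play time."""
    return ticks / TICKS_PER_SECOND


def total_ticks(action_costs: list[int]) -> int:
    """Total elapsed ticks for a sequence of actions (costs are hypothetical)."""
    return sum(action_costs)


episode = [300, 60, 120]  # e.g. a move, a placement, a craft; made-up costs
elapsed = ticks_to_seconds(total_ticks(episode))
```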

However, our results only loosely indicate that smarter models use fewer ticks. I have an intuition that we will get more signal for tick-efficiency as models improve.

14. zelias ◴[] No.43332651[source]
I dunno, main bus strikes me as an efficient way to feed all the mall assemblers in the early game before I can scale up to a train based base.

After the early game, totally agreed, ban the bus.

replies(1): >>43333015 #
15. noddybear ◴[] No.43332757{3}[source]
I experimented a bit using deep search trees to find better Factorio trajectories (MCTS), which worked somewhat well. Unfortunately, it's very computationally expensive, and probably only makes sense in a training context (i.e gathering trajectories to train a model in a supervised setting).
16. mrighele ◴[] No.43332816[source]
> the game is about eventually building thousands of the new item.

I disagree that you need a significant amount of thinking ahead. At the beginning spaghetti belt is fine, as you have few resources and you don't have the luxury of overbuilding. Once you start getting "bigger" and into more complex designs you can just leave what you already built how it is and build the new stuff somewhere else.

By the time you need to produce thousands of pieces of an item you can probably prepare a blueprint that builds the whole factory in a click.

My approach to Factorio is built on phases:

1: build ad hoc infrastructure for the specific material that I need, close to the raw resources

2: prepare blueprints for specific resources, so that if I need more of something I can just build an extra factory. I make the blueprints so that I can compose them, like input belts on one side and output belt on the other. Such "factories" are almost self-contained, as in they get only a subset of materials (plates, plastic and stuff that involves liquids) and produce all the intermediate materials. This leaves some optimizations on the table, but simplifies the logistics. Use trains to fetch resources from far away.

3: compose the blueprints of the previous step to make "megafactories" with stations included. While at step 2 input and output of the factories are belts, at this step the input/output are train stations for specific material (with proper names, so I can add a new factory and trains will start delivering materials right away)

Of course my approach is not the only possible and probably not even efficient. I play for fun, with no care for the time it takes, as long as the time spent is enjoyable.

replies(1): >>43333240 #
17. aithrowawaycomm ◴[] No.43332833[source]
I think the point of “The Main Bus” in guides is that it’s easy for newer players, and thereby takes a lot of complexity off the table when you still haven’t really figured out how petroleum works, or keep falling behind the biters because you underestimated red bullets demand for steel, etc etc etc. Eventually you figure out trains; until then a main bus is an idiot-resistant way to carry resources across the entire base.
18. aithrowawaycomm ◴[] No.43332880[source]
Long-term planning is necessary if you have biters enabled: typically you need to secure territory/resources and invest in defenses before the resources run low and while the biters are still manageable. Otherwise things can get badly out of hand.

Edit: IMO the biggest difference between Satisfactory and Factorio is that Satisfactory has no crises. If a Satisfactory base shuts down it is annoying, but you can dig another miner / build another plant / etc, entirely at your leisure. But in Factorio, a shutdown is an emergency with a ticking clock.

replies(1): >>43333197 #
19. eterm ◴[] No.43333015{3}[source]
Early game, I don't use trains either, I just spaghetti yellow belt around! I even run long belt lines to the first ring of mining outposts. It usually only costs ~500-1000 belt to do so, which sounds like a lot but really isn't much.

Advantage of yellow belt:

- Don't consume power
- Don't produce pollution ( don't attract bugs )
- Are cheap. Rails themselves aren't too bad, costing only slightly more per tile than belts. But rails need stations, signals, chests, inserters for loading / unloading. Often combinators too. All add up to much higher investment, and critically in the early game, add up to more pollution produced before the pay-off of improved resources incoming.
- Don't need lots of belt for the stations. A typical loading/unloading station can often have so much belt for "efficient" (fast) loading/unloading that you could have belt half your way to where you're going just laying that belt in a straight line.

If you're going further than the first ring of extra resources, then trains are amazing, but there are more than enough resources several times over in the first ring of resources to get a rocket launched. The first expansion ring tends to have 500k-1Mil per patch, and have several patches, so there's no need to go miles out pre-rocket.

It takes surprisingly little raw resources to actually launch a rocket. Someone did the maths once and calculated completing the game as needing a minimum of ~500k Iron ore. At 15/s, a single yellow belt can deliver that in under 10 hours, way below the time of a typical playthrough. This technically means a single smelting line is all you need to actually complete the game still in a reasonable time. Of course, trying to do so from a single lane would be extremely painful, and need a lot of attention to preventing over-production of intermediates, especially when it's much easier to make a few smelting lanes. I'm not recommending that, but I am recommending just slapping down new lanes and production wherever you feel like it and whenever you need, rather than pre-planning 4-lanes of plates that aren't actually useful or efficient.
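
The belt arithmetic is easy to check. This just reuses the two rough figures quoted above (about 500k ore, 15 items/s) and is not a precise accounting of the game's requirements.

```python
IRON_ORE_NEEDED = 500_000   # rough figure quoted above
YELLOW_BELT_RATE = 15       # items per second, as quoted above

seconds = IRON_ORE_NEEDED / YELLOW_BELT_RATE
hours = seconds / 3600
```

That works out to a little over nine hours of continuous delivery on one belt.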

Unless you mean your "main bus" is just 1x copper and 2x iron lane, in which case fair enough, but when I attack the concept of the bus, that's not what I'm railing against. What I'm railing against is the design pattern where people put down 4 lots of 4-wide lanes to bus far more iron, copper and intermediates than will ever be needed to actually launch a rocket.

Busses aren't efficient by any metric, other than minimising personality, creativity and thinking.

With a bus you just follow the same template done before, it'll get you to the end, but it teaches bad habits and isn't efficient. It isn't quick either, you'll spend a long time putting down lanes and splitters and undergrounds. All of which need producing.

For late-game (post-rocket), then either a bus design or train / city-block designs can work well. I prefer trains, but large busses have their place for mega-bases.

But for pre-rocket, or anyone starting out, spaghetti is absolutely the way to go. It'll also better teach you via your own mistakes.

20. vessenes ◴[] No.43333131[source]
Lots of ways to play. That said, there is a significant amount of tooling to support the style of “stamping” working units down. In fact, I’d argue it’s a key part of the game in that you start with small working units, and the game helps you scale them up to midsize (say on the map view you can still see the parts) and then again to large scale (you can only see blocks on the map). This is why the blueprint tool allows offsetting and grid refinement.

So, I almost never build a large main bus style base, but it’s a fun (and helpful) part of the game’s design and tooling to allow you to create very large working systems out of a variety of component scales — “buses” enable this, and make it MUCH easier to implement.

21. martbakler ◴[] No.43333197{3}[source]
We were thinking of creating a minigame resembling a "tower-defense" setting, where waves of bugs get released and the agent needs to create appropriate defenses. It would be interesting to see if agents are capable of defending the base and how many resources they would put towards defenses in a normal game where enemies are enabled.
22. Philpax ◴[] No.43333224[source]
> It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set.

I'm not sure we can conclude the game's not in their training set - a lot has been written about Factorio on the internet, and a lot has been recorded; the newer multimodal LLMs have trained on YouTube, especially Gemini.

replies(1): >>43334140 #
23. bombcar ◴[] No.43333240[source]
You can certainly build with a main bus, and segmented factories doing what they do in perfect Nilaus city blocks. It's quite like perfectly designed and planned code; though you run the risk of it becoming just a blueprint plopping game.

But it (for me at least) is so much more fun building the spaghetti and making things work, refactoring as you go, and expanding organically.

24. kortilla ◴[] No.43333259[source]
A bus is absolutely not needed. I played through many early games and got to the launch without building a main bus.
25. ◴[] No.43333347[source]
26. ◴[] No.43333353[source]
27. infogulch ◴[] No.43333490[source]
I agree with everything you said, but I'd like to defend the main bus as a learning phase.

The main bus is for production capacity discovery. New players have to discover the production tree somehow and browsing recipe cards with a calculator before starting is not the way, it has to be hands-on and interactive. Dividing up your factory into (arbitrarily defined) intermediates production and final product production is helpful to cut through the complexity. Of course, intermediate production will be severely underbuilt and logistics capacity will never keep up with consumption of all of the final product builds, but it will be enough to run a couple builds at once and test new ones.

Once new players have a grasp of the scope of the production tree, with tangible examples of sub-factories, then they can begin to consider end-to-end production capacity.

28. sdwr ◴[] No.43333504[source]
The most challenging part of factorio is routing inputs and outputs in 2D. If you build naively, intersections between ingredient belts choke the life out of your factory (3 utilities problem). Bus guarantees building space with easy access to all ingredients.

Sure, if you already know what you're doing, you can plan ahead and leave space for the right things. For everyone else, it simplifies the game to a manageable level.

29. tavavex ◴[] No.43333743[source]
I've played other factory-building games, but not Factorio, so I'm not familiar with the bus-building paradigm. I feel like you're saying that buses would incentivize bad practices, but at the same time I don't see what would make them inherently bad. Whenever I saw screenshots of Factorio, I thought that buses were more of a logistics tool, a way to cable-manage the delivery of stuff from one place to another. Is this wrong? I feel like, if you have more consumers than producers (and end up having to rely on buffering), then you've got a big problem regardless of whether you have a bus or not - a sufficiently long belt from an ore deposit etc could replicate the big-buffer problem in the same way. I don't think I'd use buses, I like a bit of chaos, but still, I'm not sure if they're that bad.
replies(1): >>43334282 #
30. noddybear ◴[] No.43334140[source]
The models have some understanding of the game and the initial build orders that they should adopt. What they don't remember is:
1) How many resources each item costs to make
2) Spatial gotchas such as a belt requiring an inserter to load/offload from.
31. eterm ◴[] No.43334282{3}[source]
Buffer is bad precisely because it increases the lag between having a gap in provision, and that gap being obvious to the player.

With no buffering, as soon as your demand for steel is greater than production of steel, then the bottleneck is immediate and obvious. The solution is also immediate and obvious: Build more steel.

Buffering, in particular belt buffering, and in particular busses, contribute to mask this issue. There can be a great delay between increasing consumption above production, and so the root cause can be very hidden. It may also be that the ultimate root cause is that steel production is low because it's limited on how much iron ore it gets. If everything is bussed, then it can be hours before resource constraints are hit, by which point it's very hard to see what's happened to cause the shortage, and also by which time the factory may have expanded further.
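
To put rough numbers on that lag, here is a toy model: a buffer drains at the rate by which consumption exceeds production, so the shortage only shows once the buffer is empty. The function name and all figures are hypothetical and only illustrate the shape of the effect.

```python
def seconds_until_shortage(buffered_items: float, production: float, consumption: float) -> float:
    """Time until a buffer runs dry when consumption exceeds production (rates in items/s)."""
    deficit = consumption - production
    if deficit <= 0:
        return float("inf")  # production keeps up, the buffer never drains
    return buffered_items / deficit


short_belt = seconds_until_shortage(buffered_items=50, production=10, consumption=12)
long_bus = seconds_until_shortage(buffered_items=8_000, production=10, consumption=12)
```

With the same 2 items/s deficit, the short belt exposes the problem in under half a minute, while the heavily buffered bus hides it for over an hour.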

It also contributes to see-saw production, whereby a shortage in one area causes a pause, which alleviates the root cause shortage for a while by backing up other production. The longer the lag between cause and effect, the greater the banding effect, further masking the root cause.

A bus also encourages bottom-up, which further encourages massive over-consumption of base resources. If you start building green chip production and bussing it, the bus may well fill and buffer, making it look like you've got plenty of green chips. In turn, as the green chip production stops, it'll look like you've got plenty of iron plates. You'll build all your malls and other production, satisfied as you build each that they run fine and are not over-consuming.

Only when later everything starts to run at once you realise that the stuff down the end of the line is getting scant resources, as previously each part was running in isolation before serious amounts were required.

In contrast, a top-down approach involves building the final result first, then at each step building what's needed to feed it. This ensures that there is always enough provision, and everything can be placed to minimise buffer to reduce lag and improve feedback time on problems. It also reduces pollution since any item on a belt represents inventory for which you've paid a pollution cost but not got any final results from yet.

The spaghetti approach can lead to "under-utilised" buildings, such as a smelting array that ends up only needing to supply 0.3 of a belt. But in factorio space is almost endless, and there's little to no cost to idle buildings. The power drain of idle assemblers, particularly the bare (no module) level 2 buildings you'll likely be building before end-game, is extremely low.

For late game post-rocket, this changes of course. With beacons and level 3 assemblers with modules, the idle draw is significant, and you may want to optimise ratios and look to eliminate how many assemblers you run idle. ( That said, power is almost a non-issue in 2.0 with nuclear power being much easier to run efficiently than previously, so the large solar fields aren't really needed anymore. )

Busses have a strong visual appeal, but unlike "cable management", there's no airflow to consider in factorio. A messy spaghetti base isn't inherently inefficient. It doesn't affect productivity to just run short belts all over.

The visual temptation of the mega-bus is clearly alluring, it looks good on youtube video guides.

replies(1): >>43334623 #
32. tavavex ◴[] No.43334623{4}[source]
That makes sense. I guess I just had a different approach when I played the other games. The way I organized in other factory games is that the considerations of input and output were things that I thought of upfront - I never eyeballed and then tried to estimate the production speed based on how fast my resources were drained. I might be overplanning, or maybe Factorio encourages a far more chaotic approach, but I always treated factories as black boxes that take X/s of certain items and outputted X/s results. Knowing precisely how many items per second I have on any individual belt is the most essential piece of knowledge to me, so I never relied on buffering and always made sure to build consumer factories that never overwhelmed producer factories. This means that the visual indication of the buffer draining would only signal some building mistake to me, rather than a design mistake.
33. lupusreal ◴[] No.43335367[source]
I delete my bus as soon as I get bots (in favor of a train base feeding a bot mall), but I've found that a small and not overly strict bus is the fastest way, for me, to get bots unlocked.
34. deterministic ◴[] No.43377310[source]
I have tried different approaches and ended up with a single small bus of raw materials (coal, copper, iron, stone) with everything else hanging off it. It scales amazingly well and avoids spaghetti layouts.

Oil stuff is done separately and fed into the structure where needed.