749 points noddybear | 13 comments

I'm Jack, and I'm excited to share a project that has channeled my Factorio addiction recently: the Factorio Learning Environment (FLE).

FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.

A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.

The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.

Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
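As a rough illustration of the first point: Factorio exposes an RCON console that uses the Source RCON packet layout (length prefix, request id, packet type, null-terminated body). The sketch below only builds such a packet and does not open a socket. It assumes RCON is the TCP channel meant here, which the post does not state; `build_rcon_packet` and the example Lua command are hypothetical names for illustration, not FLE's actual code.

```python
import struct

SERVERDATA_AUTH = 3         # packet type: authenticate with the RCON password
SERVERDATA_EXECCOMMAND = 2  # packet type: run a console command


def build_rcon_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Build one Source-RCON-style packet (hypothetical helper, not from FLE)."""
    # Payload: little-endian id and type, the body, then two null bytes.
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"
    )
    # The length prefix counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


# Assumed example: a console command that runs a one-line Lua snippet.
lua_command = "/c rcon.print(game.tick)"
packet = build_rcon_packet(1, SERVERDATA_EXECCOMMAND, lua_command)
```

A client would first send a `SERVERDATA_AUTH` packet with the password, then send command packets like the one above over the same connection.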

Agents interact with FLE through a REPL pattern:

1. They observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
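The loop above can be sketched in a few lines of plain Python. This is a minimal illustration of the observe / act / feedback pattern, assuming the agent's program is a string executed in a persistent namespace; `run_step` is a hypothetical name and none of this is FLE's real API.

```python
import contextlib
import io
import traceback


def run_step(code: str, namespace: dict) -> str:
    """Execute agent-written code; return its stdout plus any traceback text."""
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # state persists in `namespace` between steps
    except Exception:
        # Exceptions become part of the observation instead of ending the loop.
        stdout.write(traceback.format_exc())
    return stdout.getvalue()


namespace: dict = {}
first_observation = run_step("x = 2\nprint(x + 1)", namespace)
second_observation = run_step("print(x / 0)", namespace)
```

Here the second step fails, and the traceback text is what the agent would see as its next observation.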

We provide two main evaluation settings:

- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map

We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).

The code is available at https://github.com/JackHopkins/factorio-learning-environment.

You'll need:

- Factorio (version 1.1.110)
- Docker
- Python 3.10+

The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.

We would love to hear your thoughts and see what others can do with this framework!

WJW ◴[] No.43332084[source]
Very cool and also pretty expected results tbh. Some thoughts:

Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.

It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM (science per minute).

I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.
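The "work backwards" step can be sketched as a recursive walk over a recipe table. This is a minimal sketch of the demand side only; it says nothing about layout or the optimisation step. The recipe table is a small simplified subset chosen for illustration, and `raw_demand` is a hypothetical name.

```python
from collections import defaultdict

# Illustrative recipe table: ingredients needed per one unit of output.
# Items that do not appear as keys are treated as raw inputs.
RECIPES = {
    "automation_science_pack": {"copper_plate": 1, "iron_gear_wheel": 1},
    "iron_gear_wheel": {"iron_plate": 2},
}


def raw_demand(item: str, rate: float, totals=None):
    """Accumulate the raw-input rates needed to make `item` at `rate`."""
    if totals is None:
        totals = defaultdict(float)
    recipe = RECIPES.get(item)
    if recipe is None:
        totals[item] += rate  # raw input: record the demand and stop
        return totals
    for ingredient, amount in recipe.items():
        raw_demand(ingredient, rate * amount, totals)
    return totals


# Target: 600 packs per minute (rates are per minute throughout).
demand = raw_demand("automation_science_pack", 600)
```

With these assumed recipes, 600 packs per minute needs 600 copper plates and 1200 iron plates per minute.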

replies(9): >>43332165 #>>43332202 #>>43332340 #>>43332409 #>>43332816 #>>43333224 #>>43333259 #>>43333347 #>>43333353 #
1. eterm ◴[] No.43332409[source]
> Building a main bus vs spaghetti belts is one of the obvious examples here

I'm an anti-bus extremist. ( I've even considered registering BanTheBus.com and doing an over-the-top static anti-bus website. ), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )

Busses are the opposite of good factorio factories. They undo a lot of the benefits of healthy just-in-time (JIT) manufacturing, by encouraging massive amounts of buffer (belt-buffer).

They also encourage people to anti-learn fundamental principles. You often see people do "starter-busses" with 4 lanes of iron plates, but only fed by one actual belt worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.

They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.

Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you don't have enough input, then build more of that input. Then repeat as necessary.

There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overproduction" is stopped at source, because assemblers don't produce if they have nowhere to output into.

The biggest classic beginner mistakes in factorio are:

- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )

- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.

- Looking for guides and discovering "The Main Bus".

That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.

Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.

replies(9): >>43332549 #>>43332651 #>>43332833 #>>43333131 #>>43333490 #>>43333504 #>>43333743 #>>43335367 #>>43377310 #
2. noddybear ◴[] No.43332549[source]
While I agree that busses incentivise over-production, my only argument in their favour stems from SWE design best-practices of maximising cohesion and minimising coupling. The benefit of a bus is that it makes it easier to create factory districts that can be scaled independently of everything else. I suppose the question is how independent/integrated should these districts be? Should they only create ore? Or create utility-science-packs?
3. zelias ◴[] No.43332651[source]
I dunno, main bus strikes me as an efficient way to feed all the mall assemblers in the early game before I can scale up to a train based base.

After the early game, totally agreed, ban the bus.

replies(1): >>43333015 #
4. aithrowawaycomm ◴[] No.43332833[source]
I think the point of “The Main Bus” in guides is that it’s easy for newer players, and thereby takes a lot of complexity off the table when you still haven’t really figured out how petroleum works, or keep falling behind the biters because you underestimated red bullets demand for steel, etc etc etc. Eventually you figure out trains; until then a main bus is an idiot-resistant way to carry resources across the entire base.
5. eterm ◴[] No.43333015[source]
Early game, I don't use trains either, I just spaghetti yellow belt around! I even run long belt lines to the first ring of mining outposts. It usually only costs ~500-1000 belt to do so, which sounds like a lot but really isn't much.

Advantages of yellow belt:

- Don't consume power
- Don't produce pollution ( don't attract bugs )
- Are cheap. Rails themselves aren't too bad, costing only slightly more per tile than belts. But rails need stations, signals, chests, inserters for loading / unloading. Often combinators too. All add up to much higher investment, and critically in the early game, add up to more pollution produced before the pay-off of improved resources incoming.
- Don't need lots of belt for the stations. A typical loading/unloading station can often have so much belt for "efficient" (fast) loading/unloading that you could have belt half your way to where you're going just laying that belt in a straight line.

If you're going further than the first ring of extra resources, then trains are amazing, but there are more than enough resources several times over in the first ring of resources to get a rocket launched. The first expansion ring tends to have 500k-1Mil per patch, and have several patches, so there's no need to go miles out pre-rocket.

It takes surprisingly little raw resources to actually launch a rocket. Someone did the maths once and calculated completing the game as needing a minimum of ~500k Iron ore. At 15/s, a single yellow belt can deliver that in under 10 hours, way below the time of a typical playthrough. This technically means a single smelting line is all you need to actually complete the game still in a reasonable time. Of course, trying to do so from a single lane would be extremely painful, and need a lot of attention to preventing over-production of intermediates, especially when it's much easier to make a few smelting lanes. I'm not recommending that, but I am recommending just slapping down new lanes and production wherever you feel like it and whenever you need, rather than pre-planning 4-lanes of plates that aren't actually useful or efficient.
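The belt arithmetic above is easy to check. The two inputs are the figures quoted in the comment (roughly 500k ore, 15 items per second on one yellow belt); the variable names are only for illustration.

```python
ORE_NEEDED = 500_000   # approximate iron ore for a rocket, as quoted above
BELT_RATE_PER_S = 15   # yellow belt throughput in items per second, as quoted

seconds = ORE_NEEDED / BELT_RATE_PER_S
hours = seconds / 3600  # a little over 9 hours, so "under 10 hours" holds
```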

Unless you mean your "main bus" is just 1x copper and 2x iron lane, in which case fair enough, but when I attack the concept of the bus, that's not what I'm railing against. What I'm railing against is the design pattern where people put down 4 lots of 4-wide lanes to bus far more iron, copper and intermediates than will ever be needed to actually launch a rocket.

Busses aren't efficient by any metric, other than minimising personality, creativity and thinking.

With a bus you just follow the same template done before, it'll get you to the end, but it teaches bad habits and isn't efficient. It isn't quick either, you'll spend a long time putting down lanes and splitters and undergrounds. All of which need producing.

For late-game (post-rocket), then either a bus design or train / city-block designs can work well. I prefer trains, but large busses have their place for mega-bases.

But for pre-rocket, or anyone starting out, spaghetti is absolutely the way to go. It'll also better teach you via your own mistakes.

6. vessenes ◴[] No.43333131[source]
Lots of ways to play. That said, there is a significant amount of tooling to support the style of “stamping” working units down. In fact, I’d argue it’s a key part of the game in that you start with small working units, and the game helps you scale them up to midsize (say on the map view you can still see the parts) and then again to large scale (you can only see blocks on the map). This is why the blueprint tool allows offsetting and grid refinement.

So, I almost never build a large main bus style base, but it’s a fun (and helpful) part of the game’s design and tooling to allow you to create very large working systems out of a variety of component scales — “buses” enable this, and make it MUCH easier to implement.

7. infogulch ◴[] No.43333490[source]
I agree with everything you said, but I'd like to defend the main bus as a learning phase.

The main bus is for production capacity discovery. New players have to discover the production tree somehow and browsing recipe cards with a calculator before starting is not the way, it has to be hands-on and interactive. Dividing up your factory into (arbitrarily defined) intermediates production and final product production is helpful to cut through the complexity. Of course, intermediate production will be severely underbuilt and logistics capacity will never keep up with consumption of all of the final product builds, but it will be enough to run a couple builds at once and test new ones.

Once new players have a grasp of the scope of the production tree, with tangible examples of sub-factories, then they can begin to consider end-to-end production capacity.

8. sdwr ◴[] No.43333504[source]
The most challenging part of factorio is routing inputs and outputs in 2D. If you build naively, intersections between ingredient belts choke the life out of your factory (3 utilities problem). Bus guarantees building space with easy access to all ingredients.

Sure, if you already know what you're doing, you can plan ahead and leave space for the right things. For everyone else, it simplifies the game to a manageable level.

9. tavavex ◴[] No.43333743[source]
I've played other factory-building games, but not Factorio, so I'm not familiar with the bus-building paradigm. I feel like you're saying that buses would incentivize bad practices, but at the same time I don't see what would make them inherently bad. Whenever I saw screenshots of Factorio, I thought that buses were more of a logistics tool, a way to cable-manage the delivery of stuff from one place to another. Is this wrong? I feel like, if you have more consumers than producers (and end up having to rely on buffering), then you've got a big problem regardless of whether you have a bus or not - a sufficiently long belt from an ore deposit etc could replicate the big-buffer problem in the same way. I don't think I'd use buses, I like a bit of chaos, but still, I'm not sure if they're that bad.
replies(1): >>43334282 #
10. eterm ◴[] No.43334282[source]
Buffer is bad precisely because it increases the lag between having a gap in provision, and that gap being obvious to the player.

With no buffering, as soon as your demand for steel is greater than production of steel, then the bottleneck is immediate and obvious. The solution is also immediate and obvious: Build more steel.

Buffering, in particular belt buffering, and in particular busses, contribute to mask this issue. There can be a great delay between increasing consumption above production, and so the root cause can be very hidden. It may also be that the ultimate root cause is that steel production is low because it's limited on how much iron ore it gets. If everything is bussed, then it can be hours before resource constraints are hit, by which point it's very hard to see what's happened to cause the shortage, and also by which time the factory may have expanded further.
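The lag argument can be put in rough numbers: if consumption exceeds production, the shortage only becomes visible once the buffer has drained, which takes the buffer size divided by the deficit. This is a simplified sketch with made-up figures; `seconds_until_shortage` and all the numbers are hypothetical, not measurements from the game.

```python
def seconds_until_shortage(
    buffered_items: float, production_per_s: float, consumption_per_s: float
) -> float:
    """Time until a buffer runs dry when consumption outpaces production."""
    deficit = consumption_per_s - production_per_s
    if deficit <= 0:
        return float("inf")  # production keeps up, so the buffer never drains
    return buffered_items / deficit


# Same 2 items/s deficit, two different amounts of belt buffer (made-up values).
short_belt_lag = seconds_until_shortage(50, 10.0, 12.0)
long_bus_lag = seconds_until_shortage(4000, 10.0, 12.0)
```

With these assumed values the short belt exposes the problem after 25 seconds, while the long bus hides it for 2000 seconds, which is the delay between cause and effect the comment is describing.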

It also contributes to see-saw production, whereby a shortage in one area causes a pause, which alleviates the root cause shortage for a while by backing up other production. The longer the lag between cause and effect, the greater the banding effect, further masking the root cause.

A bus also encourages bottom-up, which further encourages massive over-consumption of base resources. If you start building green chip production and bussing it, the bus may well fill and buffer, making it look like you've got plenty of green chips. In turn, as the green chip production stops, it'll look like you've got plenty of iron plates. You'll build all your malls and other production, satisfied as you build each that they run fine and are not over-consuming.

Only later, when everything starts to run at once, do you realise that the stuff down the end of the line is getting scant resources, as previously each part was running in isolation before serious amounts were required.

In contrast, a top-down approach involves building the final result first, then at each step building what's needed to feed it. This ensures that there is always enough provision, and everything can be placed to minimise buffer to reduce lag and improve feedback time on problems. It also reduces pollution since any item on a belt represents inventory for which you've paid a pollution cost but not got any final results from yet.

The spaghetti approach can lead to "under-utilised" buildings, such as a smelting array that ends up only needing to supply 0.3 of a belt. But in factorio space is almost endless, and there's little to no cost to idle buildings. The power drain of idle assemblers, particularly the bare (no module) level 2 buildings you'll likely be building before end-game, is extremely low.

For late game post-rocket, this changes of course. With beacons and level 3 assemblers with modules, the idle draw is significant, and you may want to optimise ratios and look to eliminate how many assemblers you run idle. ( That said, power is almost a non-issue in 2.0 with nuclear power being much easier to run efficiently than previously, so the large solar fields aren't really needed anymore. )

Busses have a strong visual appeal, but unlike "cable management", there's no airflow to consider in factorio. A messy spaghetti base isn't inherently inefficient. It doesn't affect productivity to just run short belts all over.

The visual temptation of the mega-bus is clearly alluring, it looks good on youtube video guides.

replies(1): >>43334623 #
11. tavavex ◴[] No.43334623{3}[source]
That makes sense. I guess I just had a different approach when I played the other games. The way I organized in other factory games is that the considerations of input and output were things that I thought of upfront - I never eyeballed and then tried to estimate the production speed based on how fast my resources were drained. I might be overplanning, or maybe Factorio encourages a far more chaotic approach, but I always treated factories as black boxes that take X/s of certain items and outputted X/s results. Knowing precisely how many items per second I have on any individual belt is the most essential piece of knowledge to me, so I never relied on buffering and always made sure to build consumer factories that never overwhelmed producer factories. This means that the visual indication of the buffer draining would only signal some building mistake to me, rather than a design mistake.
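The "black box" planning style described here amounts to a rate-balance check: for each item, total consumer demand must not exceed producer supply. A minimal sketch with made-up rates follows; `overdrawn_items` and every number in it are hypothetical.

```python
def overdrawn_items(supply_per_s: dict, demand_per_s: dict) -> dict:
    """Return each item whose demand exceeds supply, with the shortfall per second."""
    shortfalls = {}
    for item, demand in demand_per_s.items():
        supply = supply_per_s.get(item, 0.0)
        if demand > supply:
            shortfalls[item] = demand - supply
    return shortfalls


# Made-up rates in items per second.
supply = {"iron_plate": 15.0, "copper_plate": 7.5}
demand = {"iron_plate": 12.0, "copper_plate": 9.0}
shortfalls = overdrawn_items(supply, demand)
```

In this example only copper is overdrawn, by 1.5 items per second, so that is where more production would be needed before adding consumers.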
12. lupusreal ◴[] No.43335367[source]
I delete my bus as soon as I get bots (in favor of a train base feeding a bot mall), but I've found that a small and not overly strict bus is the fastest way, for me, to get bots unlocked.
13. deterministic ◴[] No.43377310[source]
I have tried different approaches and ended up with a single small bus of raw materials (coal, copper, iron, stone) with everything else hanging off it. It scales amazingly well and avoids spaghetti layouts.

Oil stuff is done separately and fed into the structure where needed.