FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.
A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.
The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.
Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
Agents interact with FLE through a REPL pattern:
1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
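A minimal sketch of that observe-act-feedback loop. `FactorioEnv` and `agent_policy` here are illustrative stand-ins, not the real FLE API: the toy environment just `exec`s the agent's code and returns its stdout or exception text.

```python
import io
import contextlib

class FactorioEnv:
    """Toy stand-in for the environment: runs agent code, returns feedback."""
    def __init__(self):
        self.state = {"iron_plate": 0}

    def execute(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {"state": self.state})
        except Exception as e:
            return f"Exception: {e!r}"
        return buf.getvalue()

def agent_policy(last_observation: str) -> str:
    # A real agent would prompt an LLM with `last_observation`;
    # here we hard-code a trivial action for illustration.
    return "state['iron_plate'] += 10; print('iron_plate =', state['iron_plate'])"

env = FactorioEnv()
observation = ""
for _ in range(3):                    # observe -> act -> receive feedback
    code = agent_policy(observation)  # generate Python for the next action
    observation = env.execute(code)   # feedback includes stdout / exceptions

print(observation.strip())  # → iron_plate = 30
```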
We provide two main evaluation settings:
- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map
We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).
The code is available at https://github.com/JackHopkins/factorio-learning-environment.
You'll need:
- Factorio (version 1.1.110)
- Docker
- Python 3.10+
The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.
We would love to hear your thoughts and see what others can do with this framework!
Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?
Models struggle in 2 main areas. The first is spatial reasoning: the models often make off-by-one errors which they find hard to recover from (factories are very sensitive to these mistakes - like in programming). The second is long-term planning, i.e. figuring out what to do strategically before making tactical subgoals.
In lab-play, difficulty generally scales in proportion to the depth of the production chains. If an item requires several factory segments to be built first, it becomes a lot more challenging. I think this is related to planning though, as the models tend to get down 'into the weeds' of fixing minor issues - rather than coming up with a master plan first.
More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas of creativity for solutioning.
For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.
This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.
> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.
Not even humans can pass this benchmark.
Since a good AI is too likely to beat humans due to high APM, don't limit their intelligence, instead limit their APM...
Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.
Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.
It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.
I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.
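The backward-chaining idea can be sketched in a few lines: given a target rate, recursively accumulate the rates needed for each sub-resource. The recipe table below uses toy values, not real Factorio numbers:

```python
RECIPES = {
    # item: (crafting_time_s, output_count, {ingredient: count})
    "science":      (5.0, 1, {"gear": 1, "copper_plate": 1}),
    "gear":         (0.5, 1, {"iron_plate": 2}),
    "iron_plate":   (3.2, 1, {"iron_ore": 1}),
    "copper_plate": (3.2, 1, {"copper_ore": 1}),
}

def required_rates(item, rate, totals=None):
    """Accumulate items/sec needed to sustain `rate` of `item`, recursively."""
    if totals is None:
        totals = {}
    totals[item] = totals.get(item, 0.0) + rate
    _time_s, out, ingredients = RECIPES.get(item, (0, 1, {}))  # raw ores have no recipe
    for ing, count in ingredients.items():
        required_rates(ing, rate * count / out, totals)
    return totals

totals = required_rates("science", 10.0)  # target: 10 science/sec
print(totals["iron_ore"])                 # gears need 2 iron plates each → 20.0
```

From these rates you could then derive machine counts and feed them into a placement/routing optimizer.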
I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.
I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.
Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?
There's probably also a distorting factor in that all the AI research into stock market and military applications probably doesn't get published, so it seems like video game AIs are a much larger percentage of research than it actually is.
The main goal of an enemy AI isn't to be the hardest thing in the world, it's to provide an interesting challenge for the player to overcome. It's not necessarily difficult to make a hypercompetent AI in most games, but that also wouldn't make it very interesting to play against. Most games have finite states of logic, just large enough to the point where a human would have trouble finding every solution to it (although humans tend to be very good at pushing on the edges of these states to find ways around them).
Even in games where the amount of state is much higher than usual, you rarely want a super AI; nobody likes playing against an aimbot in an FPS for example.
Factorio is an outlier because unlike regular games, the true condition for a "victory" is almost entirely up to the player. You can make a rocket in non-DLC Factorio (the game's victory condition) without building any factory at all beyond the most basic structures for stuff you can't handcraft. It'd be extremely slow, but it's an option. That's why the benchmark for this sort of thing is more efficiency than it is "can this work".
LLMs tend to build themselves into corners here quite often. Basically, if they break the topology (e.g. enclosing their factory in pipes) they struggle to reason over it and correct it. My basic view on this is that there exists some set of functions/data-structures that they can design in FLE, which will give them a better view over their factory to enable scaling (if the models take a step back to consider it).
We currently do track SPM, but decided against making that our main metric, as it zeroes out in the early stages. We use 'production score' instead, which is a more generalised metric that just captures total production (multiplied by an item-price).
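A production-score-style metric is straightforward to sketch: total items produced, weighted by a per-item price. The price table below is illustrative, not FLE's actual values:

```python
# Hypothetical per-item prices (illustrative only).
ITEM_PRICES = {"iron_plate": 1.0, "gear": 2.5, "electronic_circuit": 4.0}

def production_score(production: dict) -> float:
    """Sum of items produced, each weighted by its price."""
    return sum(count * ITEM_PRICES.get(item, 0.0)
               for item, count in production.items())

score = production_score({"iron_plate": 1000, "gear": 200, "electronic_circuit": 50})
print(score)  # 1000*1.0 + 200*2.5 + 50*4.0 → 1700.0
```

Unlike SPM, a metric like this is non-zero from the first mined plate, which is what makes early-game comparisons possible.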
There was a cool paper that came out a few years ago using meta-heuristics to do this, (https://arxiv.org/abs/2102.04871), but I reckon the combinatorial complexity of large factories makes it challenging to solve beyond trivial factories.
It's worth noting that agents in FLE can write their own libraries etc., so a dominant strategy could be for an LLM agent to implement a solver in Python to do the heavy lifting. This is quite far from current capabilities though.
There's no problem asking AI for the blueprints to a working faster-than-light spaceship, only we already know the AI will fail, and the way it fails provides no useful information.
Something about the track building feels clunky; I don't know what the underlying thing really is, but it makes me prefer the simplicity of items moving, splitting, and merging on belts.
leaderboard: https://jackhopkins.github.io/factorio-learning-environment/...
Yes, the agents can consistently produce economic growth in game - but we don't really see a take off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.
Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.
All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.
Simply giving the agents access to the tool descriptions and API schema is like 20k tokens from the outset.
It would be really cool to use retrieval techniques to reduce this burden. I suspect that this will also outright improve the performance of all models - which becomes worse as the context scales.
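One naive sketch of that retrieval idea: score each tool description against the agent's current goal and include only the top-k in context, instead of all ~20k tokens of docs. The tool names and descriptions below are made up, and the word-overlap scoring is a stand-in for a real embedding-based retriever:

```python
# Hypothetical tool docs (not the real FLE tool schema).
TOOL_DOCS = {
    "place_entity": "place an entity like a drill or belt at x y coordinates",
    "connect_entities": "connect two entities with belts or pipes",
    "craft_item": "craft an item from ingredients in the inventory",
    "get_research_progress": "inspect current research and science progress",
}

def retrieve(goal: str, k: int = 2):
    """Rank tools by naive word overlap with the goal; return the top k."""
    goal_words = set(goal.lower().split())
    scored = sorted(TOOL_DOCS,
                    key=lambda t: -len(goal_words & set(TOOL_DOCS[t].split())))
    return scored[:k]

tools = retrieve("connect iron drill to furnace using belts")
print(tools)  # → ['connect_entities', 'place_entity']
```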
I'm an anti-bus extremist. ( I've even considered registering BanTheBus.com and doing an over-the-top static anti-bus website. ), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )
Busses are the opposite of good factorio factories. They undo a lot of the benefits of a healthy just-in-time (JIT) manufacturing, by encouraging massive amounts of buffer (belt-buffer).
They also encourage people to anti-learn fundamental principles. You often see people do "starter-busses" with 4 lanes of iron plates, but fed by only one actual belt's worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.
They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.
Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you don't have enough input, then build more of that input. Then repeat as necessary.
There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overproduction" is stopped at source, because assemblers don't produce if they have nowhere to output into.
The biggest classic beginner mistakes in factorio are:
- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )
- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.
- Looking for guides and discovering "The Main Bus".
That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.
Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.
I have anecdotally tried using screenshots to help models debug their factories, but without training a custom CNN/ViT on the Factorio UI, the visual outputs miss critical things (e.g. gaps in transport belts).
That said, we have demonstrated via unit tests that the API is technically sufficient to progress to a rocket launch alone. We have been able to complete most lab tasks using the API ourselves so the humans still have a hefty lead here! The ones that we didn't do are the late-game lab tasks, which would have taken significant time and which frontier models are far from being able to complete.
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea, I assume most Factorio players that keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.
I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control intensive and has a high chance for even pro players to screw it up when attempting to do so. It also doesn't seemingly give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
However, our results only loosely indicate that smarter models use fewer ticks. I have an intuition that we will get more signal for tick-efficiency as models improve.
We designed the API to be as spatially descriptive as possible (including x-y coordinates and neighbors in game-state descriptions), and the agents have tools to aid them in carrying out actions which would benefit from vision (e.g. finding buildable areas of different sizes on the map, placing entities next to other entities, etc.).
As Jack said, we completed most of the lab tasks manually ourselves, and while it took us a lot longer compared to having vision, the tasks were still doable and human performance is significantly higher than current agents. We are thinking of supporting vision for future evals, but from a small number of tests we ran, current models got even more confused as the number of entities on the map grows quite quickly. This is likely due to VLMs being notoriously bad at visual reasoning on images with lots of detail; in a game where one misplaced entity in a large factory breaks everything, the errors start to compound.
Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.
I disagree that you need a significant amount of thinking ahead. At the beginning spaghetti belt is fine, as you have few resources and you don't have the luxury of overbuilding. Once you start getting "bigger" and into more complex designs you can just leave what you already built how it is and build the new stuff somewhere else.
By the time you need to produce thousands of pieces of an item you can probably prepare a blueprint that builds the whole factory in a click.
My approach to Factorio is built on phases:
1: build ad hoc infrastructure for the specific material that I need, close to the raw resources
2: prepare blueprints for specific resources, so that if I need more of something I can just build an extra factory. I make the blueprints so that I can compose them, like input belts on one side and output belts on the other. Such "factories" are almost self-contained, in that they take in only a subset of materials (plates, plastic and stuff that involves liquids) and produce all the intermediate materials themselves. This leaves some optimizations on the table, but simplifies the logistics. I use trains to fetch resources from far away.
3: compose the blueprints of the previous step to make "megafactories" with stations included. While at step 2 input and output of the factories are belts, at this step the input/output are train stations for specific material (with proper names, so I can add a new factory and trains will start delivering materials right away)
Of course my approach is not the only possible and probably not even efficient. I play for fun, with no care for the time it takes, as long as the time spent is enjoyable.
Edit: IMO the biggest difference between Satisfactory and Factorio is that Satisfactory has no crises. If a Satisfactory base shuts down it is annoying, but you can dig another miner / build another plant / etc, entirely at your leisure. But in Factorio, a shutdown is an emergency with a ticking clock.
First MVP stupid designs, then optimized routing, and eventually usable ingame where it connects with provided in/outputs.
Would be more fun to develop than to play obviously..
I liked the nilhouse mega base with those factory-train-block blueprints; it's basically Factorio DUPLO.
https://factorio.com/blog/post/fff-377, https://factorio.com/blog/post/fff-389, https://factorio.com/blog/post/fff-403
Advantages of yellow belts:
- Don't consume power
- Don't produce pollution ( don't attract bugs )
- Are cheap. Rails themselves aren't too bad, costing only slightly more per tile than belts. But rails need stations, signals, chests, inserters for loading / unloading. Often combinators too. All add up to much higher investment, and critically in the early game, add up to more pollution produced before the pay-off of improved resources incoming.
- Don't need lots of belt for the stations. A typical loading/unloading station can often have so much belt for "efficient" (fast) loading/unloading that you could have belt half your way to where you're going just laying that belt in a straight line.
If you're going further than the first ring of extra resources, then trains are amazing, but there are more than enough resources several times over in the first ring of resources to get a rocket launched. The first expansion ring tends to have 500k-1Mil per patch, and have several patches, so there's no need to go miles out pre-rocket.
It takes surprisingly little raw resources to actually launch a rocket. Someone did the maths once and calculated completing the game as needing a minimum of ~500k iron ore. At 15/s, a single yellow belt can deliver that in under 10 hours, way below the time of a typical playthrough. This technically means a single smelting line is all you need to actually complete the game in a reasonable time. Of course, trying to do so from a single lane would be extremely painful, and would need a lot of attention to preventing over-production of intermediates, especially when it's much easier to make a few smelting lanes. I'm not recommending that, but I am recommending just slapping down new lanes and production wherever you feel like it and whenever you need, rather than pre-planning 4 lanes of plates that aren't actually useful or efficient.
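The back-of-envelope arithmetic checks out (taking the ~500k ore figure as given):

```python
# ~500k iron ore over a single yellow belt at 15 items/sec.
ore = 500_000
belt_rate = 15                     # items/sec for one fully used yellow belt
hours = ore / belt_rate / 3600     # seconds → hours
print(round(hours, 1))             # → 9.3, i.e. under 10 hours
```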
Unless you mean your "main bus" is just 1x copper and 2x iron lane, in which case fair enough, but when I attack the concept of the bus, that's not what I'm railing against. What I'm railing against is the design pattern where people put down 4 lots of 4-wide lanes to bus far more iron, copper and intermediates than will ever be needed to actually launch a rocket.
Busses aren't efficient by any metric, other than minimising personality, creativity and thinking.
With a bus you just follow the same template done before, it'll get you to the end, but it teaches bad habits and isn't efficient. It isn't quick either, you'll spend a long time putting down lanes and splitters and undergrounds. All of which need producing.
For late-game (post-rocket), then either a bus design or train / city-block designs can work well. I prefer trains, but large busses have their place for mega-bases.
But for pre-rocket, or anyone starting out, spaghetti is absolutely the way to go. It'll also better teach you via your own mistakes.
I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have recently released Qwen 2.5 VLM which seems to be quite strong for its size.
You harp on this lack of spatial ability a fair amount, which - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?
Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.
P.S. Seems like MCP-enabling the Python library is a natural must-do, so that all tool-enabled LLMs everywhere can play Factorio.
So, I almost never build a large main bus style base, but it’s a fun (and helpful) part of the game’s design and tooling to allow you to create very large working systems out of a variety of component scales — “buses” enable this, and make it MUCH easier to implement.
I'm not sure we can conclude the game's not in their training set - a lot has been written about Factorio on the internet, and a lot has been recorded; the newer multimodal LLMs have trained on YouTube, especially Gemini.
But it (for me at least) is so much more fun building the spaghetti and making things work, refactoring as you go, and expanding organically.
I wonder what a quick way to calculate how many Unicode characters you'd need is... I guess every entity + four orientations. Underground belts and pipes seem tough, but I guess you could just add an encoding showing whether the square has an underground belt or pipe.
I propose this would work. I think I’ll give it a try today.. I’d love dwarf fortress factorio. That said, the encode/decode phase seems like a lot of tokens for a model that’s not trained to understand the Unicode ‘map’. Seems like you’d want to fine tune something at least. Maybe a layout model.
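A quick sketch of the idea: one glyph per tile, keyed on (entity, orientation). The glyph choices below are arbitrary, and a real encoding would need to cover every entity type plus undergrounds:

```python
# Arbitrary glyph table: (entity, orientation) → one character per tile.
GLYPHS = {
    ("belt", "N"): "↑", ("belt", "E"): "→",
    ("belt", "S"): "↓", ("belt", "W"): "←",
    ("assembler", None): "A", ("drill", None): "D",
    (None, None): ".",  # empty tile
}

def render(grid):
    """Render a 2D grid of (entity, orientation) cells as a text map."""
    return "\n".join("".join(GLYPHS[cell] for cell in row) for row in grid)

grid = [
    [("drill", None), ("belt", "E"), ("belt", "E"), ("assembler", None)],
    [(None, None),    (None, None),  (None, None),  ("belt", "S")],
]
print(render(grid))
# D→→A
# ...↓
```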
Good point with MCP as well given it has been blowing up lately, we'll look into that!
I don't remember the details, as I've only watched for a few minutes and found the whole thing boring, but he seems to have been going at it for a few weeks now, so it's probably working on some level.
I think this could be a good starting point for what you describe! This stuff is always more fun to develop than to play. Since I started working on this project, I can't bring myself to play the core game myself...
The main bus is for production capacity discovery. New players have to discover the production tree somehow and browsing recipe cards with a calculator before starting is not the way, it has to be hands-on and interactive. Dividing up your factory into (arbitrarily defined) intermediates production and final product production is helpful to cut through the complexity. Of course, intermediate production will be severely underbuilt and logistics capacity will never keep up with consumption of all of the final product builds, but it will be enough to run a couple builds at once and test new ones.
Once new players have a grasp of the scope of the production tree, with tangible examples of sub-factories, then they can begin to consider end-to-end production capacity.
Sure, if you already know what you're doing, you can plan ahead and leave space for the right things. For everyone else, it simplifies the game to a manageable level.
Inworld's been doing this but haven't seen what they've done recently. https://inworld.ai/blog/inworld-stardew-valley-ai
The question is what the most efficient and high-quality representation we could use to improve that would be.
I think you just described a research paper that would advance SOTA. Less describing why, but how. (Assuming it's not just "we finetuned the model and it worked perfectly".)
Our sparse encoding seems to confuse the models less - even though it certainly isn't perfect.
I work on a similar factory game (Captain of Industry) and I have always wanted an agent that can play the game for testing and balancing reasons. However, pixels-to-mouse-actions RL policy (similar to Deep Mind's StarCraft agent) always seemed like a very hard and inefficient approach. Using code-like API seems so much better! I might try to find some time to port this framework to COI :) Thanks for sharing!
- The factory is not powered, place the missing power pole(s)
- The factory is missing items, place the missing belt(s)
- Craft and place these 200 assembly machines
- The assembly machine is not running for some reason, fix it
- The factory production is too low, double it
- Get to this other point in the factory as fast as possible
- Fix the brownout
- All of the above with and without bots
Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used. I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity, where more complex scenarios are presented after the agent scores sufficiently high on lower-complexity scenarios.
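The bank-and-sample mechanism is simple to sketch; the scenario templates and level counts below are made up for illustration:

```python
import random

def make_bank():
    """Generate a bank of scenarios tagged with a difficulty level."""
    bank = []
    for level in range(1, 6):
        for i in range(200):  # 1000 generated scenarios total
            bank.append({"level": level,
                         "task": f"place {level * 10} belts (variant {i})"})
    return bank

def draw(bank, level, n=12, rng=None):
    """Sample n scenarios at or below the agent's current level."""
    rng = rng or random.Random(0)  # seeded for reproducible evals
    eligible = [s for s in bank if s["level"] <= level]
    return rng.sample(eligible, n)

bank = make_bank()
batch = draw(bank, level=2)
print(len(batch), all(s["level"] <= 2 for s in batch))  # → 12 True
```

Promotion to the next level would then be gated on the score over a batch.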
With no buffering, as soon as your demand for steel is greater than your production of steel, the bottleneck is immediate and obvious. The solution is also immediate and obvious: build more steel.
Buffering, in particular belt buffering, and in particular busses, contributes to masking this issue. There can be a long delay between consumption rising above production and the symptoms appearing, so the root cause can be very hidden. It may also be that the ultimate root cause is that steel production is low because it's limited in how much iron ore it gets. If everything is bussed, then it can be hours before resource constraints are hit, by which point it's very hard to see what's happened to cause the shortage, and also by which time the factory may have expanded further.
It also contributes to see-saw production, whereby a shortage in one area causes a pause, which alleviates the root-cause shortage for a while by backing up other production. The longer the lag between cause and effect, the greater the banding effect, further masking the root cause.
A bus also encourages bottom-up building, which further encourages massive over-consumption of base resources. If you start building green chip production and bussing it, the bus will fill and buffer, making it look like you've got plenty of green chips. In turn, as the green chip production stops, it'll look like you've got plenty of iron plates. You'll build all your malls and other production, satisfied as you build each one that it runs fine and isn't over-consuming.
Only later, when everything starts to run at once, do you realise that the stuff down the end of the line is getting scant resources, as previously each part was running in isolation before serious amounts were required.
In contrast, a top-down approach involves building the final result first, then at each step building what's needed to feed it. This ensures that there is always enough provision, and everything can be placed to minimise buffer to reduce lag and improve feedback time on problems. It also reduces pollution since any item on a belt represents inventory for which you've paid a pollution cost but not got any final results from yet.
The spaghetti approach can lead to "under-utilised" buildings, such as a smelting array that ends up only needing to supply 0.3 of a belt. But in Factorio space is almost endless, and there's little to no cost to idle buildings. The power drain of idle assemblers, particularly the bare (no module) level 2 buildings you'll likely be building before end-game, is extremely low.
For late game post-rocket, this changes of course. With beacons and level 3 assemblers with modules, the idle draw is significant, and you may want to optimise ratios and look to eliminate how many assemblers you run idle. ( That said, power is almost a non-issue in 2.0, with nuclear power being much easier to run efficiently than previously, so the large solar fields aren't really needed anymore. )
Busses have a strong visual appeal, but unlike "cable management", there's no airflow to consider in factorio. A messy spaghetti base isn't inherently inefficient. It doesn't affect productivity to just run short belts all over.
The visual temptation of the mega-bus is clearly alluring, it looks good on youtube video guides.
That compresses nicely into text, I imagine.
I'd like to hear more details about your symbolic approach!
I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?
I wonder if this same approach could be used here in factorio? Using the pokemon red analogy the main "essential tasks" in Factorio are setting up automation for new items and new science packs. I think a good reward function could involve small rewards functions for production rates of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack.
Telling a Factorio agent to just "make a big factory" is like telling a pokemon red agent to just "beat the game", it has to be broken down into smaller steps with a very carefully tuned reward function.
Thinking about this is really making me want to jump into this project!
Until you accidentally feed a different material into your belt and need to clean it up
Also in general I think the issue with Factorio is that you can just find an "optimal" factory design and build order and just follow it every time; perhaps starting with a suboptimal building layout already present and restrictions like being unable to change them or build others of the same type could help.
I think you bring up a good point, we could create tasks where the goal is to optimise a static factory, starting from a kernel of functionality like 'steam engine power supply' etc.
The way I think of it is this. Yes, the LLM is a "general reasoner." However, it's locked in a box, where the only way in and out is through the tokenizer.
So there's this huge breadth of concepts and meanings that cannot be fully described by words (things like spatial reasoning, smells, visual relationships, cause-and-effect physical relationships, etc). The list of things that can't be described by words is long. The model would be capable of generalizing on those; it would optimize to capture those. But it can't, because the only thing that can fit through the front door is tokens.
It's a huge and fundamental limitation. I think Yann LeCun has been talking about this for years now and I'm inclined to agree with him. This limitation is somewhat obscured by the fact that we humans can relate to all of these untokenizable things -- using tokens! So I can describe what the smell of coffee is in words and you can immediately reconstruct that based on my description, even though the actual smell of coffee is not encoded in the tokens of what I'm saying at all.
1. Create an (intermediate) goal for resource production
2. Create a factory graph with the calculated number of machines and the number of resources to transport between them. This could be done using linear programming (as in the Factorio calculator)
3. Somehow map the resulting graph to a hardware description language, such that each entity maps to a unique logic component and each transport lane maps to a unique wire (most difficult)
4. Compile to a 2D FPGA layout using all the VLSI algorithms like partitioning and routing (HDL compiler)
5. Map the resulting plan back to a concrete Factorio design
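Step 2 doesn't even need a full LP solver when the recipe graph is acyclic: machine counts follow from back-propagating required rates through the graph. A sketch, with illustrative recipe numbers (not exact Factorio values):

```python
# item: (crafts_per_second_per_machine, {ingredient: amount_per_craft})
# These rates are illustrative stand-ins, not real Factorio recipe data.
RECIPES = {
    "electronic-circuit": (2.0, {"iron-plate": 1.0, "copper-cable": 3.0}),
    "copper-cable": (4.0, {"copper-plate": 0.5}),
    "iron-plate": (0.3125, {}),    # smelted in furnaces
    "copper-plate": (0.3125, {}),  # smelted in furnaces
}

def machines_needed(item, rate, totals=None):
    """Return {item: machine_count} needed to sustain `rate` items/sec."""
    if totals is None:
        totals = {}
    speed, ingredients = RECIPES[item]
    totals[item] = totals.get(item, 0.0) + rate / speed
    # Back-propagate the required input rates to each ingredient.
    for ingredient, amount in ingredients.items():
        machines_needed(ingredient, rate * amount, totals)
    return totals
```

A real version would need LP once you add shared intermediates with multiple producers, byproducts (oil processing), or module/beacon effects, but this covers the common acyclic case.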
> the difficulty level of different tasks is subjective
That makes sense. I wonder if the difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. it performs better at scenario T if it trains on scenario A first, but training on scenario B first doesn't help with T. Then infer A < T, and B ? T.
It makes sense why LLMs are bad with spatial reasoning. Not a lot of training data for it. I wonder what additional reasoning abilities will emerge when spatial reasoning is solved.
The reason I was suggesting this is that I worked in robotics making RL policies, and supplying image data (be it maps, lidar scans, etc.) was a common practice. But our networks were custom made to ingest these data and trained from scratch, which is quite different from this approach.
In my experience the current generation of models is very poor at spatial reasoning, even when given accurate coordinate-based location assignments for each object. But I suspect that once a model can build up the full relationship of all objects from those spatial relationships given as a vector, it will be much better.
Coding seems very close to puzzle solving in this regard.
* It'll output a broken script
* I tell it what's wrong and how to fix it
* It tells me I'm absolutely right and that it will correct it
* It outputs a script with the exact same brokenness
That was also one of our goals: to find out how the models act when given a very vague and general objective.
The model could also then be fed back the results of running the program and iteratively change it as needed.
I.e. prompt first with "Write a program that can play Factorio automatically given an interface <INTERFACE SPECIFICATION> and a set of goals in <GOAL FORMAT>, and produces text output that can help determine whether the program is working correctly and whether tasks are performed efficiently and goals are reached as fast as possible"
And then with "the program was run and produced this text output: <TEXT OUTPUT> Determine any possible bugs, avenues of improvements or missing output information and modify the program accordingly, printing the new version".
And iterate until there doesn't seem to be an improvement anymore.
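That generate-run-refine loop could be sketched as follows. Here `llm` and `run_program` are stand-ins for a chat-completion call and a sandboxed runner; neither is a real API, and the prompts are abbreviations of the ones above:

```python
def refine_program(llm, run_program, interface_spec, goal_format, max_iters=5):
    """Iteratively ask an LLM to write and then improve a Factorio-playing
    program, stopping when it no longer changes its answer."""
    program = llm(
        f"Write a program that plays Factorio given {interface_spec} "
        f"and goals in {goal_format}; print diagnostic output."
    )
    for _ in range(max_iters):
        output = run_program(program)
        revised = llm(
            f"The program was run and produced this output:\n{output}\n"
            "Determine any bugs, improvements or missing output, "
            "modify the program accordingly, and print the new version."
        )
        if revised == program:  # fixed point: no further improvement
            break
        program = revised
    return program
```

The fixed-point stopping criterion is crude (a model may oscillate between two versions rather than converge), so in practice you'd also want a score from `run_program` and keep the best-scoring version seen.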
Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.
This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.
That'd be actually interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world model capabilities, which would be even more interesting)
Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
This is a ton of compute power and complexity for what is basically a shitty AI. It has no practical purpose. Better AIs have been built with less, why don’t people appreciate them? Or do we just take them for granted?
This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.
Records and specific rules for all categories can be found at https://www.speedrun.com/factorio
For programming-like tasks, I expect a similar-ish distribution to that in programming, see e.g. https://web.lmarena.ai/leaderboard
I think a tokenization of ratios between perceived boundaries might help. But I'm just shooting in the dark.
While I appreciate the effort and creativity that went into this there are a lot of much simpler dynamic benchmarks that can let you saturate the planning capabilities of non-reasoning models.
Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.
Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:
| Model             | Path Length |
|-------------------+-------------|
| Claude Sonnet 3.5 | 10          |
| GPT-4o            | 7           |
| GPT-4o-mini       | 4           |
| Deepseek-v3       | 6           |
| Gemini-2-Flash    | Not tested  |
| Llama3.3-70B-Ins  | 4           |
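Ground truth for this kind of benchmark is just breadth-first search over the flight graph. A sketch of the checker, assuming undirected connections (my assumption; a directed variant only needs the symmetric insertion removed):

```python
from collections import deque

def shortest_path_len(edges, src, dst):
    """BFS shortest path length (number of flights) between two cities,
    or None if no itinerary exists."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # flights assumed bidirectional
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        city, dist = queue.popleft()
        if city == dst:
            return dist
        for neighbor in adjacency.get(city, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

To make the test harness, you'd generate random graphs, pick city pairs whose BFS distance equals the target length, and score a model's proposed itinerary by checking every leg exists and the hop count matches.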
Isn't it literally infinite via even the simplest simulator?
You could generate an unlimited training set just by implementing tic-tac-toe on an unbounded grid, for example, in like 10 lines of code.
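The core of that simulator really is about that size. A sketch: the board is a dict from coordinates to players, and a win check scans the four line directions from the last move (`k` in a row, my parameterization):

```python
def wins(board, player, move, k=3):
    """Check whether `player`'s stone at `move` completes k in a row
    on an unbounded grid stored as {(x, y): player}."""
    x, y = move
    for dx, dy in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1  # the stone just placed
        for sign in (1, -1):  # walk both ways along the line
            nx, ny = x + sign * dx, y + sign * dy
            while board.get((nx, ny)) == player:
                count += 1
                nx, ny = nx + sign * dx, ny + sign * dy
        if count >= k:
            return True
    return False
```

Pair this with any two policies (even random ones) and every game is a fresh, automatically labeled training episode.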
The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)
The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk) which is horrendously inefficient.
Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).
In this particular case with Factorio, I suspect generating the synthetic data would be easier, since the rules of the environment are relatively simple and well defined, with quantifiable outcomes.
Having read the pdf I don't think these models were post-trained, so how do we explain the questions in B)?
And if indeed there's no post-training and authors expected exploration of recipes to come from the context window.... I think that's way too short for RL-style improvement.
In short, I don't understand how they could've tested those models with post training, and without post training they all did unbelievably well.
If the authors read this: can you give us an idea how many API query and response pairs fit within the context window, on average? Follow-up: do you get better results if you abbreviate the API call names, so that more response pairs fit within one context window?
We can fit about 128 pairs maximum in the context, but this performed the same as 32, which we ultimately decided on (for cost and latency purposes).
Encoding the inputs/outputs to make them shorter degraded performance. It seems that descriptive names are helpful for pretrained models because they have an intuition about what they do.
Overall, as Jack said, no post-training was done at all, but all agents had a complete API description (tools, entities, research) in their context, so the results indicate to some degree how well modern agents can use a completely OOD API with a decent level of documentation.
For example the shortest path benchmark is largely useless when you look at reasoning models - since they have the equivalent of scratch paper to work through their answers the limitation became their context length rather than any innate ability to reason.
Is it just because Claude is the best at coding and the API is code? (Not very interesting.) Maybe if the API required the LLMs to write in poems, the best LLM at poetry would win...
Or is it because whatever makes claude good at coding, also makes it good at mathematical-like tasks. This is more interesting, as it would show some transfer learning. It would also suggest if you're doing training for a specific task, you would also benefit from training adjacent tasks e.g. if you're training for maths you could benefit from training coding. I believe this is actually true for humans.
And would you know how to check whether any of the above hypotheses is correct?
Overall visual perception is about noticing comparative differences not measuring absolute quantity.
Oil stuff is done separately and fed into the structure where needed.