Show HN: Factorio Learning Environment – Agents Build Factories

1. vessenes ◴[11 Mar 25 14:53 UTC] No.43333045[source]▶

OK, You’ve permanently nerd-baited me, and I wish to apply for a job at the Anthropic Factorio lab immediately.

I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have recently released Qwen 2.5 VLM which seems to be quite strong for its size.

You harp on this lack of spatial ability a fair amount, which - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?

Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.

P.s. seems like MCP enabling the python library is a natural must-do so that all tool-enabled LLMs everywhere can play factorio.

replies(3): >>43333278 #>>43333554 #>>43339075 #

2. martbakler ◴[11 Mar 25 15:12 UTC] No.43333278[source]▶

>>43333045 (TP) #

Currently it's a text-only modality environment but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance on the off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because the VLMs currently aren't good at spatial reasoning in high-detailed images, likely this would improve significantly with finetuning

Good point with MCP as well given it has been blowing up lately, we'll look into that!

replies(2): >>43333560 #>>43334032 #

3. jillyboel ◴[11 Mar 25 15:38 UTC] No.43333554[source]▶

>>43333045 (TP) #

Why would screenshots be necessary if a textual description of the factory state is both easier to interpret and less prone to confusion? The game is played on a grid, so converting the game state to ascii ought to be trivial.

replies(2): >>43333584 #>>43334007 #

4. vessenes ◴[11 Mar 25 15:39 UTC] No.43333560[source]▶

>>43333278 #

That makes sense and it’s really interesting - it is a challenging visual test for sure; thousands of entities, either multi tier visual representations (screen, map, overview map) or a GIANT high res image. I hereby propose FLE-V a subset benchmark for visual models where they just turn a factorio image into a proper FLE description. And maybe the overview and map images as well.

replies(2): >>43333720 #>>43334019 #

5. vessenes ◴[11 Mar 25 15:41 UTC] No.43333584[source]▶

>>43333554 #

Trivial as in only engineering work, sure. But it’s a lottt of tokens. Long context models do a number of things to get all that working context in; some of those things elide details / compress / have token segments that are harder to reason about. When a burner inserter at a location takes up like 50-100 tokens, and you want it to reason about 100 of them, this is still a pretty challenging task for any LLM.

replies(1): >>43333621 #

6. jillyboel ◴[11 Mar 25 15:44 UTC] No.43333621{3}[source]▶

>>43333584 #

Ah, I don't know much about multi modal models but I wonder what they'd think of pixel art representing the factory where each pixel is a point on the grid and each color is a specific entity, perhaps ignoring things such as bots flying about. Probably easier to comprehend than an actual screenshot?

replies(2): >>43333738 #>>43334166 #

7. kridsdale1 ◴[11 Mar 25 15:54 UTC] No.43333720{3}[source]▶

>>43333560 #

Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.

replies(2): >>43334227 #>>43337542 #

8. kridsdale1 ◴[11 Mar 25 15:56 UTC] No.43333738{4}[source]▶

>>43333621 #

I mean at some point you compress the board state down to Dwarf Fortress with an extended ASCII representation for each grid-state (maybe 2 bytes each?)

replies(1): >>43334261 #

9. martbakler ◴[11 Mar 25 16:17 UTC] No.43334007[source]▶

>>43333554 #

It actually is engineering wise quite trivial but the underlying question is which modality is the best to elicit spatial reasoning capabilities from the current general models. We tried (very anecdotally) a couple of months ago to get an agent to reason over a couple of ascii representations of factories and the results weren't very promising. It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

The question is what is the most efficient and high-quality representation we could use to improve that

replies(2): >>43336234 #>>43337687 #

10. ◴[11 Mar 25 16:18 UTC] No.43334019{3}[source]▶

>>43333560 #

11. grayhatter ◴[11 Mar 25 16:19 UTC] No.43334032[source]▶

>>43333278 #

> As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because [...]

I think you just described a research paper that would advance sota. Less describing why, but how. (Assuming it's not just, wy finetuned the model and it worked perfectly)

replies(1): >>43334820 #

12. noddybear ◴[11 Mar 25 16:31 UTC] No.43334166{4}[source]▶

>>43333621 #

The thing is that when you create a dense ASCII representation, any gain you might make from the spatial relationships is lost by: a) the tokeniser not working on characters alone (remember strawberrry), and b) the increased number of 'dead' tokens encoding not very much.

Our sparse encoding seems to confuse the models less - even though it certainly isn't perfect.

13. vessenes ◴[11 Mar 25 16:36 UTC] No.43334227{4}[source]▶

>>43333720 #

Well I better get training!

14. vessenes ◴[11 Mar 25 16:38 UTC] No.43334261{5}[source]▶

>>43333738 #

Lots of questions here - you need item, orientation, info about pipes (2 directions) , belts (3 or 4 colors x2 directions). Do you wish Circuits?

15. martbakler ◴[11 Mar 25 17:19 UTC] No.43334820{3}[source]▶

>>43334032 #

Sounds almost like a visual "needle in a haystack" type of work, that could be quite interesting!

replies(1): >>43338600 #

16. ajcp ◴[11 Mar 25 19:38 UTC] No.43336234{3}[source]▶

>>43334007 #

Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.

In my experience the current generation of models are very poor at spatial reasoning even when given accurate coordinate based location assignments of each object. But I suspect when a model can build the whole relationship of all objects by being given those spatial relationships in a vector they will be much better.

replies(1): >>43337086 #

17. martbakler ◴[11 Mar 25 21:16 UTC] No.43337086{4}[source]▶

>>43336234 #

We did discuss this at some point but didn't end up trying it out. I think it's quite an interesting avenue and worth a shot, my intuition also says that the spatial capabilities will improve if the model has more access to relative info and doesn't need to infer it from absolute coordinates

replies(1): >>43338628 #

18. dismalpedigree ◴[11 Mar 25 22:02 UTC] No.43337542{4}[source]▶

>>43333720 #

Not to mention the increased productivity of everyone not wasting their time in factorio (myself included) because the optimal solution is known.

replies(1): >>43340237 #

19. groby_b ◴[11 Mar 25 22:17 UTC] No.43337687{3}[source]▶

>>43334007 #

> It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

That'd be actually interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world model capabilities, which would be even more interesting)

Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".

Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]

20. pyinstallwoes ◴[12 Mar 25 00:18 UTC] No.43338600{4}[source]▶

>>43334820 #

Where’s Waldo test for vlm

21. pyinstallwoes ◴[12 Mar 25 00:21 UTC] No.43338628{5}[source]▶

>>43337086 #

Given vector space on text is more of a spatial space of semantic distance then spatial distance of geometric objects intuitively feel of a different nature due to the fact that words are not at all likely to be represented in similar ratios of distances.

I think a tokenization of ratios between perceived boundaries might help. But, I’m just shooting in the dark.

replies(1): >>43338728 #

22. ajcp ◴[12 Mar 25 00:36 UTC] No.43338728{6}[source]▶

>>43338628 #

You're conflating the use of vectors to only mean how they relate to semantic meaning. As vectors are just spatial relationships, in the case of objects in Factorio we could provide the vectors for every single object as to how they relate to every single other object in literal 2D space. This would essentially provide the LLM a complete relationship mapping, since it is not able to do it by "seeing" a picture or by providing it with absolute coordinates.

replies(1): >>43352839 #

23. ◴[12 Mar 25 01:31 UTC] No.43339075[source]▶

>>43333045 (TP) #

24. lukan ◴[12 Mar 25 05:33 UTC] No.43340237{5}[source]▶

>>43337542 #

Not wasted time, you were doing research it seems.

replies(1): >>43349602 #

25. dismalpedigree ◴[13 Mar 25 01:44 UTC] No.43349602{6}[source]▶

>>43340237 #

Good point. My wife will surely understand if I explain it as “research”

26. pyinstallwoes ◴[13 Mar 25 12:57 UTC] No.43352839{7}[source]▶

>>43338728 #

Yeah but that’s a biased approximation at the cost of an assumption in equivalence and not truth distills equivalent in ratio. You’d have to treat tokens at some universal distance if one unit to approximate some unit of measurement along hwd/magnitude.

Overall visual perception is about noticing comparative differences not measuring absolute quantity.