Show HN: Factorio Learning Environment – Agents Build Factories

1. jxjnskkzxxhx ◴[12 Mar 25 12:35 UTC] No.43342548[source]▶

I don't understand - were these models post-trained to play factorio? A) If so, how is that possible given that e.g. Claude doesn't have public weights? B) If not, how would the agent know what the API does? Even if it's "guessing" from the English meaning of the API commands (e.g. place_entity_next_to places entity next to something), how would it know what the recipes are? If it's trying and learning we go back to A).

Having read the pdf I don't think these models were post-trained, so how do we explain the questions in B)?

And if indeed there's no post-training and authors expected exploration of recipes to come from the context window.... I think that's way too short for RL-style improvement.

In short, I don't understand how they could've tested those models with post training, and without post training they all did unbelievably well.

If the authors read this: can you give us an idea how many API query and API pairs fit within the context window, on average? Follow up, do you get better results if you abbreviate the API call names, so that more response pairs fit within one context window?

replies(3): >>43342772 #>>43343013 #>>43343573 #

2. c0wb0yc0d3r ◴[12 Mar 25 13:03 UTC] No.43342772[source]▶

>>43342548 (TP) #

The way I read the footnotes about the authors, one works at Anthropic. I would guess that is some insider access.

replies(1): >>43343019 #

3. noddybear ◴[12 Mar 25 13:26 UTC] No.43343013[source]▶

>>43342548 (TP) #

These models were not post-trained - all off-the-shelf.

We can fit about 128 pairs maximum in the context, but this performed the same as 32, which we ultimately decided on (for cost, latency purposes).

Encoding the input/outputs to make them shorter degraded performance. It seems that descriptive names is helpful for pretrained models because they have an intuition on what they do.

replies(1): >>43349490 #

4. noddybear ◴[12 Mar 25 13:26 UTC] No.43343019[source]▶

>>43342772 #

One of us works at Anthropic - but we had no insider access to any models or weights. All of our evals were on public models.

5. martbakler ◴[12 Mar 25 14:23 UTC] No.43343573[source]▶

>>43342548 (TP) #

To also jump in here, regarding tools the agents had access to function signatures (i.e tool docstrings, input and output types) and for each tool a small "manual", which described what the tool does, how it affects the game state and a small number of examples where using this tool would be useful (for instance, how to use place_entity_next_to to put an inserter next to an existing chest)

Overall as Jack said, no post-training was done at all but all agents had a complete API description (tools, entities, research) in their context so the results indicate to some level how well can modern agents use a completely OOD API with decent level of documentation

6. jxjnskkzxxhx ◴[13 Mar 25 01:24 UTC] No.43349490[source]▶

>>43343013 #

Follow up. Do you have an hypothesis why Claude performs much better than the rest at these tasks?

Is it just because Clause is the best at coding and the API is code? (not very interesting). Maybe if the API required the llms to write in poems, the best LLM at poetry would win...

Or is it because whatever makes claude good at coding, also makes it good at mathematical-like tasks. This is more interesting, as it would show some transfer learning. It would also suggest if you're doing training for a specific task, you would also benefit from training adjacent tasks e.g. if you're training for maths you could benefit from training coding. I believe this is actually true for humans.

And would you know how to check whether if any of the above hypothesis is correct?