FLE is an open-source framework for developing and evaluating LLM agents in Factorio. It provides a controlled environment where AI models can attempt complex automation, resource management, and optimisation tasks in a grounded world with meaningful constraints.
A critical advantage of Factorio as a benchmark is its unbounded nature. Unlike many evals that are quickly saturated by newer models, Factorio's geometric complexity scaling means it won't be "solved" in the next 6 months (or possibly even years). This allows us to meaningfully compare models by the order-of-magnitude of resources they can produce - creating a benchmark with longevity.
The project began 18 months ago after years of playing Factorio, recognising its potential as an AI research testbed. A few months ago, our team (myself, Akbir, and Mart) came together to create a benchmark that tests agent capabilities in spatial reasoning and long-term planning.
Two technical innovations drove this project forward: First, we discovered that piping Lua into the Factorio console over TCP enables running (almost) arbitrary code without directly modding the game. Second, we developed a first-class Python API that wraps these Lua programs to provide a clean, type-hinted interface for AI agents to interact with Factorio through familiar programming paradigms.
Agents interact with FLE through a REPL pattern:
1. Observe the world (seeing the output of their last action)
2. Generate Python code to perform their next action
3. Receive detailed feedback (including exceptions and stdout)
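A minimal sketch of that observe-act-feedback loop. `FactorioEnv` and `agent_policy` here are illustrative stand-ins, not the real FLE API: the toy environment just `exec`s the agent's code and returns its stdout or exception text.

```python
import io
import contextlib

class FactorioEnv:
    """Toy stand-in for the environment: runs agent code, returns feedback."""
    def __init__(self):
        self.state = {"iron_plate": 0}

    def execute(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {"state": self.state})
        except Exception as e:
            return f"Exception: {e!r}"
        return buf.getvalue()

def agent_policy(last_observation: str) -> str:
    # A real agent would prompt an LLM with `last_observation`;
    # here we hard-code a trivial action for illustration.
    return "state['iron_plate'] += 10; print('iron_plate =', state['iron_plate'])"

env = FactorioEnv()
observation = ""
for _ in range(3):                    # observe -> act -> receive feedback
    code = agent_policy(observation)  # generate Python for the next action
    observation = env.execute(code)   # feedback includes stdout / exceptions

print(observation.strip())  # → iron_plate = 30
```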
We provide two main evaluation settings:
- Lab-play: 24 structured tasks with fixed resources
- Open-play: An unbounded task of building the largest possible factory on a procedurally generated map
We found that while LLMs show promising short-horizon skills, they struggle with spatial reasoning in constrained environments. They can discover basic automation strategies (like electric-powered drilling) but fail to achieve more complex automation (like electronic circuit manufacturing). Claude Sonnet 3.5 is currently the best model (by a significant margin).
The code is available at https://github.com/JackHopkins/factorio-learning-environment.
You'll need:
- Factorio (version 1.1.110)
- Docker
- Python 3.10+
The README contains detailed installation instructions and examples of how to run evaluations with different LLM agents.
We would love to hear your thoughts and see what others can do with this framework!
Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?
Models struggle in 2 main areas. The first is spatial reasoning: the models often make off-by-one errors which they find hard to recover from (factories are very sensitive to these mistakes - like in programming). The second is long-term planning, i.e. figuring out what to do strategically before making tactical subgoals.
In lab-play, difficulty generally scales in proportion to the depth of the production chains. If an item requires several factory segments to be built first, it becomes a lot more challenging. I think this is related to planning though, as the models tend to get down 'into the weeds' of fixing minor issues - rather than coming up with a master plan first.
More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas of creativity for solutioning.
For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.
This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.
> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.
Not even humans can pass this benchmark.
Since a good AI is too likely to beat humans due to high APM, don't limit their intelligence, instead limit their APM...
Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.
Humans with a little bit of experience playing factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short term minded it will probably build itself into a corner very quickly.
It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.
I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM and then work backwards towards how many of each resource you need, then recursively figure out fastest way to set up the production for each subresource using standard optimization algorithms from OR. No LLMs needed.
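The backward-chaining idea can be sketched in a few lines: given a target rate, recursively accumulate the rates needed for each sub-resource. The recipe table below uses toy values, not real Factorio numbers:

```python
RECIPES = {
    # item: (crafting_time_s, output_count, {ingredient: count})
    "science":      (5.0, 1, {"gear": 1, "copper_plate": 1}),
    "gear":         (0.5, 1, {"iron_plate": 2}),
    "iron_plate":   (3.2, 1, {"iron_ore": 1}),
    "copper_plate": (3.2, 1, {"copper_ore": 1}),
}

def required_rates(item, rate, totals=None):
    """Accumulate items/sec needed to sustain `rate` of `item`, recursively."""
    if totals is None:
        totals = {}
    totals[item] = totals.get(item, 0.0) + rate
    _time_s, out, ingredients = RECIPES.get(item, (0, 1, {}))  # raw ores have no recipe
    for ing, count in ingredients.items():
        required_rates(ing, rate * count / out, totals)
    return totals

totals = required_rates("science", 10.0)  # target: 10 science/sec
print(totals["iron_ore"])                 # gears need 2 iron plates each → 20.0
```

From these rates you could then derive machine counts and feed them into a placement/routing optimizer.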
I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.
I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.
Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?
There's probably also a distorting factor in that all the AI research into stock market and military applications probably doesn't get published, so it seems like video game AIs are a much larger percentage of research than it actually is.
The main goal of an enemy AI isn't to be the hardest thing in the world, it's to provide an interesting challenge for the player to overcome. It's not necessarily difficult to make a hypercompetent AI in most games, but that also wouldn't make it very interesting to play against. Most games have finite states of logic, just large enough to the point where a human would have trouble finding every solution to it (although humans tend to be very good at pushing on the edges of these states to find ways around them).
Even in games where the amount of state is much higher than usual, you rarely want a super AI; nobody likes playing against an aimbot in an FPS for example.
Factorio is an outlier because unlike regular games, the true condition for a "victory" is almost entirely up to the player. You can make a rocket in non-DLC Factorio (the game's victory condition) without building any factory at all beyond the most basic structures for stuff you can't handcraft. It'd be extremely slow, but it's an option. That's why the benchmark for this sort of thing is more efficiency than it is "can this work".
LLMs tend to build themselves into corners here quite often. Basically, if they break the topology (e.g. enclosing their factory in pipes) they struggle to reason over it and correct it. My basic view on this is that there exists some set of functions/data-structures that they can design in FLE, which will give them a better view over their factory to enable scaling (if the models take a step back to consider it).
We currently do track SPM, but decided against making that our main metric, as it zeroes out in the early stages. We use 'production score' instead, which is a more generalised metric that just captures total production (multiplied by an item-price).
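A production-score-style metric is straightforward to sketch: total items produced, weighted by a per-item price. The price table below is illustrative, not FLE's actual values:

```python
# Hypothetical per-item prices (illustrative only).
ITEM_PRICES = {"iron_plate": 1.0, "gear": 2.5, "electronic_circuit": 4.0}

def production_score(production: dict) -> float:
    """Sum of items produced, each weighted by its price."""
    return sum(count * ITEM_PRICES.get(item, 0.0)
               for item, count in production.items())

score = production_score({"iron_plate": 1000, "gear": 200, "electronic_circuit": 50})
print(score)  # 1000*1.0 + 200*2.5 + 50*4.0 → 1700.0
```

Unlike SPM, a metric like this is non-zero from the first mined plate, which is what makes early-game comparisons possible.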
There was a cool paper that came out a few years ago using meta-heuristics to do this, (https://arxiv.org/abs/2102.04871), but I reckon the combinatorial complexity of large factories makes it challenging to solve beyond trivial factories.
It's worth noting that agents in FLE can write their own libraries etc., so a dominant strategy could be for an LLM agent to implement a solver in Python to do the heavy lifting. This is quite far from current capabilities though.
There's no problem asking AI for the blueprints to a working faster-than-light spaceship, only we already know the AI will fail, and the way it fails provides no useful information.
Something about the track building feels clunky; I don't know what the underlying thing really is, but it makes me prefer the simplicity of items moving, splitting, and merging on belts.
leaderboard: https://jackhopkins.github.io/factorio-learning-environment/...
Yes, the agents can consistently produce economic growth in game - but we don't really see a take off, where the growth keeps compounding over time. This is certainly _possible_ in FLE, as agents could write their own Python utility functions etc to construct and manage large factories (imagine imperative Factorio blueprints), but we haven't seen that yet.
Designing the API to not get in the way was the biggest challenge. It was imperative to avoid modal collapse - where the factory could not be sufficiently well expressed in the outputs of a program. While we think that we have generally 'solved' this, there are occasionally examples where the agent acts based on its previous output, but fails because there is something blocking it that it cannot easily see. One example would be the edge of water getting in the way of an entity placement.
All of the lab tasks were completed by a human using only the API, and we have lots of tests (inductively) demonstrating that it is possible to get to a rocket launch using the API alone.
Simply giving the agents access to the tool descriptions and API schema is like 20k tokens from the outset.
It would be really cool to use retrieval techniques to reduce this burden. I suspect that this will also outright improve the performance of all models - which becomes worse as the context scales.
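One naive sketch of that retrieval idea: score each tool description against the agent's current goal and include only the top-k in context, instead of all ~20k tokens of docs. The tool names and descriptions below are made up, and the word-overlap scoring is a stand-in for a real embedding-based retriever:

```python
# Hypothetical tool docs (not the real FLE tool schema).
TOOL_DOCS = {
    "place_entity": "place an entity like a drill or belt at x y coordinates",
    "connect_entities": "connect two entities with belts or pipes",
    "craft_item": "craft an item from ingredients in the inventory",
    "get_research_progress": "inspect current research and science progress",
}

def retrieve(goal: str, k: int = 2):
    """Rank tools by naive word overlap with the goal; return the top k."""
    goal_words = set(goal.lower().split())
    scored = sorted(TOOL_DOCS,
                    key=lambda t: -len(goal_words & set(TOOL_DOCS[t].split())))
    return scored[:k]

tools = retrieve("connect iron drill to furnace using belts")
print(tools)  # → ['connect_entities', 'place_entity']
```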
I'm an anti-bus extremist. ( I've even considered registering BanTheBus.com and doing an over-the-top static anti-bus website. ), so take what I'm about to say with a pinch of salt. Also note that my post applies to non-space-age. Space-age changes the gameplay fundamentally, so this really only applies to factorio 1.1 or 2.0 with space-age disabled. Gleba in particular breaks the JIT model. ( Fuck Gleba. )
Busses are the opposite of good factorio factories. They undo a lot of the benefits of a healthy just-in-time (JIT) manufacturing, by encouraging massive amounts of buffer (belt-buffer).
They also encourage people to anti-learn fundamental principles. You often see people do "starter-busses" with 4 lanes of iron plates, but fed by only one actual belt's worth of smelting. Then people look for all kinds of "balancing" solutions to try to alchemize one belt into 4 belts.
They encourage massive amounts of over-spend on expensive splitters to keep "balancing" the bus to make it look neat, over actually just focusing on what needs to be built.
Spaghetti on the other hand is much better for actually getting to the end-goal. Start by placing what you want to build, then look for what it needs. Work out how to feed it by any means necessary. If you don't have enough input, then build more of that input. Then repeat as necessary.
There's no such thing as too much input. With short belts (even direct insert where possible), buffers are kept to a minimum, and any "overproduction" is stopped at source, because assemblers don't produce if they have nowhere to output into.
The biggest classic beginner mistakes in factorio are:
- Sticking things in chests. Even worse, trying to "maintain production" by picking up those chest contents. ( This comes from an RTS mindset where "idle" workers are a big sin. )
- Trying to increase throughput by replacing yellow belt with red belt when their yellow belt wasn't saturated.
- Looking for guides and discovering "The Main Bus".
That last point is so common, and not only does it take away some of the creativity of the game, but busses are inherently a bad solution that makes all bases look the same, and produces a mediocre result.
Look at how speedrunners are able to complete the game on default settings in sub 2hr30. They're not producing oodles of red belts. They're not producing main busses. They're not even producing railways. They're hyper focused on what's actually needed, which is very little indeed.
I have anecdotally tried using screenshots to help models debug their factories, but without training a custom CNN/ViT on the Factorio UI, the visual outputs miss critical things (e.g. gaps in transport belts).
That said, we have demonstrated via unit tests that the API is technically sufficient to progress to a rocket launch alone. We have been able to complete most lab tasks using the API ourselves so the humans still have a hefty lead here! The ones that we didn't do are the late-game lab tasks, which would have taken significant time and which frontier models are far from being able to complete.
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea, I assume most Factorio players that keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.
I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control intensive and has a high chance for even pro players to screw it up when attempting to do so. It also doesn't seemingly give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
However, our results only loosely indicate that smarter models use fewer ticks. I have an intuition that we will get more signal for tick-efficiency as models improve.
We designed the API to be as spatially descriptive as possible (including x-y coordinates and neighbors in game-state descriptions), and the agents have tools to aid them in carrying out actions which would benefit from vision (e.g. finding buildable areas of different sizes on the map, placing entities next to other entities, etc.).
As Jack said, we completed most of the lab tasks manually ourselves, and while it took us a lot longer compared to having vision, the tasks were still doable and human performance is significantly higher than current agents. We are thinking of supporting vision for future evals, but from a small number of tests we ran, current models got even more confused as the number of entities on the map grows quite quickly. This is likely due to VLMs being notoriously bad at visual reasoning on images with lots of detail; in a game where one misplaced entity in a large factory breaks everything, the errors start to compound.
Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.
I disagree that you need a significant amount of thinking ahead. At the beginning spaghetti belt is fine, as you have few resources and you don't have the luxury of overbuilding. Once you start getting "bigger" and into more complex designs you can just leave what you already built how it is and build the new stuff somewhere else.
By the time you need to produce thousands of pieces of an item you can probably prepare a blueprint that builds the whole factory in a click.
My approach to Factorio is built on phases:
1: build ad hoc infrastructure for the specific material that I need, close to the raw resources
2: prepare blueprints for specific resources, so that if I need more of something I can just build an extra factory. I make the blueprints so that I can compose them, like input belts on one side and output belts on the other. Such "factories" are almost self-contained, in that they take in only a subset of materials (plates, plastic and stuff that involves liquids) and produce all the intermediate materials themselves. This leaves some optimizations on the table, but simplifies the logistics. I use trains to fetch resources from far away.
3: compose the blueprints of the previous step to make "megafactories" with stations included. While at step 2 input and output of the factories are belts, at this step the input/output are train stations for specific material (with proper names, so I can add a new factory and trains will start delivering materials right away)
Of course my approach is not the only possible and probably not even efficient. I play for fun, with no care for the time it takes, as long as the time spent is enjoyable.
Edit: IMO the biggest difference between Satisfactory and Factorio is that Satisfactory has no crises. If a Satisfactory base shuts down it is annoying, but you can dig another miner / build another plant / etc, entirely at your leisure. But in Factorio, a shutdown is an emergency with a ticking clock.
First MVP stupid designs, then optimized routing, and eventually usable ingame where it connects with provided in/outputs.
Would be more fun to develop than to play obviously..
I liked the nilhouse mega base with those factory-train-block blueprints; it's basically Factorio DUPLO.
https://factorio.com/blog/post/fff-377, https://factorio.com/blog/post/fff-389, https://factorio.com/blog/post/fff-403
Advantages of yellow belts:
- Don't consume power
- Don't produce pollution ( don't attract bugs )
- Are cheap. Rails themselves aren't too bad, costing only slightly more per tile than belts. But rails need stations, signals, chests, inserters for loading / unloading. Often combinators too. All add up to much higher investment, and critically in the early game, add up to more pollution produced before the pay-off of improved resources incoming.
- Don't need lots of belt for the stations. A typical loading/unloading station can often have so much belt for "efficient" (fast) loading/unloading that you could have belt half your way to where you're going just laying that belt in a straight line.
If you're going further than the first ring of extra resources, then trains are amazing, but there are more than enough resources several times over in the first ring of resources to get a rocket launched. The first expansion ring tends to have 500k-1Mil per patch, and have several patches, so there's no need to go miles out pre-rocket.
It takes surprisingly little raw resources to actually launch a rocket. Someone did the maths once and calculated completing the game as needing a minimum of ~500k iron ore. At 15/s, a single yellow belt can deliver that in under 10 hours, way below the time of a typical playthrough. This technically means a single smelting line is all you need to actually complete the game in a reasonable time. Of course, trying to do so from a single lane would be extremely painful, and would need a lot of attention to preventing over-production of intermediates, especially when it's much easier to make a few smelting lanes. I'm not recommending that, but I am recommending just slapping down new lanes and production wherever you feel like it and whenever you need, rather than pre-planning 4 lanes of plates that aren't actually useful or efficient.
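The back-of-envelope arithmetic checks out (taking the ~500k ore figure as given):

```python
# ~500k iron ore over a single yellow belt at 15 items/sec.
ore = 500_000
belt_rate = 15                     # items/sec for one fully used yellow belt
hours = ore / belt_rate / 3600     # seconds → hours
print(round(hours, 1))             # → 9.3, i.e. under 10 hours
```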
Unless you mean your "main bus" is just 1x copper and 2x iron lane, in which case fair enough, but when I attack the concept of the bus, that's not what I'm railing against. What I'm railing against is the design pattern where people put down 4 lots of 4-wide lanes to bus far more iron, copper and intermediates than will ever be needed to actually launch a rocket.
Busses aren't efficient by any metric, other than minimising personality, creativity and thinking.
With a bus you just follow the same template done before, it'll get you to the end, but it teaches bad habits and isn't efficient. It isn't quick either, you'll spend a long time putting down lanes and splitters and undergrounds. All of which need producing.
For late-game (post-rocket), then either a bus design or train / city-block designs can work well. I prefer trains, but large busses have their place for mega-bases.
But for pre-rocket, or anyone starting out, spaghetti is absolutely the way to go. It'll also better teach you via your own mistakes.
I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have recently released Qwen 2.5 VLM which seems to be quite strong for its size.
You harp on this lack of spatial ability a fair amount, which - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?
Thanks for this amazing bit of work, I really am reorganizing my day to play with it now.
P.S. Seems like MCP-enabling the Python library is a natural must-do, so that all tool-enabled LLMs everywhere can play Factorio.
So, I almost never build a large main bus style base, but it’s a fun (and helpful) part of the game’s design and tooling to allow you to create very large working systems out of a variety of component scales — “buses” enable this, and make it MUCH easier to implement.
I'm not sure we can conclude the game's not in their training set - a lot has been written about Factorio on the internet, and a lot has been recorded; the newer multimodal LLMs have trained on YouTube, especially Gemini.
But it (for me at least) is so much more fun building the spaghetti and making things work, refactoring as you go, and expanding organically.
I wonder what a quick way to calculate how many Unicode characters you'd need is... I guess every entity + four orientations. Underground belts and pipes seem tough, but I guess you could just add an encoding showing whether the square has an underground belt or pipe.
I propose this would work. I think I’ll give it a try today.. I’d love dwarf fortress factorio. That said, the encode/decode phase seems like a lot of tokens for a model that’s not trained to understand the Unicode ‘map’. Seems like you’d want to fine tune something at least. Maybe a layout model.
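A quick sketch of the idea: one glyph per tile, keyed on (entity, orientation). The glyph choices below are arbitrary, and a real encoding would need to cover every entity type plus undergrounds:

```python
# Arbitrary glyph table: (entity, orientation) → one character per tile.
GLYPHS = {
    ("belt", "N"): "↑", ("belt", "E"): "→",
    ("belt", "S"): "↓", ("belt", "W"): "←",
    ("assembler", None): "A", ("drill", None): "D",
    (None, None): ".",  # empty tile
}

def render(grid):
    """Render a 2D grid of (entity, orientation) cells as a text map."""
    return "\n".join("".join(GLYPHS[cell] for cell in row) for row in grid)

grid = [
    [("drill", None), ("belt", "E"), ("belt", "E"), ("assembler", None)],
    [(None, None),    (None, None),  (None, None),  ("belt", "S")],
]
print(render(grid))
# D→→A
# ...↓
```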
Good point with MCP as well given it has been blowing up lately, we'll look into that!
I don't remember the details, as I've only watched for a few minutes and found the whole thing boring, but he seems to have been going at it for a few weeks now, so it's probably working on some level.
I think this could be a good starting point for what you describe! This stuff is always more fun to develop than to play. Since I started working on this project, I can't bring myself to play the core game myself...
The main bus is for production capacity discovery. New players have to discover the production tree somehow and browsing recipe cards with a calculator before starting is not the way, it has to be hands-on and interactive. Dividing up your factory into (arbitrarily defined) intermediates production and final product production is helpful to cut through the complexity. Of course, intermediate production will be severely underbuilt and logistics capacity will never keep up with consumption of all of the final product builds, but it will be enough to run a couple builds at once and test new ones.
Once new players have a grasp of the scope of the production tree, with tangible examples of sub-factories, then they can begin to consider end-to-end production capacity.
Sure, if you already know what you're doing, you can plan ahead and leave space for the right things. For everyone else, it simplifies the game to a manageable level.
Inworld's been doing this but haven't seen what they've done recently. https://inworld.ai/blog/inworld-stardew-valley-ai
The question is what the most efficient and high-quality representation we could use to improve that would be.
I think you just described a research paper that would advance SOTA. Less describing why, but how. (Assuming it's not just "we finetuned the model and it worked perfectly".)
Our sparse encoding seems to confuse the models less - even though it certainly isn't perfect.
I work on a similar factory game (Captain of Industry) and I have always wanted an agent that can play the game for testing and balancing reasons. However, pixels-to-mouse-actions RL policy (similar to Deep Mind's StarCraft agent) always seemed like a very hard and inefficient approach. Using code-like API seems so much better! I might try to find some time to port this framework to COI :) Thanks for sharing!
- The factory is not powered, place the missing power pole(s)
- The factory is missing items, place the missing belt(s)
- Craft and place these 200 assembly machines
- The assembly machine is not running for some reason, fix it
- The factory production is too low, double it
- Get to this other point in the factory as fast as possible
- Fix the brownout
- All of the above with and without bots
Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used. I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity, where more complex scenarios are presented after the agent scores sufficiently high on lower-complexity scenarios.
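The bank-and-sample mechanism is simple to sketch; the scenario templates and level counts below are made up for illustration:

```python
import random

def make_bank():
    """Generate a bank of scenarios tagged with a difficulty level."""
    bank = []
    for level in range(1, 6):
        for i in range(200):  # 1000 generated scenarios total
            bank.append({"level": level,
                         "task": f"place {level * 10} belts (variant {i})"})
    return bank

def draw(bank, level, n=12, rng=None):
    """Sample n scenarios at or below the agent's current level."""
    rng = rng or random.Random(0)  # seeded for reproducible evals
    eligible = [s for s in bank if s["level"] <= level]
    return rng.sample(eligible, n)

bank = make_bank()
batch = draw(bank, level=2)
print(len(batch), all(s["level"] <= 2 for s in batch))  # → 12 True
```

Promotion to the next level would then be gated on the score over a batch.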
With no buffering, as soon as your demand for steel is greater than your production of steel, the bottleneck is immediate and obvious. The solution is also immediate and obvious: build more steel.
Buffering, in particular belt buffering, and in particular busses, contributes to masking this issue. There can be a long delay between consumption rising above production and the symptoms appearing, so the root cause can be very hidden. It may also be that the ultimate root cause is that steel production is low because it's limited in how much iron ore it gets. If everything is bussed, then it can be hours before resource constraints are hit, by which point it's very hard to see what's happened to cause the shortage, and also by which time the factory may have expanded further.
It also contributes to see-saw production, whereby a shortage in one area causes a pause, which alleviates the root-cause shortage for a while by backing up other production. The longer the lag between cause and effect, the greater the banding effect, further masking the root cause.
A bus also encourages bottom-up building, which further encourages massive over-consumption of base resources. If you start building green chip production and bussing it, the bus will fill and buffer, making it look like you've got plenty of green chips. In turn, as the green chip production stops, it'll look like you've got plenty of iron plates. You'll build all your malls and other production, satisfied as you build each one that it runs fine and isn't over-consuming.
Only later, when everything starts to run at once, do you realise that the stuff down the end of the line is getting scant resources, as previously each part was running in isolation before serious amounts were required.
In contrast, a top-down approach involves building the final result first, then at each step building what's needed to feed it. This ensures that there is always enough provision, and everything can be placed to minimise buffer to reduce lag and improve feedback time on problems. It also reduces pollution since any item on a belt represents inventory for which you've paid a pollution cost but not got any final results from yet.
The spaghetti approach can lead to "under-utilised" buildings, such as a smelting array that ends up only needing to supply 0.3 of a belt. But in Factorio space is almost endless, and there's little to no cost to idle buildings. The power drain of idle assemblers, particularly the bare (no module) level 2 buildings you'll likely be building before end-game, is extremely low.
For late game post-rocket, this changes of course. With beacons and level 3 assemblers with modules, the idle draw is significant, and you may want to optimise ratios and look to eliminate how many assemblers you run idle. ( That said, power is almost a non-issue in 2.0, with nuclear power being much easier to run efficiently than previously, so the large solar fields aren't really needed anymore. )
Busses have a strong visual appeal, but unlike "cable management", there's no airflow to consider in factorio. A messy spaghetti base isn't inherently inefficient. It doesn't affect productivity to just run short belts all over.
The visual temptation of the mega-bus is clearly alluring, it looks good on youtube video guides.
That compresses nicely into text, I imagine.
I'd like to hear more details about your symbolic approach!
I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?
I wonder if this same approach could be used here in factorio? Using the pokemon red analogy the main "essential tasks" in Factorio are setting up automation for new items and new science packs. I think a good reward function could involve small rewards functions for production rates of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack.
Telling a Factorio agent to just "make a big factory" is like telling a pokemon red agent to just "beat the game", it has to be broken down into smaller steps with a very carefully tuned reward function.
Thinking about this is really making me want to jump into this project!
Until you accidentally feed a different material into your belt and need to clean it up
Also in general I think the issue with Factorio is that you can just find an "optimal" factory design and build order and just follow it every time; perhaps starting with a suboptimal building layout already present and restrictions like being unable to change them or build others of the same type could help.
I think you bring up a good point, we could create tasks where the goal is to optimise a static factory, starting from a kernel of functionality like 'steam engine power supply' etc.
The way I think of it is this. Yes, the LLM is a "general reasoner." However, it's locked in a box, where the only way in and out is through the tokenizer.
So there's this huge breadth of concepts and meanings that cannot be fully described by words (things like spatial reasoning, smells, visual relationships, cause-and-effect physical relationships, etc). The list of things that can't be described by words is long. The model would be capable of generalizing on those; it would optimize to capture those. But it can't, because the only thing that can fit through the front door is tokens.
It's a huge and fundamental limitation. I think Yann LeCun has been talking about this for years now and I'm inclined to agree with him. This limitation is somewhat obscured by the fact that we humans can relate to all of these untokenizable things -- using tokens! So I can describe what the smell of coffee is in words and you can immediately reconstruct that based on my description, even though the actual smell of coffee is not encoded in the tokens of what I'm saying at all.
1. Create an (intermediate) goal for resource production
2. Create a factory graph with the calculated number of machines and the number of resources to transport between them. This could be done using linear programming (as in the Factorio calculator)
3. Somehow map the resulting graph to a hardware description language, such that each entity maps to a unique logic component and each transport lane maps to a unique wire (most difficult)
4. Compile to a 2D FPGA layout using all the VLSI algorithms like partitioning and routing (HDL compiler)
5. Map the resulting plan back to a concrete Factorio design
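Step 2 doesn't even need a full LP solver when the recipe graph is acyclic: machine counts follow from back-propagating required rates through the graph. A sketch, with illustrative recipe numbers (not exact Factorio values):

```python
# item: (crafts_per_second_per_machine, {ingredient: amount_per_craft})
# These rates are illustrative stand-ins, not real Factorio recipe data.
RECIPES = {
    "electronic-circuit": (2.0, {"iron-plate": 1.0, "copper-cable": 3.0}),
    "copper-cable": (4.0, {"copper-plate": 0.5}),
    "iron-plate": (0.3125, {}),    # smelted in furnaces
    "copper-plate": (0.3125, {}),  # smelted in furnaces
}

def machines_needed(item, rate, totals=None):
    """Return {item: machine_count} needed to sustain `rate` items/sec."""
    if totals is None:
        totals = {}
    speed, ingredients = RECIPES[item]
    totals[item] = totals.get(item, 0.0) + rate / speed
    # Back-propagate the required input rates to each ingredient.
    for ingredient, amount in ingredients.items():
        machines_needed(ingredient, rate * amount, totals)
    return totals
```

A real version would need LP once you add shared intermediates with multiple producers, byproducts (oil processing), or module/beacon effects, but this covers the common acyclic case.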
> the difficulty level of different tasks is subjective
That makes sense. I wonder if the difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. it performs better at scenario T if it trains on scenario A first, but training on scenario B first doesn't help with T. Then infer A < T, and B ? T.
It makes sense why LLMs are bad with spatial reasoning. Not a lot of training data for it. I wonder what additional reasoning abilities will emerge when spatial reasoning is solved.
The reason I was suggesting this is that I worked in robotics making RL policies, and supplying image data (be it maps, lidar scans, etc.) was a common practice. But our networks were custom made to ingest these data and trained from scratch, which is quite different from this approach.
In my experience the current generation of models is very poor at spatial reasoning, even when given accurate coordinate-based location assignments for each object. But I suspect that once a model can build up the full relationship of all objects from those spatial relationships given as a vector, it will be much better.
Coding seems very close to puzzle solving in this regard.
* It'll output a broken script
* I tell it what's wrong and how to fix it
* It tells me I'm absolutely right and that it will correct it
* It outputs a script with the exact same brokenness
That was also one of our goals: to find out how the models act when given a very vague and general objective.
The model could also then be fed back the results of running the program and iteratively change it as needed.
I.e. prompt first with "Write a program that can play Factorio automatically given an interface <INTERFACE SPECIFICATION> and a set of goals in <GOAL FORMAT>, and produces text output that can help determine whether the program is working correctly and whether tasks are performed efficiently and goals are reached as fast as possible"
And then with "the program was run and produced this text output: <TEXT OUTPUT> Determine any possible bugs, avenues of improvements or missing output information and modify the program accordingly, printing the new version".
And iterate until there doesn't seem to be an improvement anymore.
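That generate-run-refine loop could be sketched as follows. Here `llm` and `run_program` are stand-ins for a chat-completion call and a sandboxed runner; neither is a real API, and the prompts are abbreviations of the ones above:

```python
def refine_program(llm, run_program, interface_spec, goal_format, max_iters=5):
    """Iteratively ask an LLM to write and then improve a Factorio-playing
    program, stopping when it no longer changes its answer."""
    program = llm(
        f"Write a program that plays Factorio given {interface_spec} "
        f"and goals in {goal_format}; print diagnostic output."
    )
    for _ in range(max_iters):
        output = run_program(program)
        revised = llm(
            f"The program was run and produced this output:\n{output}\n"
            "Determine any bugs, improvements or missing output, "
            "modify the program accordingly, and print the new version."
        )
        if revised == program:  # fixed point: no further improvement
            break
        program = revised
    return program
```

The fixed-point stopping criterion is crude (a model may oscillate between two versions rather than converge), so in practice you'd also want a score from `run_program` and keep the best-scoring version seen.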
Put machine #1 at the starting location, run in one direction, and put machine #2 just before time runs out.
This is going to be a huge factory (as measured by its bounding box) but it's not super interesting.
That'd be actually interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which'd be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world model capabilities, which would be even more interesting)
Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".
Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]
This is a ton of compute power and complexity for what is basically a shitty AI. It has no practical purpose. Better AIs have been built with less, why don’t people appreciate them? Or do we just take them for granted?
This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.
Records and specific rules for all categories can be found at https://www.speedrun.com/factorio
For programming-like tasks, I expect a similar-ish distribution to that in programming, see e.g. https://web.lmarena.ai/leaderboard
I think a tokenization of ratios between perceived boundaries might help. But I'm just shooting in the dark.
While I appreciate the effort and creativity that went into this there are a lot of much simpler dynamic benchmarks that can let you saturate the planning capabilities of non-reasoning models.
Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.
Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:
| Model             | Path Length |
|-------------------+-------------|
| Claude Sonnet 3.5 | 10          |
| GPT-4o            | 7           |
| GPT-4o-mini       | 4           |
| Deepseek-v3       | 6           |
| Gemini-2-Flash    | Not tested  |
| Llama3.3-70B-Ins  | 4           |
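Ground truth for this kind of benchmark is just breadth-first search over the flight graph. A sketch of the checker, assuming undirected connections (my assumption; a directed variant only needs the symmetric insertion removed):

```python
from collections import deque

def shortest_path_len(edges, src, dst):
    """BFS shortest path length (number of flights) between two cities,
    or None if no itinerary exists."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)  # flights assumed bidirectional
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        city, dist = queue.popleft()
        if city == dst:
            return dist
        for neighbor in adjacency.get(city, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

To make the test harness, you'd generate random graphs, pick city pairs whose BFS distance equals the target length, and score a model's proposed itinerary by checking every leg exists and the hop count matches.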
Isn't it literally infinite via even the simplest simulator?
You could generate an unlimited training set just by implementing tic-tac-toe on an unbounded grid, for example, in like 10 lines of code.
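The core of that simulator really is about that size. A sketch: the board is a dict from coordinates to players, and a win check scans the four line directions from the last move (`k` in a row, my parameterization):

```python
def wins(board, player, move, k=3):
    """Check whether `player`'s stone at `move` completes k in a row
    on an unbounded grid stored as {(x, y): player}."""
    x, y = move
    for dx, dy in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1  # the stone just placed
        for sign in (1, -1):  # walk both ways along the line
            nx, ny = x + sign * dx, y + sign * dy
            while board.get((nx, ny)) == player:
                count += 1
                nx, ny = nx + sign * dx, ny + sign * dy
        if count >= k:
            return True
    return False
```

Pair this with any two policies (even random ones) and every game is a fresh, automatically labeled training episode.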
The other recent example that comes to mind is the paper that explored the reasoning process used by LLMs to answer trivia questions like “Name a national capital whose letters can be rearranged to spell a common greeting in the language of a neighboring country.” (answer is Hanoi by the way)
The LLM responses show that they intuitively grasp the algorithm for answering such a question, but then they basically run the algorithm in their own thoughts (self-talk) which is horrendously inefficient.
Put differently, natural language reasoning is brilliant at turning the messiness of the real world into well-defined abstractions, but as soon as that is done it needs to hand off the task to a machine. For “solved” problems this might be a formally specified machine, but it could also be another class of model such as AlphaZero (along with a proper specification of the problem the “subcontractor” is to handle).
In this particular case with Factorio, I suspect generating the synthetic data would be easier, since the rules of the environment are relatively simple and well defined, with quantifiable outcomes.
Having read the pdf I don't think these models were post-trained, so how do we explain the questions in B)?
And if indeed there's no post-training and authors expected exploration of recipes to come from the context window.... I think that's way too short for RL-style improvement.
In short, I don't understand how they could've tested those models with post training, and without post training they all did unbelievably well.
If the authors read this: can you give us an idea how many API query and response pairs fit within the context window, on average? Follow-up: do you get better results if you abbreviate the API call names, so that more response pairs fit within one context window?
We can fit about 128 pairs maximum in the context, but this performed the same as 32, which we ultimately decided on (for cost and latency purposes).
Encoding the inputs/outputs to make them shorter degraded performance. It seems that descriptive names are helpful for pretrained models because they have an intuition about what they do.
Overall, as Jack said, no post-training was done at all, but all agents had a complete API description (tools, entities, research) in their context, so the results indicate to some degree how well modern agents can use a completely OOD API with a decent level of documentation.
For example the shortest path benchmark is largely useless when you look at reasoning models - since they have the equivalent of scratch paper to work through their answers the limitation became their context length rather than any innate ability to reason.
Is it just because Claude is the best at coding and the API is code? (Not very interesting.) Maybe if the API required the LLMs to write in poems, the best LLM at poetry would win...
Or is it because whatever makes claude good at coding, also makes it good at mathematical-like tasks. This is more interesting, as it would show some transfer learning. It would also suggest if you're doing training for a specific task, you would also benefit from training adjacent tasks e.g. if you're training for maths you could benefit from training coding. I believe this is actually true for humans.
And would you know how to check whether any of the above hypotheses is correct?
Overall visual perception is about noticing comparative differences not measuring absolute quantity.
Oil stuff is done separately and fed into the structure where needed.