We are nowhere near that point for autonomous robots, and it's not even funny. To keep using the internet as an analogy for LLMs, we are pre-ARPANET, pre-ASCII, pre-transistor. We don't even have the sensors that would make safe household humanoid robots possible. Any theater from robot companies about training a neural net on motion capture is laughably foolish. At the current rate of progress, we are decades away, if not more.
I’m sure they could pretty easily spin up a site with 200 of these processing packages of most sizes nonstop (they have a limited number of standardized package sizes). Remove the ones it gets right 99.99% of the time and keep training on the more difficult ones, then move on to individual items.
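Roughly what I mean by "remove the ones it gets right", as a toy Python sketch. Everything here is made up for illustration: the `PackageStats` class, the `still_needs_training` helper, the package names, and the `min_attempts` cutoff (I added that so a package type isn't dropped on the strength of a handful of lucky runs). Only the 99.99% threshold comes from the comment above.

```python
from dataclasses import dataclass


@dataclass
class PackageStats:
    """Running tally for one standardized package size (hypothetical)."""
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # No attempts yet counts as 0.0, so the size stays in training.
        return self.successes / self.attempts if self.attempts else 0.0


def still_needs_training(stats, threshold=0.9999, min_attempts=10_000):
    """Return the package sizes that are not yet mastered.

    A size is dropped only when it has enough attempts AND its
    success rate is at or above the threshold.
    """
    return [
        size
        for size, s in stats.items()
        if s.attempts < min_attempts or s.success_rate < threshold
    ]


# Invented numbers, purely for illustration.
stats = {
    "small_box": PackageStats(attempts=20_000, successes=19_999),
    "poly_bag": PackageStats(attempts=20_000, successes=19_500),
    "large_box": PackageStats(attempts=500, successes=500),
}

print(still_needs_training(stats))
```

With these numbers, `small_box` gets dropped, `poly_bag` stays because it fails too often, and `large_box` stays because 500 attempts is not enough to trust a perfect record.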
Caveat: I have no idea what I’m talking about.