
1480 points sandslash | 30 comments
1. anythingworks ◴[] No.44314766[source]
loved the analogies! Karpathy is consistently one of the clearest thinkers out there.

interesting that Waymo could do uninterrupted trips back in 2013, wonder what took them so long to expand? regulation? the tail end of driving optimization issues?

noticed one of the slides had 'AGI 2027' crossed out... ai-2027.com :)

replies(2): >>44314822 #>>44315438 #
2. AlotOfReading ◴[] No.44314822[source]
You don't "solve" autonomous driving as such. There's a long, slow grind of gradually improving things until failures become rare enough.
replies(1): >>44314866 #
3. petesergeant ◴[] No.44314866[source]
I wonder at what point all the self-driving code becomes replaceable with a multimodal generalist model with the prompt “drive safely”
replies(4): >>44314937 #>>44315054 #>>44315210 #>>44316357 #
4. AlotOfReading ◴[] No.44314937{3}[source]
One of the issues with deploying models like that is the lack of clear, widely accepted ways to validate comprehensive safety and absence of unreasonable risk. If that can be solved, or regulators start accepting answers like "our software doesn't speed in over 95% of situations", then they'll become more common.
5. ◴[] No.44315054{3}[source]
6. anon7000 ◴[] No.44315210{3}[source]
Very advanced machine learning models are already used in current self-driving cars. It all depends on what the model is trying to accomplish. I have a hard time seeing a generalist prompt-based generative model ever beating a model specifically designed to drive cars. The models are just designed for different, specific purposes.
replies(1): >>44315369 #
7. tshaddox ◴[] No.44315369{4}[source]
I could see it being the case that driving is a fairly general problem, and thus models intentionally designed to be general end up doing better than models designed with the misconception that you need a very particular set of driving-specific capabilities.
replies(3): >>44315469 #>>44316063 #>>44318089 #
8. ActorNightly ◴[] No.44315438[source]
> Karpathy is consistently one of the clearest thinkers out there.

Eh, he ran Tesla's self-driving division and pointed it in a direction that is never going to fully work.

What they should have done is a) trained a neural net to map sequences of frames into a representation of the physical environment, and b) leveraged MuZero, so that the self-driving system builds out parallel simulations into the future and searches for the best course of action to take.

Because that's pretty much what makes humans great drivers. We don't need to know what a cone is - we internally compute that colliding with an object on the road we are driving towards is going to result in a negative outcome.
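
Conceptually, something like this toy sketch - every number here is made up, and it brute-forces where a real system would use a learned world model plus MCTS:

    import itertools

    ACTIONS = ["brake", "coast", "accelerate", "steer_left", "steer_right"]

    def simulate(state, actions):
        # advance a toy world: distance to obstacle (m), own speed (m/s)
        x, v = state
        for a in actions:
            v = max(v + {"brake": -2.0, "accelerate": 1.0}.get(a, 0.0), 0.0)
            x -= v * 0.5  # 0.5 s per step
        return x, v

    def score(state):
        x, _ = state
        return float("-inf") if x <= 0 else x  # collision is catastrophic

    def plan(state, horizon=3):
        # brute-force stand-in for the tree search MuZero would actually do
        return max(itertools.product(ACTIONS, repeat=horizon),
                   key=lambda seq: score(simulate(state, seq)))

    print(plan((20.0, 15.0)))  # -> ('brake', 'brake', 'brake')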

replies(5): >>44315487 #>>44315714 #>>44315737 #>>44316272 #>>44317304 #
9. anythingworks ◴[] No.44315469{5}[source]
exactly! I think that was tesla's vision with self-driving to begin with... so they tried to frame it as a problem general enough that solving it would also answer questions of more general intelligence ('agi'), i.e. cars should use vision just like humans would

but in hindsight it looks like this slowed them down quite a bit despite being early to the space...

10. visarga ◴[] No.44315487[source]
> We don't need to know what a cone is

The counter argument is that you can't zoom in and fix a specific bug in this mode of operation. Everything is mashed together in the same neural net process. They needed to ensure safety, so testing was crucial. It is harder to test an end-to-end system than its individual parts.

11. AlotOfReading ◴[] No.44315714[source]
Aren't continuous, stochastic, partial-knowledge environments where you need long-horizon planning with strict deadlines and limited compute exactly the sort of environments MuZero variants struggle with? Because that's driving.

It's also worth mentioning that humans intentionally (and safely) drive into "solid" objects all the time. Bags, steam, shadows, small animals, etc. We also break rules (e.g. drive on the wrong side of the road), and anticipate things we can't even see based on a theory of mind of other agents. Human driving is extremely sophisticated, not reducible to rules that are easily expressed in "simple" language.

replies(1): >>44334078 #
12. tayo42 ◴[] No.44315737[source]
Is that the approach that waymo uses?
replies(1): >>44332810 #
13. shakna ◴[] No.44316063{5}[source]
Driving is not a general problem, though. It's a contextual landscape of fast-paced reactions and predictions. Both are required, and done regularly by the human element. The exact nature of every reaction, and every prediction, changes vastly within the context window.

You need image processing just as much as you need scenario management, and they're orthogonal to each other, as one example.

If you want a general transport system... We do have that. It's called rail. (And can and has been automated.)

replies(2): >>44316240 #>>44318075 #
14. melvinmelih ◴[] No.44316240{6}[source]
> Driving is not a general problem, though.

But what's driving a car? A generalist human brain that has been trained for ~30 hours to drive a car.

replies(1): >>44316689 #
15. suddenlybananas ◴[] No.44316272[source]
That's absolutely not what makes humans great drivers?
replies(1): >>44332811 #
16. yokto ◴[] No.44316357{3}[source]
This is (in part) what "world models" are about. While some companies like Tesla bring together a fleet of small specialised models, others like CommaAI and Wayve train generalist models.
17. shakna ◴[] No.44316689{7}[source]
Human brains aren't generalist!

We have multiple parts of the brain that interact in vastly different ways! Your cerebellum won't be doing the job of the pons.

Most parts of the brain cannot take over for others. Self-healing is the exception, not the rule. Yes, we have a degree of neuroplasticity, but there are many limits.

(Sidenote: Driver's license here is 240 hours.)

replies(3): >>44317314 #>>44317648 #>>44319940 #
18. impossiblefork ◴[] No.44317304[source]
I don't think that would have worked either.

But if they'd gone for radars and lidars and a bunch of sensors and then enough processing hardware to actually fuse that, then I think they could have built something that had a chance of working.

replies(1): >>44334093 #
19. Zanfa ◴[] No.44317314{8}[source]
> Human brains aren't generalist!

What? Human intelligence is literally how AGI is defined. The brain's physical configuration is irrelevant.

replies(1): >>44318522 #
20. azan_ ◴[] No.44317648{8}[source]
> We have multiple parts of the brain that interact in vastly different ways!

Yes, and thanks to that, human brains are generalist.

replies(1): >>44318510 #
21. TeMPOraL ◴[] No.44318075{6}[source]
It partially is. You have the specialized part of maneuvering a fast moving vehicle in physical world, trying to keep it under control at all times and never colliding with anything. Then you have the general part, which is navigating the human environment. That's lanes and traffic signs and road works and schoolbuses, that's kids on the road and badly parked trailers.

The current breed of autonomous driving systems has problems with exceptional situations - but based on all I've read so far, those are exactly the kind that would benefit from a general system able to understand the situation it's in.

replies(1): >>44323579 #
22. mannicken ◴[] No.44318089{5}[source]
Speed and Moore's law. You don't just need to make a decision without hallucinations, you need to make it fast enough to propagate to the power electronics and hit the gas/brake/turn the wheel/whatever. Over and over and over again on thousands of different tests.
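
To make the constraint concrete, a toy sketch (the 50 ms number is illustrative, not any real spec):

    import time

    TICK = 0.05  # 50 ms control period, made up for illustration

    def control_loop(policy, read_sensors, send_actuators):
        while True:
            deadline = time.monotonic() + TICK
            command = policy(read_sensors())  # perception + planning must fit here
            send_actuators(command)           # gas / brake / steering
            slack = deadline - time.monotonic()
            if slack < 0:
                # a model that answers correctly but late is still wrong
                raise RuntimeError(f"missed deadline by {-slack * 1000:.1f} ms")
            time.sleep(slack)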

A big problem I am noticing is that the IT culture over the last 70 years has existed in a state of "hardware gonna get faster soon". And over the last ten years we've had a "hardware can't get faster bc physics sorry" problem.

The way we've been making software in the 90s and 00s just isn't gonna happen anymore. We are used to throwing more abstraction layers (C->C++->Java->vibe coding etc) at the problem and waiting for the guys in the fab to hurry up and make their hardware faster so our new abstraction layers can work.

Well, you can fire the guys in the fab all you want, but no matter how much they try to yell at nature, it doesn't seem to care. They told us embedded C++ monkeys to spread the message. Sorry, Moore's law is over, boys and girls. I think we all need to take a second to take that in and realize the significance of it.

[1] The "guys in the fab" are a fictional character and any similarity to the real world is a coincidence.

[2] No C++ monkeys were harmed in the making of this comment.

23. shakna ◴[] No.44318510{9}[source]
Only if it were a singular system, which it is not. [0]

For example... the nerve cells in your gut may speak to the brain, and interact with it in complex ways we are only just beginning to understand, but they are separate systems that each exert control over the nervous system and other systems. [1]

General Intelligence, the psychological theory, and General Modelling, whilst sharing words, share little else.

[0] https://doi.org/10.1016/j.neuroimage.2022.119673

[1] https://doi.org/10.1126/science.aau9973

24. shakna ◴[] No.44318522{9}[source]
A human brain is not a general model. We have multiple overlapping systems. The physical configuration is extremely relevant to that.

AGI is defined in terms of "General Intelligence", a psychological theory to which general modelling is irrelevant.

25. yusina ◴[] No.44319940{8}[source]
240 hours sounds excessive. Where is "here"?
26. tshaddox ◴[] No.44323579{7}[source]
Yes, that’s exactly what I meant. I’d go even further and say the hard parts of driving are the parts where you are likely better off with a general model. And it’s not just signs, construction, police stopping traffic, etc. Even just basic navigation amongst traffic seems to require a general model of the other nearby drivers. It’s important to be able to model drivers’ intentions, and also to drive your own car in a predictable manner.
27. ActorNightly ◴[] No.44332810{3}[source]
Dunno what Waymo uses, but they definitely work in 3D space as a starting point, rather than trying to map sequences of pictures to actions. They also need training on specific areas.
28. ActorNightly ◴[] No.44332811{3}[source]
Enlighten me please.
29. ActorNightly ◴[] No.44334078{3}[source]
I didn't say use MuZero end to end, I said leverage it.

This is how I would do it:

First, you come up with a compressed representation of the state space of the terrain + other objects around your car that encodes the current state of everything and its predicted evolution ~5 seconds into the future.

The idea is that you would leverage physics - objects need to behave according to the laws of motion - which means you can greatly compress how this is represented. For example: a meshgrid of static "terrain" other than empty road, lane lines representing the road, and 3D boxes representing moving objects with a certain mass, each with an initial 6-DOF state (xyz position, orientation), initial 6-DOF velocities, and 6-DOF forcing functions of time that represent how these objects move.
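
Roughly this shape (sketch only - the field names are mine, not anything any company actually uses):

    from dataclasses import dataclass
    from typing import Callable
    import numpy as np

    Vec6 = np.ndarray  # x, y, z, roll, pitch, yaw

    @dataclass
    class TrackedObject:
        pose: Vec6                        # initial 6-DOF state
        velocity: Vec6                    # initial 6-DOF rates
        extent: tuple                     # bounding-box dimensions (m)
        mass: float
        forcing: Callable[[float], Vec6]  # predicted 6-DOF accelerations f(t)

    @dataclass
    class SceneState:
        terrain: np.ndarray               # static occupancy meshgrid
        lanes: list                       # lane-line polylines
        objects: list                     # TrackedObject instances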

So given this representation, you can write a program that simulates the evolution of the state space given any initial condition, and essentially simulate collisions.
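
The simulator is then just a short loop over that structure (again a crude sketch continuing the one above - real collision checking is far more careful than axis-aligned boxes):

    def step(obj, t, dt):
        # semi-implicit Euler under the object's forcing function
        vel = obj.velocity + obj.forcing(t) * dt
        pose = obj.pose + vel * dt
        return TrackedObject(pose, vel, obj.extent, obj.mass, obj.forcing)

    def collides(a, b):
        # crude axis-aligned box overlap test on x, y, z
        return all(abs(a.pose[i] - b.pose[i]) < (a.extent[i] + b.extent[i]) / 2
                   for i in range(3))

    def rollout(scene, horizon=5.0, dt=0.1):
        objs, t = scene.objects, 0.0
        while t < horizon:
            objs = [step(o, t, dt) for o in objs]
            for i, a in enumerate(objs):
                if any(collides(a, b) for b in objs[i + 1:]):
                    yield t, a  # predicted collision at time t
            t += dt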

Then you divide into 3 teams.

1st team trains a model to translate sensor data into this state-space representation, with continuous updates on every cycle, leveraging things like Kalman filtering because the correlations between certain measurements lead to better accuracy. Overall you would get something where things like red brake lights map to deceleration forcing functions.
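
The Kalman filtering part is standard; a minimal 1-D position/velocity version, just to show the shape of it (all the noise values are made up):

    import numpy as np

    F = np.array([[1.0, 0.1], [0.0, 1.0]])  # constant-velocity model, dt = 0.1 s
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q = np.eye(2) * 1e-3                    # process noise (made up)
    R = np.array([[0.5]])                   # sensor noise (made up)

    def kf_step(x, P, z):
        x, P = F @ x, F @ P @ F.T + Q       # predict
        S = H @ P @ H.T + R                 # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (z - H @ x)             # fuse measurement z with prediction
        P = (np.eye(2) - K @ H) @ P
        return x, P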

(If you wanted to get fancy, instead of a simulation you build out a probability space instead, i.e. when you run the program it would spit out a heat map of where certain objects are more likely to end up.)
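
A toy version of that fancier idea - sample noisy rollouts and histogram the endpoints (everything here is made up):

    import numpy as np

    def heat_map(x0, v0, t=5.0, samples=1000, bins=50):
        # 2-D endpoint distribution under noisy constant velocity
        rng = np.random.default_rng(0)
        noise = rng.normal(0.0, 1.0, size=(samples, 2))
        finals = x0 + (v0 + noise) * t
        hist, _, _ = np.histogram2d(finals[:, 0], finals[:, 1], bins=bins)
        return hist / samples  # probability of ending up in each cell

    print(heat_map(np.zeros(2), np.array([10.0, 0.0])).sum())  # 1.0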

2nd team trains a model on real-world traffic to find correlations between the forcing functions of vehicles, i.e. if a car slows down, the cars behind it slow down. You could do this kind of like Tesla did - equip all your cars with sensors, take driver inputs as the forcing function, and observe the state-space changes given the model from team 1.

3rd team trains a MuZero-like model given the two above. Given a random initial starting state, the "game" is to choose the sequence of accelerations, decelerations, and steering inputs (quantized to finite values) that gets the highest score by a) avoiding collisions, b) following traffic laws, c) minimizing disturbance to other vehicles, and d) maximizing space around your own vehicle.
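
The scoring would be in the spirit of a)-d), something like this (the weights are arbitrary placeholders, and sim_result is a hypothetical output of the simulator above):

    def score_plan(sim_result):
        if sim_result.collision:              # (a) avoid collision, hard veto
            return float("-inf")
        s = 0.0
        s -= 100.0 * sim_result.violations    # (b) traffic-law violations
        s -= 10.0 * sim_result.disruption     # (c) disturbance to other cars
        s += 1.0 * sim_result.min_clearance   # (d) space around own vehicle
        return s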

What all of this does is allow the model to compute not only expected behavior, but also what is realistically possible. For example, in a situation where a collision is imminent - like you sitting at a red light while the sensors detect a car rapidly approaching from behind - the model would make the decision to drive into the intersection when no cars are present to avoid getting rear-ended, which is quantifiably way better than the average human.

Furthermore, the models from teams 2 and 3 can self-improve in real time, which is equivalent to humans getting used to the driving habits of others in certain areas. You simply do batch training runs to improve the prediction capability for other drivers. Then, when your policy model makes a correct decision, you build a shortcut into the MCTS recording that this works, which means that within the finite compute budget you can search away from that subtree for a more optimal solution; if you don't find one, you already have the best one that works, and next time you search even more of the space. So essentially you get a processing speedup the more you use it.

30. ActorNightly ◴[] No.44334093{3}[source]
Think about this: if I give you GTA 5 traffic in single player with only NPC drivers, could you manually write a policy that gets a player from point A to point B in a car, assuming you have the in-game positions of all cars?