289 points | sandslash | 2 comments
jandrewrogers ◴[] No.44452056[source]
I appreciate the video and generally agree with Fei-Fei but I think it almost understates how different the problem of reasoning about the physical world actually is.

Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are solving a pretty classic (in the algorithmic information theory sense) sequential induction problem. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 by the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.
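
For a concrete flavour of why this is hard, consider the classic trick for squeezing 2-D positions into the 1-D order that sequence models and index structures want: a Morton / Z-order key. Even this well-known construction only preserves spatial locality some of the time (a toy sketch, illustrative only, not a solution to the problem above):

    def morton_key(x: int, y: int, bits: int = 16) -> int:
        """Interleave the bits of (x, y) into a single Z-order key."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
            key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
        return key

    # Nearby cells often map to nearby keys...
    print(morton_key(3, 4), morton_key(3, 5))  # 37 39
    # ...but neighbours that straddle a power-of-two boundary do not:
    print(morton_key(7, 0), morton_key(8, 0))  # 21 64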

I vigorously agree that the ability to reason about spatiotemporal dynamics is critical to general AI. But the computer science required is so different from classical AI research that I don’t expect any pure AI researcher to bridge that gap. The other aspect is that this area of research became highly developed over two decades but is not in the public literature.

One of the big questions I have had since they announced the company is who on their team is an expert in the dark state-of-the-art computer science for working around these particular problems. They risk running straight into the same deep, layered theory walls that almost everyone else has run into. I can’t identify anyone on the team who is an expert in a relevant area of computer science theory, which makes me skeptical to some extent. It is a nice idea but I don’t get the sense they understand the true nature of the problem.

Nonetheless, I agree that it is important!

replies(24): >>44452139 #>>44452178 #>>44452230 #>>44452351 #>>44452367 #>>44452546 #>>44452772 #>>44453124 #>>44453326 #>>44453374 #>>44453649 #>>44453761 #>>44454793 #>>44454983 #>>44455580 #>>44456088 #>>44456308 #>>44456958 #>>44457201 #>>44457288 #>>44458172 #>>44458959 #>>44460100 #>>44463896 #
dopadelic ◴[] No.44457201[source]
You're pointing out a real class of hard problems — modeling sparse, nonlinear, spatiotemporal systems — but there’s a fundamental mischaracterization in lumping all transformer-based models under “LLMs” and using that to dismiss the possibility of spatial reasoning.

Yes, classic LLMs (like GPT) operate as sequence predictors with no inductive bias for space, causality, or continuity. They're optimized for language fluency, not physical grounding. But vision and multimodal models like ViT, Flamingo, and Perceiver IO are a completely different lineage, even if they use transformers under the hood. They tokenize images (or video, or point clouds) into spatially aware embeddings and preserve positional structure in ways that make them far more suited to spatial reasoning than pure-text LLMs.
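
For concreteness, here is a minimal sketch of that kind of patch tokenization in PyTorch (the class name and sizes are illustrative, not the actual ViT implementation):

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Cut an image into patches, linearly project each patch, and add a
        learned positional embedding so the transformer knows where each
        token came from. (Illustrative sizes; not the actual ViT code.)"""
        def __init__(self, img_size=224, patch=16, dim=768):
            super().__init__()
            self.num_patches = (img_size // patch) ** 2
            # A stride=patch convolution is the standard one-op way to do
            # "split into patches and linearly project".
            self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))

        def forward(self, x):                           # x: (B, 3, H, W)
            tokens = self.proj(x)                       # (B, dim, H/p, W/p)
            tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
            return tokens + self.pos                    # position is preserved

    print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # (2, 196, 768)

The point is that every token carries an explicit notion of where in the image it came from, which a pure text tokenizer never has.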

The supposed “impedance mismatch” is real for language-only models, but that’s not the frontier anymore. The field has already moved into architectures that integrate vision, text, and action. Look at Flamingo's vision-language fusion, or GPT-4o’s real-time audio-visual grounding — these are not mere LLMs with pictures bolted on. These are spatiotemporal attention systems with architectural mechanisms for cross-modal alignment.
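
The basic mechanism behind that cross-modal alignment is cross-attention, where language tokens query visual tokens. A stripped-down sketch (Flamingo's real gated blocks are considerably more involved, and the shapes here are made up):

    import torch
    import torch.nn as nn

    class CrossModalBlock(nn.Module):
        """Language tokens attend to visual tokens: queries come from text,
        keys/values from image patches. Flamingo's gated blocks are more
        elaborate than this, but the alignment mechanism is the same idea."""
        def __init__(self, dim=768, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, text_tokens, image_tokens):
            fused, _ = self.attn(query=text_tokens,
                                 key=image_tokens,
                                 value=image_tokens)
            return self.norm(text_tokens + fused)       # residual + norm

    text = torch.randn(2, 32, 768)    # (batch, text length, dim)
    image = torch.randn(2, 196, 768)  # (batch, image patches, dim)
    print(CrossModalBlock()(text, image).shape)  # torch.Size([2, 32, 768])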

You're also asserting that "no general-purpose representations of space exist" — but this neglects decades of work in computational geometry, graphics, physics engines, and more recently, neural fields and geometric deep learning. Sure, no universal solution exists (nor should we expect one), but practical approximations exist: voxel grids, implicit neural representations, object-centric scene graphs, graph neural networks, etc. These aren't perfect, but dismissing them as non-existent isn’t accurate.
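
As a concrete example of one of those approximations, an implicit neural representation stores a scene as network weights that can be queried at arbitrary coordinates. A toy sketch (sizes and the Fourier encoding are illustrative, in the spirit of occupancy networks / NeRF-style fields):

    import torch
    import torch.nn as nn

    class NeuralField(nn.Module):
        """The 'data structure' for the scene is just MLP weights: query the
        network at any (x, y, z) and it returns a scalar such as occupancy
        or signed distance."""
        def __init__(self, hidden=128, freqs=8):
            super().__init__()
            self.freqs = freqs
            in_dim = 3 * 2 * freqs  # sin/cos Fourier features per axis
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, xyz):     # xyz: (N, 3) query coordinates
            feats = [f(xyz * (2.0 ** k)) for k in range(self.freqs)
                     for f in (torch.sin, torch.cos)]
            return self.mlp(torch.cat(feats, dim=-1))   # (N, 1)

    print(NeuralField()(torch.rand(1024, 3)).shape)  # torch.Size([1024, 1])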

Finally, your concern about who on the team understands these deep theoretical issues is valid. But the fact is: theoretical CS isn’t the bottleneck here — it’s scalable implementation, multimodal pretraining, and architectural experimentation. If anything, what we need isn’t more Solomonoff-style induction or clever data structures — it’s models grounded in perception and action.

The real mistake isn’t that people are trying to cram physical reasoning into LLMs. The mistake is in acting like all transformer models are LLMs, and ignoring the very active (and promising) space of multimodal models that already tackle spatial, embodied, and dynamical reasoning problems — albeit imperfectly.

replies(2): >>44457327 #>>44458719 #
calf ◴[] No.44457327[source]
How do we prove a trained LLM has no inductive bias for space, causality, etc.? We can't assume this is true by construction, can we?
replies(1): >>44457748 #
dopadelic ◴[] No.44457748[source]
Why would we need to prove such a thing? Human vision has strong inductive biases, which is why you can perceive objects in abstract patterns. This is why you can lie down at a park and see a duck in a cloud. It's also why we can create abstracted representations of things with graphics. Having inductive biases makes a model more relatable to the way we work.
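
To make "inductive bias" concrete, a toy demonstration (not specific to any model discussed here): a convolution is translation-equivariant by construction, before any training.

    import torch
    import torch.nn as nn

    # An untrained convolution already "knows" translation: shift the input
    # and the feature map shifts with it. That built-in assumption is an
    # inductive bias; nothing in the data had to teach it.
    torch.manual_seed(0)
    conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

    img = torch.randn(1, 1, 16, 16)
    shifted = torch.roll(img, shifts=2, dims=-1)   # move the image 2 px right

    out, out_shifted = conv(img), conv(shifted)
    # Away from the borders, the two responses are identical up to the shift.
    print(torch.allclose(torch.roll(out, shifts=2, dims=-1)[..., 3:-3],
                         out_shifted[..., 3:-3], atol=1e-5))  # True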

And again, you're using the term "LLMs" when the vision-based transformers in multimodal models aren't simply LLMs.

replies(1): >>44468642 #
calf ◴[] No.44468642[source]
You said that classic LLMs have no inductive bias for causality. So I am simply asking whether any computer scientist has actually proved that. Otherwise it is just a fancy way of saying "LLMs can't reason, they are just stochastic parrots", and AFAIK not every computer scientist shares that view. To use that claim is to potentially smuggle in an assumption that is not scientifically settled. That's why I specifically asked about this claim, which you made a few paragraphs into your response to the parent commenter.