Thank you for this informative and thoughtful post. An interesting twist on the error accumulation that autoregressive models suffer as outputs grow longer is the recent success of language diffusion models, which predict multiple tokens simultaneously. They apply a remasking strategy at every step of the reverse process, masking low-confidence tokens so they can be re-predicted. Regardless, your observations may still apply. https://arxiv.org/pdf/2502.09992
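To make the remasking idea concrete, here's a toy sketch (not the paper's actual algorithm; the `predict` function is a hypothetical stand-in for a real diffusion language model): at each reverse step, every masked slot gets a prediction, but only the most confident fraction is kept, while the rest are remasked for the next step.

```python
import random

MASK = "<mask>"

def predict(tokens):
    # Stand-in "model": invent a token with a random confidence for each
    # masked slot; already-filled slots keep their token at confidence 1.0.
    return [(t, 1.0) if t != MASK else (f"tok{i}", random.random())
            for i, t in enumerate(tokens)]

def remask_step(tokens, keep_ratio):
    """One reverse-diffusion step: predict every masked slot, then
    remask the lowest-confidence predictions for the next step."""
    preds = predict(tokens)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # Keep only the most confident fraction of the fresh predictions.
    n_keep = max(1, int(len(masked) * keep_ratio)) if masked else 0
    by_conf = sorted(masked, key=lambda i: preds[i][1], reverse=True)
    keep = set(by_conf[:n_keep])
    return [preds[i][0] if (t != MASK or i in keep) else MASK
            for i, t in enumerate(tokens)]

seq = [MASK] * 8
while MASK in seq:
    seq = remask_step(seq, keep_ratio=0.5)
print(seq)  # all slots filled, highest-confidence predictions locked in first
```

The point of the sketch is just the control flow: tokens are committed in confidence order across several parallel steps, rather than strictly left to right, which is why the usual autoregressive error-compounding argument changes shape here.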