cheesecompiler:
The reverse is possible too: throwing massive compute at a problem can mask the existence of a simpler, more general solution. General-purpose methods tend to win out over time—but how can we be sure they’re truly the most general if we commit so hard to one paradigm (e.g. LLMs) that we stop exploring the underlying structure?
logicchains:
We can get that assurance from computational-theoretic analysis, e.g. https://arxiv.org/abs/2503.03961 and https://arxiv.org/abs/2310.07923. This kind of analysis tells us which classes of problems a given model can solve, and sufficiently deep transformers with chain of thought have been shown to be theoretically capable of solving a very large class of problems.
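
To see the intuition behind those chain-of-thought results, here is a toy sketch in Python (my illustration, not the construction from either paper). A fixed-depth transformer does a bounded amount of sequential work per forward pass, but emitting intermediate tokens and feeding them back in effectively gives the model a loop. Parity of an n-bit string is a standard example in this literature: difficult for a fixed-depth transformer in a single pass, trivial with n scratchpad steps.

    # Toy model of chain of thought as externalized iteration.
    # 'step' stands in for one constant-size forward pass; the emitted
    # trace plays the role of scratchpad tokens fed back as input.

    def step(state: int, bit: int) -> int:
        """One decoding step: a constant-size computation (here, XOR)."""
        return state ^ bit

    def parity_with_cot(bits: list[int]) -> list[int]:
        """Apply the constant-size step n times, recording each intermediate."""
        trace = [0]
        for b in bits:
            trace.append(step(trace[-1], b))
        return trace  # the last element is the answer

    bits = [1, 0, 1, 1, 0, 1]
    print(parity_with_cot(bits))  # [0, 1, 1, 0, 1, 1, 0] -> parity 0

The point is only that n cheap sequential steps can compute things a single bounded-depth pass cannot; the papers make this precise in circuit-complexity terms.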
tsimionescu:
Note that these theorems only show that there exists a transformer that can solve these problems; they say nothing about whether there is any way to train that transformer using gradient descent from data, and even if there is, they don't tell you how much data, and of what kind, you would need to train it on.
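
To make that gap concrete, a minimal sketch (again my illustration, not anything from the papers): writing down weights that solve a task is an existence proof of exactly this kind, and it says nothing about whether gradient descent would find them. The two-layer ReLU network below computes XOR exactly with hand-picked weights; whether training from a random initialization recovers an equivalent solution, and from how much data, is a separate question.

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    # Existence proof by construction: a 2-2-1 ReLU network for XOR.
    # Hidden units compute relu(a + b) and relu(a + b - 1);
    # the output is relu(a + b) - 2 * relu(a + b - 1).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    w2 = np.array([1.0, -2.0])

    def xor_net(a, b):
        h = relu(np.array([a, b]) @ W1 + b1)
        return h @ w2

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))  # 0 0 0.0 / 0 1 1.0 / 1 0 1.0 / 1 1 0.0

The expressivity theorems are of this "weights exist" form; learnability by gradient descent from realistic data is a separate, and largely open, question.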