I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but its not everything.
I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but its not everything.
To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."
Then a system samples from that distribution, typically with randomness, and there are some optimizations in running them that introduce randomness, but it's important to understand that the models themselves are not random.
https://arxiv.org/abs/2402.14903
You right to left tokenize in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digits groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) both independently noted that you can do this with just by modifying the pre-tokenization regex, without having to explicitly add commas.
It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.
but from time to time, doing this does require doing arithmetic correctly (to correctly add two exponents or whatever). so it would be nice to be able to trust that.
i imagine there are other uses for basic arithmetic too, QA applications over data that quotes statistics and such.
It sounds weird, but try writing your problem in LaTeX - I don’t know why, I’ve found a couple models to be incredibly capable at solving mathematical problems if you write them in LaTeX.
DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.
Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.
Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.
Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get wrong result. But if the inconsistent rule is complicated enough to be used only infrequently, you will still get correct result most of the time.
Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules that are applied to somewhat fuzzily known set of combinators (programs), each represented by a token (the term being represented by the embedding vector). However, the rules learned are not necessarily consistent (as it happens in the source data), so you get an occasional logical error (I don't want to call it hallucination because it's a different phenomenon from nondeterminism and extrapolation of LLMs).
This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well due to logical principle of explosion. It also clearly cannot be completely fixed with more training data.
(There is also analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... Pre-rigorous corresponds to soomewhat random set of likely inconsistent logical rules, rigorous to small set of obviously consistent rules, like only S and K, and post-rigorous to a large set of rules that have been vetted for consistency.)
What is the "solution" to this? Well, I think during training you somehow need to make sure that the transformer rules learned by the LLM are logically consistent for the strictly logical fragment of the human language that is relevant to logical and programming problems. Which is admittedly not an easy task (I doubt it's even possible within NN framework).
> "You're not understanding, are you? The brain does the thinking. The meat."
> "Thinking meat! You're asking me to believe in thinking meat!"
https://www.mit.edu/people/dpolicar/writing/prose/text/think...
To draw an a analogy, we've got our human brain specialized.
Why not implement a part of the AI brain that's not neural nets, but instead circuitry specialized to math?
Maybe a dumb question since I'm a layperson!
LLMs might never be able to crunch numbers reliably, however I expect they should be very good at identifying the right formula and the inputs for a problem ("i need the answer to x*y, where x=12938762.3 and y=902832.2332"). Then they can call a math engine (calculator or wolfram alpha or whatever) to do the actual computation. That's what humans do anyway!
The Tree growing comment was a reference to another comment earlier in the comment chain.
The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).
I'm not claiming that compute makes the m intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable
Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.
The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).
There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter
The bitter lesson is that general methods and a system that learns trumps trying to manually embed/program human knowledge into the system, so clever architecture is ok and expected.
UPD: Found the paper: - https://huggingface.co/papers/2502.09741 - https://fouriernumber.github.io/
in paper mentioned “number” is a single sort-of “token” with numeric value, so network dealing with numbers like real numbers, separately from char representation. All the math happens directly on “number value”. In majority of current models numbers are handled like sequences of chars
It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).
What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.
90's? It's mind blowing to me now.
My daily driver laptop is (internally) a Thinkpad T480, a very middle of the road business class laptop from 2018.
It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonably advice and write simple programs from a description. It notices my tone and tries to adapt its manner.
All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.
so basically reverse notation has the advantage of keeping magnitude of numbers (digits!) relative to each other constant (or at least anchored to the beginning of the number)
doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)