I'm curious whether this probing of nullability could be composed with other LLM/ML-based Python-typing tools to improve their accuracy.
Maybe even focusing on interfaces such as nullability, rather than precise types, would work better with a duck-typed language like Python than inferring types directly (i.e. we don't really care that a variable is an int specifically, but rather that it supports __add__, __sub__, and so on; that it is numeric).
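Something like typing.Protocol already lets you express that interface-first view; here's a minimal sketch (the class and function names are made up for illustration):

    from typing import Protocol

    class SupportsArithmetic(Protocol):
        # The interface we actually care about: not "is an int", just "is numeric enough".
        def __add__(self, other): ...
        def __sub__(self, other): ...

    def delta(start: SupportsArithmetic, end: SupportsArithmetic):
        # Accepts int, float, Decimal, numpy scalars, ... -- only the interface matters.
        return end - start

So a tool that infers "supports __add__/__sub__" rather than "is an int" would still give a checker something useful to verify.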
"LLM" is a valid category of thing in the world, but it's not a thing like Microsoft Outlook that has well-defined capabilities and limitations. It's frustrating reading these discussions that constantly devolve into one person saying they tried something that either worked or didn't, then 40 replies from other people saying they got the opposite result, possibly with a different model, different version, slight prompt altering, whatever it is.
LLMs possibly have the capability to understand nullability, but that doesn't mean every instance of every model will consistently understand that or anything else. This is the same way humans operate. Humans can run a 4-minute mile. Humans can run a 10-second 100 meter dash. Humans can develop and prove novel math theorems. But not all humans, not all the time; performance depends upon conditions, timing, and luck, and there has probably never been a single human who can do all three. It takes practice in one specific discipline to get really good at it, and this practice competes with or even limits other abilities. For LLMs, this manifests in how differently they respond depending on fine-tuning and on specific prompt sequences that should all be different ways of expressing the same command or query, yet nonetheless produce different results. This is very different from the way we are used to machines and software behaving.
Any comparison to the human brain is missing the point that an LLM only simulates one small part of it, and that's notably not the frontal lobe, which is required for intelligence, reasoning, self-awareness, etc.
So, no, it's not a question of philosophy. For an AI to enter that realm, it would need to be more than just an LLM with some bells and whistles; an LLM plus something else, perhaps, something fundamentally different that does not currently exist.
But that 1% is pretty important.
For example, they are dismal at math problems that aren't just slight variations of problems they've seen before.
Here's one by blackandredpenn where ChatGPT insisted its solution to a problem that could be solved by high school / talented middle school students was correct, even after attempts to convince it that it was wrong. https://youtu.be/V0jhP7giYVY?si=sDE2a4w7WpNwp6zU&t=837
Rewind earlier to see the real answer
The best way to predict the weather is to have a model which approximates the weather. The best way to predict the results of a physics simulation is to have a model which approximates the physical bodies in question. The best way to predict what word a human is going to write next is to have a model that approximates human thought.
Too many people these days are forgetting this key point and putting a dangerous amount of faith in ChatGPT etc. as a result. I've seen DOCTORS using ChatGPT for diagnosis. Ignorance is scary.
Please, I'm begging you, go read some papers and watch some videos about machine learning and how LLMs actually work. It is not "thinking."
I fully realize neural networks can approximate human thought -- but we are not there yet, and when we do get there, it will be something that is not an LLM, because an LLM is not capable of that -- it's not designed to be.
But even if we ignore that subtlety, it's not obvious that training a model to predict the next token doesn't lead to a world model and an ability to apply it. If you gave a human 10 physics books and told them that in a month they have a test where they have to complete sentences from the books, which strategy do you think is more successful: trying to memorize the books word by word, or trying to understand the content?
The argument that understanding is just an advanced form of compression far predates LLMs. LLMs clearly lack many of the faculties humans have. Their only concept of a physical world comes from text descriptions and stories. They have a very weird form of memory, no real agency (they only act when triggered), and our attempts at replicating an internal monologue are very crude. But understanding is one thing they may well have, and if the current generation of models doesn't have it, the next generation might.
When challenged, everybody becomes an eliminative materialist even if it's inconsistent with their other views. It's very weird.
I'd be really curious to see where the "attention" heads of the LLM look when evaluating the nullability of any given token. Does it just trust the Optional[int] return type signature of the function, or does it also skim through the function contents to understand whether that's correct?
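As a made-up example of what the second case would have to catch, consider a signature that over-promises nullability:

    from typing import Optional

    def find_port(config: dict) -> Optional[int]:
        # The annotation says this can return None, but the body never does:
        # a missing key falls back to a default instead.
        if "port" in config:
            return int(config["port"])
        return 8080

    port = find_port({})
    if port is not None:  # dead code if you trust the body, necessary if you trust only the signature
        print(port + 1)

A model that only trusts the annotation would always demand the None check; one that actually reads the body could tell the check is redundant here.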
It's fascinating to me to think that the senior developer skillset of being able to skim through complicated code, mentally make note of different tokens of interest where assumptions may need to be double-checked, and unravel that cascade of assumptions to track down a bug, is something that LLMs already excel at.
Sure, nullability is an example where static type checkers do well, and it makes the article a bit silly on its own... but there are all sorts of assumptions that aren't captured well by type systems. There's been a ton of focus on LLMs for code generation; I think that LLMs for debugging makes for a fascinating frontier.
We didn't "find" AI, we invented systems that some people want to call AI, and some people aren't convinced it meets the bar
It is entirely reasonable for people to realize we set the bar too low when it is a bar we invented
I believe that the type of understanding demonstrated here doesn't. Consciousness only comes into play when we become aware that such understanding has taken place, not in the process itself.
Who is "we"?
I think of "AI" as a pretty all-encompassing term. ChatGPT is AI, but so is the computer player in the 1995 game Command and Conquer, among thousands of other games. Heck, I might even call the ghosts in Pac-man "AI", even if their behavior is extremely simple, predictable, and even exploitable once you understand it.
It would be good to list a few possible ways of interpreting 'understanding of code'. It could possibly include: 1) type inference for the result, 2) nullability, 3) runtime asymptotics, 4) what the code does.
The goalposts already differentiated between "totally human-like" and "actually conscious".
See also Philosophical Zombie thought experiment from the 70s.
I know plenty of teachers who would describe their students the exact same way. The difference is mostly one of magnitude (of delta in competence), not quality.
Also, I think it's important to note that by "could be solved by high school / talented middle school students" you mean "specifically designed to challenge the top ~1% of them". Because if you say "LLMs only manage to beat 99% of middle schoolers at math", the claim seems a whole lot different.
The current LLM anthropomorphism may soon be known as the silicon Turk: managing to make people think they're AI.
They literally believe that the AI will supersede the scientific process. It's crypto shit all over again.
I think it will be very similar in architecture.
Artificial neural networks are already approximating how neurons in a brain work; it's just at a scale that's several orders of magnitude smaller.
Our limiting factor for reaching brain-like intelligence via ANN is probably more of a hardware limitation. We would need over 100 TB to store the weights for the neurons, not to mention the ridiculous amount of compute to run it.
Anything less than that, meh.
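Rough back-of-envelope behind that 100 TB figure, assuming ~100 trillion synapses and one byte per synaptic "weight" (both big simplifications):

    synapses = 100e12         # ~100 trillion synapses, a commonly cited rough estimate
    bytes_per_weight = 1      # one byte per weight is almost certainly an underestimate
    print(f"{synapses * bytes_per_weight / 1e12:.0f} TB")  # -> 100 TB, so "over 100 TB" is a floor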
We don’t even have a good way to quantify human ability. The idea that we could suddenly develop a technique to quantify human ability because we now have a piece of technology that would benefit from that quantification is absurd.
That doesn’t mean we shouldn’t try to measure the ability of an LLM. But it does mean that the techniques used to quantify an LLM’s ability are not something that can be applied to humans outside of narrow focus areas.
Call it AI, call it LLMs, whatever.
Just as long as we continue to recognize that it is a tool that humans can use, and don't start trying to treat it as a human, or as a life, I won't complain.
I'm saving my anger for when idiots start to argue that LLMs are alive and deserve human rights.
Tony Hoare called it "a billion-dollar mistake" (https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retra...); Rust made core design choices precisely to avoid this mistake.
In practical AI-assisted coding in TypeScript, I have found that it is good to add a Cursor Rule to avoid anything nullable unless it is a deliberate design choice. In my experience, it makes the code much better.
You can say this is not 'real' understanding, but you, like many others, will be unable to clearly distinguish this 'fake' understanding from 'real' understanding in a verifiable fashion, so you are just playing a game of meaningless semantics.
You really should think about what kind of difference is supposedly so important yet will not manifest itself in any testable way - an invented one.
Look, I’m no stranger to love. You know the rules and so do I… you can’t find this conversation with any other guy.
But since the parent was making a meta commentary on this conversation I’d like to introduce everyone here as Kettle to a friend of mine known as #000000
From: ‘Keep training it, though, and eventually it will learn to insert the None test’
To: ‘Keep training it, though, and eventually the probability of inserting the None test goes up to xx%’
The former is just horse poop; we all know LLMs generate big variance in output.
The LLM will not understand, and is incapable of developing an understanding, of a concept not present in its training data.
If you try to teach it the basics of the misunderstood concept in your chat, it will reflect back a verbal acknowledgement, restated in different words, with some smoothly worded embellishments which look like the external trappings of understanding. It's only a mirage, though.
The LLM will code anything, no matter how novel, if you give it detailed enough instructions and clarifications. That's just a language translation task from pseudo-code to code. Being a language model, it's designed for that.
An LLM is like the bar waiter who has picked up on economics and politics talk and is able to interject with something clever-sounding, to the surprise of the patrons. Gee, how do they understand the workings of the International Monetary Fund, and what the hell are they doing working in this bar?
But in TypeScript, who cares? You’d be forced to handle null the same way you’d be forced to handle Maybe<T> = None | Just<T>, except with extra, unidiomatic ceremony in the latter case.
Nobody interrogates each other's internal states when judging whether someone understands a topic. All we can judge it based on are the words they produce or the actions they take in response to a situation.
The way that systems or people arrive at a response is sort of an implementation detail that isn't that important when judging whether a system does or doesn't understand something. Some people understand a topic on an intuitive, almost unthinking level, and other people need to carefully reason about it, but they both demonstrate understanding by how they respond to questions about it in the exact same way.
https://chatgpt.com/share/67f40cd2-d088-8008-acd5-fe9a9784f3...
Double descent?
Because code LLMs have been trained on the syntactic form of the program and not its execution, it's not correct — even if the correlation between variable annotations and requested completions was perfect (which it's not) — to say that the model "understands nullability", because nullability means that under execution the variable in question can become null, which is not a state that it's possible for a model trained only on a million programs' syntax to "understand". You could get the same result if e.g. "Optional" means that the variable becomes poisonous and checking "> 0" is eating it, and "!= None" is an antidote. Human programmers can understand nullability because they've hopefully run programs and understand the semantics of making something null.
The paper could use precise, scientific language (e.g. "the presence of nullable annotation tokens correlates to activation of vectors corresponding to, and emission of, null-check tokens with high precision and accuracy") which would help us understand what we can rely on the LLM to do and what we can't. But it seems like there is some subconscious incentive to muddy how people see these models in the hopes that we start ascribing things to them that they aren't capable of.
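To make the execution-semantics point concrete, here's a minimal sketch (the function is invented for illustration) of what nullability means at runtime, as opposed to the syntactic habit of emitting a check:

    from typing import Optional

    def parse_age(text: str) -> Optional[int]:
        # Under execution, "Optional" means the value really can become None.
        return int(text) if text.isdigit() else None

    age = parse_age("not a number")
    if age is not None:  # without this check, the next line raises TypeError when age is None
        print(age > 0)

Emitting "if age is not None" is the pattern the model has seen a million times; knowing why it's there requires knowing what happens when the line below runs with None.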
What makes you think this? It has been trained on plenty of logs and traces, discussion of the behavior of various code, REPL sessions, etc. Code LLMs are trained on all human language and wide swaths of whatever machine-generated text is available, they are not restricted to just code.
Judging someone's external "involuntary cues" is not interrogating their internal state. It is, as you said, judging their response (a synonym for "answer") - and that judgment is also highly imperfect.
(It's worth noting that focusing so much on someone's body language and tone that you ignore the actual words they said is a communication issue associated with not being on the spectrum, or with being too allistic.)
One of the very first tests I did of ChatGPT, way back when it was new, was to give it a relatively complex string manipulation function from our code base, strip all identifying material from the code (variable names, the function name itself, etc.), and then provide it with inputs and ask it for the outputs. I was surprised that it could correctly generate the output from the input.
So it does have some idea of what the code actually does, not just its syntax.
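Not the actual function from our code base, obviously, but the flavor of the test was something like this: strip the identifiers, hand the model an input, and ask for the output:

    def f(a):
        # Identifiers stripped; the model only sees structure, not intent.
        b = ""
        for c in a.split("-"):
            b += c[::-1].upper() + "."
        return b.rstrip(".")

    print(f("ab-cd"))  # -> "BA.DC"

Getting that right requires tracing what the code does, not just recognizing what it looks like.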
A human would probably say "I don't know how to solve the problem". But the free version of ChatGPT is confidently wrong...
The critiques of mental state applied to the LLM's are increasingly applicable to us biologicals, and that's the philosophical abyss we're staring down.
I know how LLMs work; so let me beg you in return, listen to me for a second.
You have a theoretical-only argument: LLMs do text prediction, and therefore it is not possible for them to actually think. And since it's not possible for them to actually think, you don't need to consider any other evidence.
I'm telling you, there's a flaw in your argument: In actuality, the best way to do text prediction is to think. An LLM that could actually think would be able to do text prediction better than an LLM that can't actually think; and the better an LLM is able to approximate human thought, the better its predictions will be. The fact that they're predicting text in no way proves that there's no thinking going on.
Now, that doesn't prove that LLMs actually are thinking; but it does mean that they might be thinking. And so you should think about how you would know if they're actually thinking or not.
It would be fun if they also "cancelled the nullability direction"... the LLMs would probably start hallucinating new explanations for what is happening in the code.
The code is entirely wrong. It validates something that's close to a NANP number but isn't actually a NANP number. In particular, the area code cannot start with 0, nor can the central office code. There are several numbers, like 911, which have special meaning and cannot appear in either position (rough sketch of the stricter check below).
You'd get better results if you went to Stack Overflow and stole the correct answer yourself. Would probably be faster too.
This is why "non technical code writing" is a terrible idea. The underlying concept is explicitly technical. What are we even doing?
Certainly many ancient people worshiped celestial objects or crafted idols by their own hands and ascribed to them powers greater than themselves. That doesn't really help in the long run compared to taking personal responsibility for one's own actions and motives, the best interests of their tribe or community, and taking initiative to understand the underlying cause of mysterious phenomena.
They are not reasoning.
I would go much further than this; but this is a de minimis criterion that the LLM already fails.
What zealots eventually discover is that they can hold their "fanatical proposition" fixed in the face of all opposition to the contrary, by tearing down the whole edifice of science, knowledge, and reality itself.
If you wish to assert, against any reasonable thought, that the sky is a pink dome, you can do so -- first by claiming that our eyes are broken, and then, eventually, that we live in some paranoid "philosophical abyss" carefully constructed to permit your paranoia.
This absurdity is exhausting, and I wish one day to find fanatics who'd realise it quickly and abate it -- but alas, I never have.
If you find yourself hollowing out the meaning of words to the point of making no distinctions, denying reality to reality itself, and otherwise arriving at a "philosophical abyss", be aware that it is your cherished propositions which are the madness and nothing else.
Here: no, the LLM does not understand. Yes, we do. It is your job to begin from reasonable premises and abduce reasonable theories. If you do not, you will not.
Of course, now you can say "how do you know that our brains are not just efficient computers that run LLMs", but I feel like the onus of proof lies on the makers of this claim, not on the other side.
It is very likely that human intelligence is not just autocomplete on crack, given all we know about neuroscience so far.
So, what word would you propose we use to mean "an LLM's ability (or lack thereof) to output generally correct sentences about the topic at hand"?
How does the brain compute the weights, then? Or maybe your assumption that the brain is equivalent to a mathematical NN is wrong?
I'm having a great experience using Cursor, but I don't feel like trying to overhype it; it just makes me tired to see all this hype. It's a great tool, it makes me more productive, nothing beyond that.
The initial LLMs simply lied about everything. If you happened to know something, it was rather shocking, but for topics you knew nothing about you got a rather convincing answer. Then the arms race began, and now the lies are so convincing we are at viable robot overlords.
Here's the Stanford Encyclopedia of Philosophy's superb write up: https://plato.stanford.edu/entries/chinese-room/ It covers, engagingly and with playful wit, not just Searle's original argument but its evolution in his writing, other philosopher's responses/criticisms, and Searle's counter-responses.
LLMs are perfectly capable of predicting the behavior of programs. You don't have to take my word for it; you can test it yourself. So he gave modal conditions they already satisfy. Can I conclude they understand now?
>If you find yourself hollowing out the meaning of words to the point of making no distinctions, denying reality to reality itself, and otherwise arriving at a "philosophical abyss", be aware that it is your cherished propositions which are the madness and nothing else.
The only people denying reality are those who insist that it is not 'real' understanding and yet cannot distinguish this 'real' from 'fake' property in any verifiable manner, the very definition of an invented property.
Your argument boils down to 'I think it's so absurd so it cannot be so'. That's the best you can do? Do you genuinely think that's a remotely convincing argument?
Of course, ask it a PhD level question and it will confidently hallucinate more than Beavis & Butthead.
It really is a damn glorified autocomplete, unfortunately very useful as a search engine replacement.
It's very clear that LLMs lack understanding when you use them for anything remotely sophisticated. I say this as someone who leverages them extensively on a daily basis - mostly for development. They're very powerful tools and I'm grateful for their existence and the force multiplier they represent.
Try to get one to act as a storyteller and the limitations in understanding glare out. You try to goad some creativity and character out of it and it spits out generally insipid recombinations of obvious tropes.
In programming, I use AI strictly as an auto-complete extension. Even in that limited context, the latest models make trivial mistakes in certain circumstances that reveal their lack of understanding. The ones that stand out are the circumstances where the local change to make is very obvious and simple, but the context of the code is something that the ML hasn't seen before.
In those cases, I see them slapping together code that's semantically wrong in the local context, but pattern matches well against the outer context.
It's very clear that the ML doesn't even have a SIMPLE understanding of the language semantics, despite having been trained on presumably multiple billions of lines of code from all sorts of different programming languages.
If you train a human against half a dozen programming languages, you can readily expect by the end of that training that they will, all by themselves, simply through mastering the individual languages, have constructed their own internal generalized models of programming languages as a whole, and would become aware of some semantic generalities. And if I had asked that human to make that same small completion for me, they would have gotten it right. They would have understood that the language semantics are a stronger implicit context compared to the surrounding syntax.
MLs just don't do that. They're very impressive tools, and they are a strong step forward toward some machine model of understanding (sophisticated pattern matching is likely a fundamental prerequisite for understanding), but ascribing understanding to them at this point is jumping the gun. They're not there yet.
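A contrived illustration of that failure mode (not the actual code involved, just the shape of it): the surrounding context is full of one pattern, while the locally correct completion follows the language semantics instead:

    class Point:
        @property
        def x(self):
            return self._x

        @property
        def y(self):
            return self._y

        @x.setter
        def x(self, value):
            self._x = value  # a purely pattern-matching completion tends to offer "return self._x" here

    p = Point()
    p.x = 3
    print(p.x)  # 3

The outer context screams "return self._something", but the semantics of a setter demand an assignment.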
"Nobody interrogates each other's internal states when judging whether someone understands a topic. All we can judge it based on are the words they produce or the actions they take in response to a situation."
I'm fairly sure I wrote something that contradicts these two sentences.
You can commonly figure out whether people are leaking information about how they feel and what emotional and cognitive states they are in, even in text based communication. To some extent you can also decide the internal state of another person, and it's often easier to do with text than in direct communication, i.e. two bodies in the same space.
This idea that people aren't malleable and possible to interrogate without them noticing would be very surprising to anyone making a living from advertising.
2. LLMs regularly tell me if what I'm asking for is possible or not. I'm not saying they're always correct, but they seem to have at least some sense of what's in the realm of possibility.
You're mistaking the shadow on the cave wall for the thing itself. Inference is not the same as observation.
For the same reason that polygraphs are not lie detectors, the body language that you're convinced is "leaking" is not actually internal state.
No, but we know how it works and it is just a stochastic parrot. There is no magic in there.
What is more surprising to me is that humans are so predictable that a glorified autocomplete works this well. Then again, it's not that surprising...
It might come as a surprise, but everything you've ever experienced is just shadows on the cave wall that is your cerebral cortex. You've never ever been outside of that cave and have always been at the mercy of a mammal brain preparing and projecting those shadows for you.
But even if we knew how something works (which in the present case we don't), that shouldn't diminish our opinion of it. Will you have a lesser opinion of human intelligence once we figure out how it works?
LLMs are cool, and a lot of people find them useful. Hype bros are full of crap, and there's no point arguing with them because it's always a pointless discussion. With crypto and NFTs it's future predictions, which are just inherently impossible to reason about; with AI it's partially that, and partially the whole "do they have human properties" thing, which is equally impossible to reason about.
It gets discussed to death every single day.
We do know how LLMs work, correct? We also know what they are capable of and what not (of course this line is often blurred by hype).
I am not an expert at all on LLMs or neuroscience. But it is apparent that having a discussion with a human vs. with an LLM is a completely different ballpark. I am not saying that we will never have technology that can "understand" and "think" like a human does. I am just saying, this is not it.
Also, just because a lot of progress in LLMs has been made in the past 5 years doesn't mean we can just extrapolate future progress from it. Local maxima and technology limits are a thing.
No, because even those 'sophisticated' examples still get very non-trivial attempts. If I were to use the same standard of understanding we ascribe to humans, I would rarely class LLMs as having no understanding of a topic. Understanding does not mean perfection or the absence of mistakes, except in fiction and our collective imaginations.
>Try to get one to act as a storyteller and the limitations in understanding glare out. You try to goad some creativity and character out of it and it spits out generally insipid recombinations of obvious tropes.
I do, and creativity is not really the issue with some of the new SOTA models. I mean, I understand what you are saying - default prose often isn't great, and every single model besides 2.5-pro cannot handle details/story instructions for longform writing without essentially collapsing - but it's not really creativity that's the problem.
>The ones that stand out are the circumstances where the local change to make is very obvious and simple
Obvious and simple to you, maybe, but with auto-complete the context the model actually has is dubious at best. It's not like Copilot is pasting in all the code in 10 files if you have 10 files open. What actually gets into the context of auto-complete is largely beyond your control, with no way to see what makes the cut and what doesn't.
I don't use auto-complete very often. For me, it doesn't compare to pasting in relevant code myself and asking for what I want. We have very different experiences.
2. That is not my experience. Almost half of the time the LLM gives a wrong answer without any warning. It is up to me to check correctness. Even if I follow up, it often continues to give wrong answers.
It already has:
type value = string | null
People are pretty quiet about creative pursuits actually being the low hanging fruit on the AI tree.
We do have proofs that hallucination will always be a problem. We have proofs that the "reasoning" of models that "think" is actually regurgitation of human explanations written out. When asked to do truly novel things, the models fail. When asked to do high-precision things, the models fail. When asked to do high-accuracy things, the models fail.
LLMs don't understand. They are search engines. We are experience engines, and philosophically, we don't have a way to tokenize experience, we can only tokenize its description. So while LLMs can juggle descriptions all day long, these algorithms do so disconnected from the underlying experiences required for understanding.
1. Multi-step reasoning with backtracking when DeepSeek R1 was trained via GRPO.
2. Translation of languages they haven't even seen via in-context learning.
3. Arithmetic: heavily correlated with model size, but it does appear.
I could go on.
Granted, it's not an LLM but a deep learning model trained via RL; still, would you say that AlphaGo's move 37 also doesn't count as emergence and that the model has no understanding of Go?
NO! We have working training algorithms. We still don't have a complete understanding of why deep learning works in practice, and especially not why it works at the current level of scale. If you disagree, please cite me the papers because I'd love to read them.
To put it another way: just because you can breed dogs, it doesn't necessarily mean that you have a working theory of genes, or even that you know they exist. Which was actually the human condition for most of history.
2. What are you arguing about? I didn't say they're always correct, obviously.
To translate it to your analogy with dogs: we do know how the anatomy of dogs works, but we do not know why they sometimes fetch the stick and sometimes don't.