An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.
An LLM is doing exactly nothing while waiting for the next prompt.
I think the thing you were looking for was more along the lines of a persistent autonomous agent.
And telling me "just do both" is enforcing your world view and that is precisely what we're talking about _not_ doing.
I see thinking as less about "timing" and more about a "process"
What this post seems to be describing is more about where attention is paid and what neurons fire for various stimuli
HN is often characterized by a very negative tone related to any of these developments, but I really do feel that Anthropic is trying to do a “race to the top” in terms of alignment, though it doesn’t seem like all the other major companies are doing enough to race with them.
Particularly frustrating on HN is the common syllogism: 1. Anything that “thinks” must do X. 2. LLMs don't do X. 3. Therefore LLMs don't think.
X is usually poorly justified as constitutive of thinking (it's usually constitutive of human thinking, but not of thinking writ large), nor is it explained why it matters whether the label of “thinking” applies to an LLM if the capabilities remain the same.
Frankly this objection seems very weak
If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression—and within that, a representation of the world around us—then dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.
I don’t mean to take this canonically, it’s representations all the way down, but I can’t help but wonder what the geometry of this dictionary concept space says about us.
Consider a situation where you are teaching a child. She tries her best and makes a mistake on her math homework. Saying that her attempt was terrible because an adult could do better may be the "fullest truth" in the most eye-rolling banal way possible, and discourages her from trying in the future which is ultimately unproductive.
This "fullest truth" argument fails to take into account desire and motivation, and thus is a bad model of the truth.
>Used "dictionary learning"
>Found abstract features
>Found similar/close features using distance
>Tried amplifying and suppressing features
Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
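To make "mundane" concrete: the core recipe really is just a sparse autoencoder over middle-layer activations. A minimal sketch, not Anthropic's actual setup (the dimensions, the L1 penalty, and the random stand-in activations are all assumptions):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Over-complete dictionary: d_model activations -> n_features sparse codes."""
        def __init__(self, d_model=4096, n_features=65536):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))   # sparse, non-negative feature codes
            recon = self.decoder(features)
            return recon, features

    # acts would really be middle-layer activations captured from the LLM;
    # random data here just keeps the sketch self-contained.
    acts = torch.randn(1024, 4096)
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    for step in range(100):
        recon, feats = sae(acts)
        loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()

"Close" features could then be found by cosine distance between decoder columns, and amplifying/suppressing a feature just means scaling its code before decoding.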
An LLM has no goals - it's just a machine optimized to minimize training errors, although I suppose you could view this as an innate hard-coded goal of minimizing next-word error (relative to the training set), in the same way we might say a machine-like insect has some "goals".
Of course RLHF provides an error to minimize over a longer time span (the entire response vs. the next word), but I doubt the training volume is enough for the model to internally model a goal of manipulating the listener, as opposed to just favoring surface forms of response.
This is currently done with multiple LLMs and calls, not within the running of a single model i/o
Another example would be to input a single token or gibberish; the models we have today are more than happy to spit out fantastic numbers of tokens. They really only stop because we look for the stop tokens they are trained to generate, and we perform the actual stopping action.
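The "we do the actual stopping" part is visible in any hand-rolled decode loop; a toy sketch with a small stand-in model:

    # Toy decode loop: the model never "decides" to stop; the harness does,
    # by watching for an end-of-sequence token or hitting a length cap.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("Hello", return_tensors="pt").input_ids
    for _ in range(200):                      # hard length cap lives out here
        logits = model(ids).logits[0, -1]
        next_id = torch.argmax(logits).item()
        if next_id == tok.eos_token_id:       # *we* check for the stop token
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

    print(tok.decode(ids[0]))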
it’s fine though, this was as productive as i expected
Still, what current LLMs are doing with their fixed rules is only a very limited form of reasoning, since they just use a fixed N steps of rule application to generate each word. People are looking to techniques such as "group of experts" prompting to improve reasoning: step-wise generate multiple responses, then evaluate them and proceed to the next step.
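A rough sketch of that generate-several-evaluate-proceed pattern (generate() and score() are placeholders for real model/judge calls, not any particular library):

    import random

    # Step-wise "generate N candidates, score them, keep the best".
    def generate(prompt, n):
        return [f"{prompt} ... candidate {i}" for i in range(n)]   # really: n sampled continuations

    def score(prompt, candidate):
        return random.random()                                     # really: a judge model's rating

    def solve(problem, steps=3, n=5):
        state = problem
        for _ in range(steps):
            candidates = generate(state, n)
            best = max(candidates, key=lambda c: score(state, c))  # evaluate the candidates
            state = best                                           # proceed to the next step from the winner
        return state

    print(solve("Prove that the sum of two even numbers is even."))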
but Karpathy was looking at very simple LSTMs of 1-3 layers, examining individual nodes/cells, and these results have so far been difficult to replicate in large-scale transformers. Karpathy also doesn’t provide a recipe for doing this in his paper, which makes me think he was just guessing and checking various cells. The representations discovered are very simple.
While they're concerned with safety, I'm much more interested in this as a tool for controllability. Maybe we can finally get rid of the woke customer service tone, and get AI to be more eclectic and informative, and less watered down in its responses.
It's an interesting window on people's intuitions -- this pattern felt surprising and alien now to someone who imbibed Hofstadter and Dennett, etc., as a teen in the 80s.
(TBC, the surprise was not that people weren't sure they "think" or are "conscious", it's that they were sure they aren't, on this basis that the program is not running continually.)
I'm listing things that current LLMs cannot do (or things they do that thinking entities would not) to argue they are so simple they are far from anything that resembles thinking
> it’s fine though, this was as productive as i expected
A product of your replies dropping in quality and becoming more argumentative, so I will discontinue now.
I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.
One such example: The Internal State of an LLM Knows When It's Lying (https://arxiv.org/abs/2304.13734)
Searching phrases like "llm interpretability" and "llm activation analysis" uncovers more
Damage part X of the network and see what happens. If the subject loses the ability to do Y, then X is responsible for Y.
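In model terms that's an ablation/lesion study; a crude sketch (the choice of layer and of zeroing the MLP output is arbitrary) with a small stand-in model:

    # Crude "lesion" experiment: zero one MLP's output and see how behaviour changes.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = tok("The Eiffel Tower is located in", return_tensors="pt").input_ids

    def top_token():
        with torch.no_grad():
            return tok.decode(model(prompt).logits[0, -1].argmax().item())

    print("intact:", top_token())

    # "Damage part X": zero the output of layer 5's MLP via a forward hook.
    hook = model.transformer.h[5].mlp.register_forward_hook(lambda m, i, o: torch.zeros_like(o))
    print("lesioned:", top_token())
    hook.remove()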
Current LLMs have none of that - they are just the fixed set of rules, further limited by also having a fixed number of steps of rule application.
An LLM has no innate traits such as curiosity or boredom to trigger exploration, and anyways no online/incremental learning mechanism to benefit from it even if it did.
(drop-out was found to increase resilience in models because they had to encode information in the weights differently, i.e. could not rely on a single neuron (at the limit))
I’m so fascinated by this stuff but I’m having trouble staying motivated in this short attention span world.
I’m a neophyte, so take this as such. If we can agree that people's output is not always the product of thinking, then I’d be more willing to accept computational innovations as thought-like.
But simply by approximating human communication which often models goal oriented behavior, an LLM can have implicit goals. Which likely vary widely according to conversation context.
Implicit goals can be very effective. Nowhere in DNA is there any explicit goal to survive. However combinations of genes and markers selected for survivability create creatures with implicit goals to survive as tenacious as any explicit goals might be.
The effect is as if you had multiple people playing a game where they each extend a sentence by taking turns adding a word to it, but there is zero continuity from one word to the next because each person is starting from scratch when it is their turn.
When you look at a specific input, you can look to see what gets activated or not. These are orthogonal but related ideas for inspecting the activations to see effects.
What do you mean? They get to access their previous hidden states in the next greedy decode using attention, it is not simply starting from scratch. They can access exactly what they were thinking when they put out the previous word, not just reasoning from the word itself.
The prompt itself can trigger the features, so if you say "Try to weave in mentions of San Francisco" the San Francisco feature will be more activated in the response. But having a global equalizer could reduce drift as the conversation continued, perhaps?
But that's exactly what I'm saying - the model has access to what it was thinking when it generated the previous words, it does not start from scratch. If you don't have the KV cache, you still have to regenerate what it was thinking from the previous words so on the next word generation you can look back at what you were thinking from the previous words. Does that make sense? I'm not great at talking about this stuff in words
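Maybe code says it better: the "access to what it was thinking" is just the KV cache being handed back in on every step. A sketch with a small stand-in model:

    # Each new token attends to the cached keys/values from every previous step,
    # i.e. the model sees its own prior internal states, not just the prior words.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France", return_tensors="pt").input_ids
    past = None
    for _ in range(10):
        out = model(ids if past is None else ids[:, -1:], past_key_values=past, use_cache=True)
        past = out.past_key_values              # keys/values from everything "thought" so far
        next_id = out.logits[0, -1].argmax().item()
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

    print(tok.decode(ids[0]))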
Indeed; to me LLMs pattern match (yes, I did spot the irony) to system-1 thinking, and they do a better job of that than we humans do.
Fortunately for all of us, they're no good at doing system-2 thinking themselves, and only mediocre at translating problems into a form which can be used by a formal logic system that excels at system-2 thinking.
Given how often China comes up in the context of AI, I'm wondering: Lots of people in the West treat China as mysterious and alien. I wonder how true that really is (e.g. Confucianism)? Or if it ever was (e.g. perhaps it used to be before industrialisation, which homogenises everyone regardless of the origin)?
Imagine taking Claude, tweaking weights relevant to X and then fine tuning it on knowledge related to X. It could result in more neurons being recruited to learn about X.
Imagine performing this during training to amplify or reduce the importance of certain topics. Train it on a vast corpus, but tune at various checkpoints to ensure the neural network's knowledge distribution skews. This could be a way to get more performance from MoE models.
I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong because I'd be fascinated to hear the reasons.
There will be some overlap in what the model is now "thinking" (and has calculated from scratch) since the new prompt is one possible continuation of the previous one, but other things it was previously "thinking" will no longer be there.
e.g. Say the prompt was "the man", and output probabilities include "in" and "ran", reflecting the model thinking of potential continuations such as "the man in the corner" and "the man ran for mayor". Suppose the word sampled was "ran", so now the new prompt is "the man ran". Possible continuations can no longer include refining who the subject is, since the new word "ran" implies the continuation must now be an action.
There is some work that has been saved, per the KV cache, in processing the new prompt, but that is only things (self attention among the common part of the two prompts) that would not change if recalculated. What the model is thinking has changed, and will continue to change depending on the next sampled continuation ("the man ran for mayor", "the man ran for cover", "the man ran his bath", etc).
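You can watch that shift directly; a tiny sketch (a small stand-in model, so the exact probabilities are illustrative only) comparing the next-word distributions for the two prompts:

    # The next-word distribution after "The man" differs from the one after
    # "The man ran"; only the shared-prefix work is reusable via the KV cache.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    for prompt in ["The man", "The man ran"]:
        logits = model(tok(prompt, return_tensors="pt").input_ids).logits[0, -1]
        top = torch.topk(torch.softmax(logits, dim=-1), 5)
        words = [tok.decode(i.item()) for i in top.indices]
        print(prompt, "->", list(zip(words, [round(p.item(), 3) for p in top.values])))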
My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there isn't any? Or to put it another way, were the features always there, learnt by training, or are we doing post-hoc rationalisation -- where the features exist because that's how we defined the autoencoders' dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?
Maybe this distinction doesn't even make sense to begin with; concepts are made by man, if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter if it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
Over the next year or so I'm sure it will be refined enough to act more like a vector multiplier on activations, but simply flipping a feature fully on is going to create a very 'obsessed' model, as stated.
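The "vector multiplier" version is basically activation steering; a rough sketch (the layer, the scale, and the random stand-in direction are all made up here; a real direction would come from an SAE decoder column or contrastive prompts):

    # Activation steering sketch: add a scaled "feature direction" to a middle
    # layer's hidden states instead of clamping it fully on.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    direction = torch.randn(768)                 # stand-in for a learned feature vector
    scale = 4.0                                  # the "equalizer" knob

    def steer(module, inputs, output):
        hidden = output[0]                       # GPT2Block returns a tuple; [0] is the hidden states
        return (hidden + scale * direction,) + output[1:]

    hook = model.transformer.h[6].register_forward_hook(steer)
    ids = tok("I think the city of", return_tensors="pt").input_ids
    print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
    hook.remove()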
I'll make a probably bad analogy: does your mindmap place things near each other like my mindmap?
To which I'd say, probably not: mindmaps are very personal, and the more complexity we put on ours, the more personal and arbitrary they would be, and the less import the visuals would have.
ex. if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put mcdonalds closer to kids food than restaurants, and you have restaurants in the top left, whereas I put it closer to kids foods, in the top mid left.
I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day.
Anthropic are really killing it right now, and it's very refreshing seeing their commitment to publishing novel findings.
I hope this finally serves as the nail in the coffin on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.
I don’t think this paper does much in the way of your final point, “it doesn’t understand what it’s saying”, though our understanding certainly has improved.
I find this statement... controversial?
The canonical example would be mathematics - is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it, and is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.
Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.
What kind of evidentiary threshold would you want if that's not sufficient?
More than that, I'd think a better 2D analogy for the latent space is a force-directed graph that you keep shaking as you add things to it. It doesn't seem unlikely for two such graphs, constructed in different order, to still end up identical in the end.
Thirdly:
> if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put mcdonalds closer to kids food than restaurants, and you have restaurants in the top left, whereas I put it closer to kids foods, in the top mid left.
In the 2D analogy, maybe, but that's because of limited space. In a 20,000-D analogy, there's no reason for our mind maps to meaningfully differ here; there are enough dimensions that terms can be close to other terms for any relationship you could think of.
https://news.ycombinator.com/item?id=40242939
I love seeing the work here -- especially the way that they identified a vector specifically for bad code. I've been trying to explore the way that we can use adversarial training to increase the quality of code generated by our LLMs, and so using this technique to get countering examples of secure vs. insecure code (to bootstrap the training process) is really exciting.
Overall, fascinating stuff!!
Perhaps at some point LLMs will start to evolve from the prompt->response model into something more asynchronous and with some activity happening in the background too.
It would make sense for the human mental latent spaces to also converge. The reason is that the latent space exists to model the environment, which is largely shared among humans.
No matter what, there will always be a group of people saying that. The power and drive of the brain to convince itself that it is weaved of magical energy on a divine substrate shouldn't be underestimated. Especially when media plays so hard into that idea (the robots that lose the war because they cannot overcome love, etc.) because brains really love being told they are right.
I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.
https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme...
Basically finding that transformers don't just store a world-model as in "what does the world that produce the observed inputs look like?", they store a "Mixed-State Presentation", basically a weighted set of possible worlds that produce the observed inputs.
Yes there is.
If you think all training runs converge to the same bits given the same output size, I would again stress that the visual dimensions analogy is poetics and extremely tortured.
If you're making the weaker claim that generally concepts sort themselves into a space and they're generally sorted the same way if we have the same training data. Or rotational symmetry means any differences don't matter. Or location doesn't matter at all...we're in poetics.
Something that really sold me when I was in a similar mindset was word2vec's king - man + woman = queen wasn't actually real or in the model. Just a way of explaining it simply.
Another thought from my physics days: try visualizing 4D. Some people do claim to, after much effort, but in my experience they're unserious, i.e. I didn't see PhDs or masters students in my program claiming this. No one tries claiming they can see in 5D.
While this research allows us to interpret larger models in an amazing way, it doesn’t mean the models themselves ‘understand’ anything.
You can use this on much smaller scale models as well, as they showed 8 months ago. Does that research tell us about how models understand themselves? Or does it help us understand how the models work?
This seems like it's trivially true; if you find two different features for a concept in two different languages, just combine them and now you have a "multilingual feature".
Or are all of these features the same "size"? They might be and I might've missed it.
Yes, maths is an interesting (and open) question. But also, the rules of maths are the result of some set of axioms — it's not clear to me[1] that the axioms we have are necessarily the ones we must have, even though ours are clearly a really useful set.
We put labels onto the world to make it easier to deal with, but every time I look closer at any concept which has a physical reality associated with it, I find that it's unclear where the boundary should be.
What's a "word"? Does hyphenation or concatenation modify the boundary? What if it was concatenated in a different language and the meaning of the concatenation was loaned separately to the parts, e.g. "schadenfreude"? Was "Brexit" still a word before it was coined — and if yes then what else is, and if no then when did it become a word?
What's a "fish"? Dolphins are mammals, jellyfish have no CNS, molluscs glue themselves to a rock and digest their own brain.
What's a "species"? Not all mules are sterile.
Where's the cut-off between a fertilised human egg and a person? And on the other end, when does death happen?
What counts as "one" anglerfish, given the reproductive cycle has males attaching to and dissolving into the females?
There's only a smooth gradient with no sudden cut-offs going from dust to asteroids to minor planets to rocky planets to gas giants to brown dwarf stars.
There aren't really seven colours in the rainbow, and we have a lot more than five senses — there's not really a good reason to group "pain" and "gentle pressure" as both "touch", except to make it five.
[0] giving rise or likely to give rise to public disagreement
[1] however this is quite possibly due to me being wildly oblivious; the example I'd use is that one of Euclid's axioms turned out to be unnecessary, but so far as I am aware all the others are considered unavoidable?
> I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brains banality can move things forward.
Indeed, though as we don't know what we're doing (and have 40 definitions of "consciousness" and no way to test for qualia), I would add that the first AI we make with these properties will likely suffer from every permutation of severe and mild mental health disorder that is logically possible, including many we have no word for because they would be incompatible with life if found in an organic brain.
The virus does not hate you, nor does it love you, but you are made of atoms which it can use for something else.
Models don't do that though; they only do it if you run them in a loop with tools they can call, so mostly they don't do that.
Or, in other words, I think absolute coordinates of any concept in the latent space are irrelevant and it makes no sense to compare them between two models; what matters is the relative position of concepts with respect to other concepts, and I expect the structures to be similar here for large enough datasets of real text, even if those data sets are disjoint.
(More specific prediction: take a typical LLM dataset, say Books3 or Common Crawl, randomly select half of it as dataset A, the remainder is dataset B. I expect that two models of the same architecture, one trained on dataset A, other on dataset B, should end up with structurally similar latent spaces.)
> Something that really sold me when I was in a similar mindset was word2vec's king - man + woman = queen wasn't actually real or in the model. Just a way of explaining it simply.
Huh, it seems I took the opposite understanding from word2vec: I expect that "king - man + woman = queen" should hold in most models. What I mean by structural similarity could be described as such equations mostly holding across models for a significant number of concepts.
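For what it's worth, that arithmetic is directly checkable with gensim (the particular pretrained vectors are just a convenient choice, and the result is approximate):

    # The classic check: is vec("king") - vec("man") + vec("woman") nearest to vec("queen")?
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")   # any word2vec/GloVe keyed vectors work
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # Typically prints 'queen' at or near the top -- so the relation is "in the model",
    # at least approximately, even if the clean equation oversells it.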
Which exactly are we talking about here?
Because no, the research doesn't say much about the former, but yes, it says a lot about the latter, especially on top of the many, many earlier papers working in smaller toy models demonstrating world modeling.
That's also a description of DNA and RNA. They're chemicals, not magic.
And there's loads of people all too eager to put any and every AI they find into such an environment[0], then connect it to a robot body[1], or connect it to the internet[2], just to see what happens. Or have an AI or algorithm design T-shirts[3] for them or trade stocks[4][5][6] for them because they don't stop and think about how this might go wrong.
[0] https://community.openai.com/t/chaosgpt-an-ai-that-seeks-to-...
[1] https://www.microsoft.com/en-us/research/group/autonomous-sy...
[2] https://platform.openai.com/docs/api-reference
[3] https://www.theguardian.com/technology/2013/mar/02/amazon-wi...
[4] https://intellectia.ai/blog/chatgpt-for-stock-trading
[5] https://en.wikipedia.org/wiki/Algorithmic_trading
[6] https://en.wikipedia.org/wiki/2007–2008_financial_crisis
I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.
Constrained decoding (like forcing the answer to conform to JSON grammar) is an example of a real solution, and past that it's mostly the same as other software security.
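The idea in miniature, for anyone who hasn't seen it: mask out every token the grammar forbids before sampling. This sketch uses a digits-only "grammar" and a small stand-in model rather than full JSON, purely to keep it short; real JSON-grammar decoders do the same thing with a much richer state machine:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # "Grammar": only tokens that decode to digits are legal.
    allowed = [i for i in range(len(tok)) if tok.decode([i]).strip().isdigit()]
    ids = tok("The answer is", return_tensors="pt").input_ids
    for _ in range(5):
        logits = model(ids).logits[0, -1]
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed] = 0.0                        # only grammar-legal tokens survive
        next_id = (logits + mask).argmax().item()
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

    print(tok.decode(ids[0]))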
I disagree; that's simply the behaviour of one of the best consumer-facing AIs, the one that gets all the air-time at the moment. (Weirdly, loads of people even here talk about AI like it's LLMs, even though diffusion-based image generators are also making significant progress and being targeted with lawsuits.)
AI is automation — the point is to do stuff we don't want to do for whatever reason (including expense), but it does it a bit wrong. People have already died from automation that was carefully engineered but which still had mistakes; machine learning is all about letting a system engineer itself, even if you end up making a checkpoint where it's "good enough", shipping that, and telling people they don't need to train it any more… though they often will keep training it, because that's not actually hard.
We've also got plenty of agentic AI (though as that's a buzzword, bleh, lots of scammers there too), independently of the fact that it's very easy to use even an LLM (which is absolutely not designed or intended for this) as a general agent just by putting it into a loop and telling it the sort of thing it's supposed to be agentic with regards to.
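And "putting it into a loop" really is about this much code; a sketch where chat() and the single tool are obviously placeholders:

    # Minimal "agent" scaffold: the model only becomes agent-ish because this outer
    # loop keeps feeding its own output back in and executes whatever tools it names.
    def chat(history):
        return "TOOL:search:latest AI safety papers"   # really: an LLM completion

    tools = {"search": lambda q: f"(pretend search results for '{q}')"}

    history = ["GOAL: summarise this week's AI safety news"]
    for _ in range(3):                                  # the loop is the "agent"
        reply = chat(history)
        history.append(reply)
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            history.append(tools[name](arg))            # feed tool output back in

    print("\n".join(history))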
Even with constrained decoding, so far as I can tell the promises are merely adverts, while the reality is that these things are only "pretty good": https://community.openai.com/t/how-to-get-100-valid-json-ans...
(But of course, this is a fast-moving area, so I may just be out of date even though that was only from a few months ago).
However, the "it's only pretty good" becomes "this isn't even possible" in certain domains; this is why, for example, ChatGPT has a disclaimer on the front about not trusting it — there's no way to know, in general, if it's just plain wrong. Which is fine when writing a newspaper column because the Gell-Mann amnesia effect says it was already like that… but not when it's being tasked with anything critical.
Hopefully nobody will use ChatGPT to plan an economy, but the point of automation is to do things for us, so some future AI will almost certainly get used that way. Just as a toy model (because it's late here and I'm tired), imagine if that future AI decides to drop everything and invest only in rice and tulips 0.001% of the time. After all, if it's just as smart as a human, and humans made that mistake…
But on the "what about humans" perspective, you can also look at the environment. I'd say there are no evil moustache-twirling villains who like polluting the world (of course there genuinely are people who do that "to own the libs", but they are not the main source of pollution in the world); mostly it's people making decisions that seem sensible to them and yet which collectively damage the commons. Plenty of reason to expect an AI to do something that "seems sensible" to its owner but damages the commons, even if the human is paying attention, which they're probably not doing, for the same reason 3M shareholders probably weren't looking very closely at what 3M was doing: "these people are maximising my dividend payments… why is my blood full of microplastics?"
Last week, the post about jailbreaking ChatGPT(?) talked about turning off a direction in possibility-space to disable the "I'm sorry, but I can't..." message.
In a regular program, it would be a boolean variable, or a single ASM instruction.
And you could ask the same thing. "How does my program have an off switch if there aren't enough values to store all possible meanings of "off"? Does my off switch variable map to your off switch variable?"
And the answer would be yes, or no, or it doesn't matter. It's a tool/construct.
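The boolean-variable intuition maps surprisingly well onto how that jailbreak-style work operates: estimate a direction and project it out of the hidden states. A sketch (the direction here is random noise just to show the mechanics; really it's estimated from contrasting prompts):

    # "Switching off" a behaviour direction: project it out of each layer's hidden states.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    refusal_dir = torch.nn.functional.normalize(torch.randn(768), dim=0)  # stand-in direction

    def ablate(module, inputs, output):
        h = output[0]
        h = h - (h @ refusal_dir).unsqueeze(-1) * refusal_dir   # remove the component along the direction
        return (h,) + output[1:]

    hooks = [block.register_forward_hook(ablate) for block in model.transformer.h]
    ids = tok("I'm sorry, but I", return_tensors="pt").input_ids
    print(tok.decode(model.generate(ids, max_new_tokens=15)[0]))
    for hk in hooks:
        hk.remove()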
- LLMs just got a whole set of buttons you can push. Potential for the LLM to push its own buttons?
- Read the paper and ctrl+f 'deplorable'. This shows once again how we are underestimating LLMs' ability to appear conscious. It can be really effective. Reminiscent of Dr. Ford in Westworld: 'you (robots) never look more human than when you are suffering.' Or something like that, anyway. I might be hallucinating dialogue, but I'm pretty sure something like that was said, and I think it's quite true.
- Intensely realistic roleplaying potential unlocked.
- Efficiency by reducing context length by directly amplifying certain features instead.
Very powerful stuff. I am eagerly waiting for when I can play with it myself. (Someone please make it a local feature)
- Given 2 word embedding sets,
- For each pair (A,B) of embeddings in one set,
- There exists an equivalence (A’,B’) in the other set,
- Such that dist(A,B) ≈ dist(A’, B’),
Something like that, to start. But would need to look at longer chains of relations.
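The starting version of that check is short in code (random matrices stand in for two models' embeddings of a shared vocabulary; a real test would load the actual embedding tables):

    # Rough check of "relations are preserved": do pairwise distances in one embedding
    # set correlate with distances between the corresponding items in the other?
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(500, 300))       # model A: 500 shared words, 300 dims
    emb_b = rng.normal(size=(500, 128))       # model B: same words, different space

    rho, _ = spearmanr(pdist(emb_a), pdist(emb_b))
    print(f"distance-structure correlation: {rho:.3f}")   # ~0 for random; high if structure is shared

Longer chains of relations would need something more like an analogy test set or an orthogonal Procrustes alignment between the two spaces.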