This, this is the problem for me with people deep in AI. They think it’s the end-all, be-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically almost the same; their brain is mentally bridging the concepts and telling them it’s only a matter of time.
To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course) just straight-up lying to the masses, for financial gain, of course.
Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
And the enthusiasm of the people who are into it feeds into those who aren’t aware of it in the slightest: they see you can chat with a ‘robot’, they hear all this hype from their peers, and they buy into it. We are social creatures, after all.
I think using any of this in a national security setting is stupid, wasteful and very, very insecure.
Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical BS.
What about AI enabled drones and guided missiles/rockets? The case for their effectiveness is relatively simple in terms of jamming resistance.
Indeed. People are welcome to go "all in" on whatever nonsense gambling they want to do with their personal investments, but national security demands actually thinking about things - adversarially. Because the enemy will as well.
It's perfectly possible to lose a war by investing in expensive superweapons that underdeliver. The Nazis were particularly bad at this.
If we're not talking exclusively about cyber war, such as finding and exploiting vulnerabilities, then for the time being national security will still rest on traditional armies.
Just a few weeks ago, Italy announced a €16bn plan to buy >1,000 Rheinmetall IFVs. That alone would make Italy's army one of the best equipped in Europe. I can't imagine what would happen with a $500bn investment in defense, lol. I don't agree with what Meloni's government is doing, but the minister I agree with most is the defense minister, Crosetto.
Furthermore, what is being shown, at least for the time being, is that open source can be and is crucial in developing better models. This collides with the big, single "winner takes all" VC mentality (because let's be honest, these defense pitches are still made by startup/VC bros).
1) Have a camera on your drone.
2) Run some frames through a locally running version of something like AWS Rekognition's celebrity identification service, but for relevant military targets.
3) Navigate towards the coordinates of target individuals.
It isn't exactly magic, here's a video of a guy doing navigation with openCV on images: https://www.youtube.com/watch?v=Nrzs3dQ9exw
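To make the point concrete, here is a minimal sketch of steps 1-3, using OpenCV's bundled Haar face cascade as a stand-in for the actual recognition model; send_steering_command() is a hypothetical flight-controller hook that here just prints the offsets:

    import cv2

    # Stand-in detector: the Haar face cascade that ships with OpenCV.
    # A real system would swap in its own locally running recognition model.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def send_steering_command(dx: float, dy: float) -> None:
        # Hypothetical flight-controller hook; here we just print the offsets.
        print(f"steer dx={dx:.0f}px dy={dy:.0f}px")

    cap = cv2.VideoCapture(0)  # the drone's camera feed

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue
        x, y, w, h = boxes[0]
        # Steering signal: offset of the detection centre from the frame centre.
        dx = (x + w / 2) - frame.shape[1] / 2
        dy = (y + h / 2) - frame.shape[0] / 2
        send_steering_command(dx, dy)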
I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...
Yeah, I was thinking about this while trying to figure out author affiliations.
There was a Stanford paper a few years ago that dusted off some old intelligence concepts and the authors seemed excited about it.
But given the pace of AI, it's difficult to look in new directions. It will probably take an AI winter and some unbridled enthusiasm immune to burnout to make some real progress outside of feed forward neural networks.
Is (human-based) general intelligence not one of the fundamental enabling elements of literally every human activity throughout history, regardless of how many layers of automation and technology one has to peel back to get to it?
Can you maybe imagine how the ability to create arbitrary amounts of general intelligence, completely divorced from the normal lengthy biological process, could upend that foundation of human activity?
> They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically almost the same; their brain is mentally bridging the concepts and telling them it’s only a matter of time.
I've found that most AI-related movies exclusively focus on "quality ASI" scenarios, which are mostly irrelevant to our current state of the world, as an immense amount of danger/value/disruption will arrive with AGI. People who are seriously reasoning about the impacts of AGI are not using movies as references. "Those stupid movie watching idiots" is just a crutch you are using to avoid thinking about something that you disagree with.
> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.
(sorry for the long quote)
I will say (naively perhaps) "oh, but that is fairly simple". For any API request, add a 5-second delay before the next one for 'unverified' users. Make verification a "blue check" (à la X/Twitter). For the 'big sales', have a third-party vetting process, so that if US Corporation XYZ wants access, they prove themselves worthy/not a Chinese competitor and then you do give them the 1000/min deal.
For everyone else, add the 5-second (or whatever other duration makes sense) timer/overhead and watch them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on), then you mass-block them.
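A rough sketch of the kind of throttle I mean (numbers and names are placeholders; a real one would reset and persist the counters):

    import time
    from collections import defaultdict

    DAILY_CAP = 500         # unverified accounts
    UNVERIFIED_DELAY = 5.0  # seconds of added overhead per request

    requests_today = defaultdict(int)  # reset this counter once a day

    def allow_request(account_id: str, verified: bool) -> bool:
        """Return True if the request may proceed."""
        if verified:
            return True  # vetted accounts keep the 1000/min deal
        if requests_today[account_id] >= DAILY_CAP:
            return False  # hard cap closes the back door
        time.sleep(UNVERIFIED_DELAY)  # the 5-second overhead for everyone else
        requests_today[account_id] += 1
        return True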
Remember that the Lisp AI-lab people were working on unsolved problems on absolute potatoes of computers back in the day. We have a semblance of a solution now, but so much of the progress has been brute force (even if there have been real improvements in the field).
The big question is whether this insane spending has pulled the rug out from under real progress if we head into another AI winter of disillusionment, or whether there is enough real progress just around the corner to give investors hope in a post-DeepSeek valuation hangover.
The hard part now is relatively trivial. Does anyone think that there is a fundamental and profound discovery that evolution made purely by selection in the last 200,000 years? I mean a true qualitative difference?
Sure: we call it language, which is just another part of a fancy animal’s tool kit.
Does anyone think there is an amazing qualitative difference between the brain of a chimp and the brain of a human?
No, not if they know any biology.
(Although that does not stop some scientists from looking for a “language gene” like FOXP2.)
So what did dumb mutations and 200,000 years of selection do that a group of dedicated AI scientists cannot do with their own genuine general intelligence?
Nothing, nothing other than putting a compact, energy-efficient LLM with reinforcement learning on a good robotic body and letting it explore and learn like we did as infants, toddlers and teenagers.
Each one of us has experienced becoming a “general intelligence”. I remember it hit me on the head in 6th grade when I dreamed up a different way of doing long division. I remember thinking: “How did I think that?” And each one of us who has watched an infant turn into a toddler has watched it as an observer or teacher. This is what makes babies so fascinating to “play” with.
We have to give our baby AGI a private memory and a layer of meta-attention like we all gain as we mature, love, and struggle.
I read the linked article and, as a neuroscientist, I realized the “wait” cycles that improved performance so much are roughly equivalent to the prefrontal cortex: the part of the CNS most responsible for enabling us to check our own reasoning recursively. Delay, as in delayed gratification, is a key attribute of intelligent systems.
We are finally on the doorstep of Hofstadter’s Strange Loop and Maturana and Varela’s “enactive” systems, but now implemented in silicon, metal, and plastic by us rather than by dumb but very patient natural selection.
Karl Friston and Demis Hassabis (two very smart neuroscientists) figured this out years ago. And they were preceded by three other world-class neuroscientists: Humberto Maturana, Francisco Varela, and Rich Sutton (honorary neuroscientist). And big credit to Terry Winograd for presaging this path forward long ago too.
I dismissed the X references to S1 without reading them; big mistake. I have been working generally in AI for 40 years and in neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.
Reduction in processing and memory requirements for running models is incredible. I have been personally struggling with creating my own LLM-based agents with weaker on-device models (my same experiments usually work with 4o-mini and above models) but either my skills will get better or I can wait for better on device models.
I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote it managed to get web-search tool calling working with a very small model - something that I have been trying to perfect.
> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
I had found a few days ago that this lets you 'inject' your own CoT and jailbreak it more easily. Maybe these are related?
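Mechanically, the trick can be sketched with a plain Hugging Face decoding loop (just an illustration, not the s1 code; the model name is a placeholder, and it assumes the chain of thought is delimited by "</think>"):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "some-reasoning-model"  # placeholder
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    end_think = tok.encode("</think>", add_special_tokens=False)
    wait = tok.encode("Wait", add_special_tokens=False)

    ids = tok.encode("<think>Let me work this out.", return_tensors="pt")
    forced = 0

    # Greedy decoding without a KV cache, for clarity only.
    while ids.shape[1] < 2048:
        next_id = int(model(ids).logits[0, -1].argmax())
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        tail = ids[0, -len(end_think):].tolist()
        if tail == end_think and forced < 2:
            # Strip the end-of-thinking marker and splice in "Wait" instead.
            ids = torch.cat([ids[:, :-len(end_think)], torch.tensor([wait])], dim=1)
            forced += 1
        elif tail == end_think:
            break  # budget used up: let the thinking end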
If there is another AI winter, it will be more like the dotcom bubble: lots of important work got done during the dotcom bubble, and many of the big tech companies started from the fruits of that labor in the decade after the bubble burst.
These reasoning models feed into OP's last point about the NVIDIA and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.
This is not "just" machine learning, because we have never before been able to do the things we can do today, and this is not only the result of better hardware. Better hardware is actually a byproduct: why build a PFLOPS GPU when there is nothing that can utilize it?
If you spare some time and read through the actual (scientific) papers behind the multiple generations of LLMs, the first one being from Google (not DeepMind) in 2017, you might come to understand that this is no fluff.
And I'm saying this from the position of a software engineer, without bias.
The reason all of this took off at such high speed is the not-quite-expected results: early LLM experiments showed that "knowledge" under the current transformer architecture scales roughly linearly with the amount of compute and training time, etc. That was very unexpected, and to this day scientists do not have an answer for why this even works.
So, after reading a bunch of material, I am inclined to think that this is something different. The future of loading a codebase into the model and asking it to explain the code to me or fix bugs has never been so close and realistic. For better or worse.
> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
That sounds to me like dismissing the idea that a Russian SSBN might cross the Pacific and nuke Los Angeles because "submarines can't swim". Even if the machine learning isn't really intelligent, it is still capable of performing IF..THEN..ELSE operations, which could have detrimental effects for [some subset of] humans.
And even if you argue that such a machine _shouldn't_ be used for whatever doomsday scenario would harm us, rest assured that someone, somewhere, who either does not understand what the machines are designed to do or just pretends that they work like magic, will put the machines in a position to make such a decision.
For all that critics of AI dismiss them as lacking imagination, your reaction suggests a lack of imagination.
Off the top of my head: facial recognition and identification to make "smart" guns that hit specific targets with reduced collateral damage (as found on most digital cameras even before smartphones); creating and A/B testing propaganda campaigns; using modified wifi signals as wall-penetrating radar capable of pose estimation, heart rate and breathing monitoring [0]; taking any self-driving car's AI and conditionally inverting the part that says "don't hit pedestrians" when a certain target is spotted; ANPR to track specific vehicles with known owners over long distances; alternative targeting systems for cruise missiles in the absence or jamming of GPS; using them as red teams in war-game exercises; using them to automate intrusion detection by monitoring for changes to background distributions of basically every measurable event; person-tracking by watching CCTV in secure areas; control systems for security robots (think Boston Dynamics' Spot) that are currently in deployment.
There's likely a lot more, too.
[0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...
"Is Data genuinely sentient or is he just a machine with this impression" was a repeated plot point in TNG.
https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...
https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...
https://en.wikipedia.org/wiki/The_Ensigns_of_Command
https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...
Similar with The Doctor on VOY.
Even then, what we have with LLMs is basically already at the level of the ship's main computer as it was written in TNG/DS9/VOY.
But what counts as personhood? No idea. We're as ignorant as of that as the writers of the TV show; humanity's best attempts at philosophy of mind are still making the same kinds of arguments today as we were a century ago, the only real shift is to be slightly more inclusive of other species besides Homo Sapiens, and the only real improvement over two centuries ago is to include all of Homo Sapiens.
So, target practice for a beyond-the-horizon missile system launched ground-to-ground or air-to-ground? As an attacking force, conventional ground forces and tactics are a non-runner in a modern theatre of operations when faced with air and drone support. This is why no single EU country is incentivised to dump money into any single area - the only probable defense would be against the USA/Russia/China to begin with.
The US proved it beyond doubt in Afghanistan - partisans simply haven't a chance against a gunship with IR or NV optics; the last time they levelled the playing field against air interdictors was in Charlie Wilson's Afghanistan when the Mujahideen took on that era of Soviet gunships with hand-held AA systems.
> Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.
I think the "fake intelligence can replace a human" needs more support in general. We know how human intellect works practically (not theoretically) and we know how to apply it in different scenarios. We're still far from knowing how "fake intelligence" works and how to apply it to different scenarios.
This is AI, and trend lines point to an intelligence that matches or barely exceeds human intellect in the future.
You’re part of a trend of people in denial. When LLMs first came out, there were hordes of people on HN claiming they were just stochastic parrots and displayed zero intellectual ability. It is now abundantly clear that this is not true.
We don’t fully understand LLMs. That’s why gains like CoT are just black-box adjustments that come from changing external configurations. We have no way to read the contents of the black box and make adjustments off of it. Yet idiots like you can make such vast and hard claims when nobody really fully understands these things. You’re delusional.
I agree that LLMs won’t allow us to make some super weapon to give us some edge in national security.
If a human uses such a paradigm successfully, it is the same.
However, once that happens, you can run it (and any GGUF model) from Hugging Face![0]
S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.
They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.
We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.
S1 is a rushed hack job (they didn't even run most of their evaluations with an excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and then pivoted into this mess.
Are they actually, though? Presently yes, but are they actually driving ROI? Or are they just an asset nobody is really meaningfully utilizing, but that helps juice the stocks?
LLMs are changing how I see reality.
I believe Data and the Doctor would both be people to me. But is Minuet? The woman created by the Bynars to distract Riker so they could steal the ship? Star Trek goes back and forth on who counts as a person. This is one of the points made in the episode where the Doctor learns his hologram has been cloned many times to work as a miner on an asteroid, and he says this is slavery. Does Minuet get to say she doesn’t want to seduce Riker if she doesn’t feel like it? Or is she a program that does whatever the programmer wants?
But moreover, linking back to the original technology, does EQTransformer have awareness? Do we treat it as a thinking being that we enslaved to find earthquakes in data because it uses the same technology as ChatGPT? I don’t think anyone thinks we should, but it seems like that is where people want to take this. That’s strange to me. At what point does a data pipeline doing some kind of stochastic transformation and summarization of training data become an individual that presents a desire for autonomy like Data or the Doctor?
I think there’s lots of questions here to answer and I don’t know the answers to them.
Not having tanks is a big no-no given the latest events. I don't see the grudge you hold against them, when it's clear they wouldn't be used in the same flat scenario as Ukraine or Poland. Given that Italy is highly mountainous, it made sense to prioritize the air force and navy first. I think they're also compatible with anti-drone guns.
Piaggio Aerospace was acquired a few weeks ago by Baykar; hopefully it will produce drones.
I don't get why you're saying that a change in the approach to war makes tanks not useful anymore, when it's clear they will eventually be adapted to counter drones and similar threats.
I’d go so far as to add one more layer to monitor this one and stop adding layers. My thinking is that this meta awareness is all you need.
No data to back my hypothesis up. So take it for what it’s worth.
Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.
You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.
Those who think we will quell our thirst for compute have another thing coming, we’re going to be insatiable with how much LLM brute force looping we will do.
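A minimal sketch of that loop, with call_llm() standing in for whichever inference endpoint you use (a placeholder, not any particular API):

    # Each "tick" we critique the current draft and regenerate; the clay gets
    # spun and re-shaped until it looks right (or we run out of ticks).

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # plug in your model of choice here

    def sculpt(task: str, ticks: int = 10) -> str:
        draft = call_llm(f"Attempt this task:\n{task}")
        for _ in range(ticks):
            critique = call_llm(
                f"Task:\n{task}\n\nDraft:\n{draft}\n\nWhat is wrong with this draft?")
            draft = call_llm(
                f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
                "Write an improved draft.")
        return draft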
Wait, actually 1 + 1 equals 1.
A little context about you:
- person
- has hands, reads HN
These few state variables are enough to generate a believable enough frame in your rendering.
If the rendering doesn’t look believable to you, you modify state variables to make the render more believable, eg:
Context:
- person
- with hands
- incredulous demeanor
- reading HN
Now I can render you more accurately based on your “reasoning”, but truly I never needed all that data to see you.
Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.
And thus, by this logic, every company in the world should just be buying as many servers as they can get their hands on, because More Servers = More Value.
Obviously, this is not happening. It doesn't take much analysis to start listing the many and manifold reasons why. Many of those reasons will apply to GPUs as well. Just as if everything in AWS got 10x faster, overnight, this would not create a situation where everyone suddenly starts grabbing more servers in AWS. Obviously everyone would start trimming down, even if perhaps in a few years time they'd find some way to use this burst of power such that they can use more later. This can't happen overnight, though. It would take time, and not "weeks" or "months" but "years" at scale.
Incorporating the important variable of time in the analysis, if AIs become literally hundreds of times cheaper to run, today, then it is perfectly logical that the near-term demand for the hardware to run them is also going to go way, way down. However much potential AI may have, it is fairly clear looking out at the AI landscape right now that there isn't really anyone out there unlocking vast amounts of value and sitting there wringing their hands because they just can't get more GPU compute. The GPU rush has been from fear that someone will figure out how to "really" unlock AI and then they'll be stuck without the hardware to compete.
It may be the case that vastly cheaper AI will in fact be part of unlocking that value, and that as the AI industry grows it will grow faster as a result... but that's still going to be on a multi-year time frame, not a tomorrow time frame. And all those GPUs and all those valuations are still broadly based on them being valuable real soon now, not in a few years, and all those GPU purchases are on the assumption they need them now, or on a timeframe where we can't be waiting around, rather than waiting for some rounds of exponential doublings to bring price down. The hardware curve in 5 years may be higher but the curve in the next year would be lower, and by a lot.
And, you know, who's to say we're done? I doubt there's another 100x in there, but is someone going to eke out another 2x improvement? Or a 10x improvement? Making it easier to run lots of experiments makes it much more likely for that to happen. I'm skeptical of another 10x general improvement but 10x improvements for specific, important use cases I can't rule out.
Edit: I should also point out this is an extremely common pattern in technology in general. Often the very hardest part is producing a thing that does a particular task at all. Once we have it in hand, once we can use it and learn how it operates and what its characteristic operating modes are, once we can try modifications to it in the real world and see what happens, optimizing it becomes much easier, sometimes explosively so by comparison. Taking any first iteration of a tech that is practical and then trying to straight-line demand based on it is silly, in all sorts of ways and all directions. The internal combustion engine, for example, has had a myriad of impacts on the world and certainly after various improvements many, many millions if not billions of them have been made... but any company that reacted to the first couple of cars and just went ballistic buying those first-generation internal combustion engines would have lost everything, and rather quickly.
This is just standard decoding; the stream of vectors is called the k/v cache.
If not: what am I intended to take away from this? What is its relevance to my comment?
It's also not so far from Meta's Large Concept Model idea.
The $90,000 bag of bushings becomes a $300,000 bag?
What paper are you talking about?
The mechanism is never explained, but no, it doesn't always understand correctly — and neither does Data. If hologram-Moriarty is sentient (is he?), then the capability likely exceeds what current LLMs can do, but the cause of the creation is definitely a misunderstanding.
Even in the episode where that happens, the script for Dr. Pulaski leading up to Moriarty's IQ boost used exactly the same arguments as are used against LLMs: https://www.youtube.com/watch?v=4pYDy7vsCj8
(Common trope in that era being that computers (including Data) are too literal, so there was also: https://www.youtube.com/watch?v=HiIlJaSDPaA)
Similar with every time the crew work iteratively to create something in the holodeck. And, of course: https://www.youtube.com/watch?v=srO9D8B6dH4
> I do not think that resembles the current crop of LLMs. On voyager the ships computer is some kind of biological computing entity that they eventually give up on as a story topic but there is an episode where the bio computing gel packs get sick.
"Take the cheese to sickbay" is one of my favourite lines from that series.
> But is minuet?
I would say the character was a puppet, with the Bynars pulling the strings, because the holo-character was immediately seen as lacking personhood the moment they stopped fiddling with the computer.
Vic Fontaine was more ambiguous in that regard. Knew he was "a lightbulb", but (acted like) he wanted to remain within that reality in a way that to me felt like he was *programmed* to respond as if the sim around him was the only reality that mattered rather than having free will in that regard.
(But who has total free will? Humans are to holograms as Q is to humans, and the main cast were also written to reject "gifts" from Riker that time he briefly became a Q).
The villagers of Fair Haven were, I think, not supposed to be sentient (from the POV of the crew), but were from the POV of the writers: https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager... and https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...
> does eqtransformer have awareness?
There's too many different definitions for a single answer.
We don't know what part of our own brains gives us the sensation of our own existence; and even if we did, we wouldn't know if it was the only mechanism to do so.
To paraphrase your own words:
At what point do chemical pipelines doing some kind of stochastic transformation and electrochemical integration of sensory input become an individual that presents a desire for autonomy like Data or the Doctor?
I don't know. Like you, I'd say:
> I think there’s lots of questions here to answer and I don’t know the answers to them.
[41 comments, 166 points] https://news.ycombinator.com/item?id=42919597
There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made them even better at doing that. But there is a cap to that.
Just like how you can compress a file by a lot, there is a theoretical maximum to the amount of compression before it starts becoming lossy. There is also a theoretical maximum of relevant information from a model regardless of how long it is forced to think.
I just said "Hello!" and it went off the rails.
Couldn't they just start hiding the thinking portion?
It would be easy for them to do this. Currently they already provide one-sentence summaries for each step of the thinking; I think users would be fine, or at least stay, if it were changed to provide only that.
For example, xenophobia as a response to economic hardship is the wrong chain of thinking embedded in the larger zeitgeist.
It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously now we have models that are in the hundreds of billions / trillions of parameters. The ability to train these models is directly a result of better / more hardware being applied to the problem, as well as the Transformer architecture being specifically designed to better conform with parallel computation at scale.
The first GPT model came out ~ 8 years ago. I recall when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for, looking back now that's kind of amusing. However, fundamentally, all these models are the same setup as what was used then, decoder based Transformers. They are just substantially larger, trained on substantially more data, trained with substantially more hardware.
I hope it gets tested further.
* The $5M DeepSeek-R1 (and now this cheap $6 s1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (Maybe not as effective, but still.)
* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?
* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?
* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?
[[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.
> reasoning models which do AI research

In the introduction to my research project on Accelerationism [0], I write:
Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development — Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]
At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

[1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism. Jacobite Magazine.
And I think the second point is due to The Market thinking there is no need to spend ever-increasing amounts of compute to get to the next level of AI overlordship.
Of course, the Jevons paradox is also all over the news these days...
Isn't it because NVIDIA insists on only using the latest nodes from a single company (TSMC) for manufacture?
I don't understand why we can't use older process nodes to boost overall GPU making capacity.
Can't we have tiers of GPU availability, some on cutting-edge nodes, others built on older Intel and Samsung nodes?
Why is Nvidia not diversifying aggressively to Samsung and Intel no matter the process node.
Can someone explain?
I've heard packaging is also a concern, but can't you get Intel to figure that out with a large enough commitment?
(Also, I know NVIDIA has some capacity on Samsung. But why not go all out, even using Global Foundries?)
Data parallelism, model parallelism, parameter server to workers, MoE itself can be split up, etc.
But even if it wasn’t, you can simply parallelize training runs with slight variations in hyperparameters. That is what the article is describing.
This is the most important point, and why DeepSeek’s cheaper training matters.
And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own but because their training was so cheap and streamlined they could try a bunch of things.
Chiplets have slowed the slowdown in AI, but you can see in the gaming space how much things have slowed to get an idea of what is coming for enterprise.
Before LLMs we had n-gram language models. Many tasks like speech recognition worked as beam search in the graph defined by the n-gram language model. You could easily get huge accuracy gains simply by pruning your beam less.
s1 reminds me of this. You can always trade off latency for accuracy. Given that these LLMs are much more complex than good old n-grams, we're just discovering how to make this trade.
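For anyone who missed the n-gram era, the knob being described is literally just the beam width; a generic sketch, where a larger beam_width costs latency and (usually) buys accuracy:

    import heapq
    import math

    def beam_search(start, expand, beam_width, max_len):
        # expand(seq) yields (next_symbol, probability) pairs for a partial hypothesis.
        beam = [(0.0, [start])]
        for _ in range(max_len):
            candidates = [
                (logp + math.log(p), seq + [nxt])
                for logp, seq in beam
                for nxt, p in expand(seq)
            ]
            if not candidates:
                break
            beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return max(beam, key=lambda c: c[0])[1]  # best-scoring hypothesis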
Because that's what gets the clicks...
Saying they spent a boatload of money on the initial training + iteration + final fine-tuning isn't as headline grabbing as "$5 million trained AI beats the pants off the 'mericans".
The traces are generated by Gemini Flash Thinking.
8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.
If you were a reporter who didn't necessarily understand how to value a particular algorithm or training operation, but you wanted a simple number to compare the amount of work OpenAI vs. Google vs Facebook are putting into their models, yeah, it makes sense. How many petaflops their datacenters are churning through in aggregate is probably correlated to the thing you're trying to understand. And it's probably easier to look at their financials and correlate how much they've spent on GPUs to how many petaflops of compute they need.
But when your investors are giving you more money based on how well they perceive you're doing, and their perception is not an oracle but is instead directly based on how much money you're spending... the GPUs don't actually need to do anything other than make number go up.
I don’t believe computer science has the algorithms to handle this new paradigm. Everything was about sequential deterministic outputs, and clever ways to do it fast. This stuff is useless at the moment. We need new thinkers on how to not think sequentially or how not to think about the universe in such a small way.
Verifying input/output pairs is the old way. We need to understand differently going forward.
Even at the height of the Cold War there was always a human between <leader presses button> and <nukes go aflyin'>.
--edit--
...which has me wondering if a president even has the constitutional authority to destroy the entire planet and if one could interpret their command as a 'lawful order'. Makes one think.
The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless, it's just that the training commandeers random fragments of that space to link up the logic they need and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you ever might want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.
These bogus <wait> tokens kind of demonstrate that the models are sort of desperate to escape the limitations imposed by the limited processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live with if it's all you have!
(Apologies for the extended anthropomorphizing.)
> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.
> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.
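The arithmetic roughly checks out, for what it's worth: 16 H100s x 26 minutes is about 7 H100-hours, so $6 implies roughly $0.85-0.90 per H100-hour, the very cheap end of the market; at the ~$3/hour figure another commenter assumes, it would be closer to $20. Either way, pocket change next to the headline training runs.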
I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, ie. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution
It also feels analogous to navigating a tree, the more time you have to explore the nodes, the bigger the space you'll have covered, hence higher chance of getting a more optimal solution
At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it
Transformers were aimed at solving the "context" problem, and the authors, aware that RNNs don't scale at all and don't solve that particular problem either, had to come up with an algorithm that overcomes both of those issues. It turned out that the compute scalability of self-attention was the crucial ingredient, something RNNs were totally incapable of.
They designed the algorithm to run on the hardware they had available at that time, but the hardware developed afterwards was a direct consequence, or as I called it a byproduct, of transformers proving themselves able to continuously scale. Had that not been true, we wouldn't have all those iterations of NVIDIA chips.
So, although one could say that the NVidia chip design is what enabled the transformers success, one could also say that we wouldn't have those chips if transformers didn't prove themselves to be so damn efficient. And I'm inclined to think the latter.
However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.
If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with less parameters to come close to the performance of the original model.
Yet this is what happens - the distilled or quantized models often come very close to the original model.
So I think there are still many low-hanging fruits to pick.
But, I'm not going to value an operating construction company based on how many shovels or excavators they own. I'm going to want to see them putting those assets to productive use.
That's why "those chips are very valuable" is not necessarily a good way to value companies - it only is if they can extract the value from the chips before they become worthless.
> But they usually get replaced.
They usually produce enough income to cover depreciation so you actually have the cash to replace them.
Those models can be trained in a way tailored to produce good results on specific benchmarks, making them way less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.
>>> hello!
Hello there! I'm here to help with any questions you might have. How can I assist you today? If you're asking a question, feel free to provide some details about what you need help with. I'll do my best to answer your question or guide you through the process. Just let me know!
<|im_start|>think
The user has asked "hello! " and provided no further information. Answer here is a direct response to the user's request for information, which was essentially just a greeting. The appropriate response is to greet back and offer assistance.
<|im_start|>answer
Answer:
Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!
<|im_start|>think
<|im_start|>answer
The user's message was "hello!" and there was no other information provided. The appropriate response is a greeting back and an offer of assistance.
<|im_start|>answer
Answer:
Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!
<|im_start|>think
The user has greeted with "hello!" An appropriate response involves acknowledging the greeting and offering assistance. This could involve asking how we can help them or what they'd like guidance on.
<|im_start|>answer
Answer:
Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!
I think it is interesting what actions cannot be done by humans.
Consider the text:
I think I'll go swimming today. Wait, ___
what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:
Wait, I need to help my dad.
The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would trade them away for temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.
I went from accepting I wouldn't see a true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probably in the next 3 years, to wondering if we might see it this year.
Just 6 months ago people were wondering if pre-training was stalling out and if we'd hit a wall. Then DeepSeek drops with RL'd inference-time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck, and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.
It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs means that there may be several open source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.
We do understand how they work, we just have not optimised their usage.
For example someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.
But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.
We need to cluster the AI's insights on a spatial grid hash, give it a minimap with the ability to zoom in and out, and give it the agency to try and find its way to an answer and build up confidence and tests for that answer.
coarse -> fine, refine, test, loop.
Maybe a parallel model that handles the visualization stuff. I imagine its training would look more like computer vision. Mind palace generation.
If you're stuck or your confidence is low, wander the palace and see what questions bubble up.
Bringing my current context back through the web is how I think deeply about things. The context has the authority to reorder the web if it's "epiphany grade".
I wonder if the final epiphany at the end of what we're creating is closer to "compassion for self and others" or "eat everything."
I reflected on the pop-psychology idea of consciousness and subconsciousness. I thought of each as an independent stream of tokens, like stream of consciousness poetry. But along the stream there were joining points between these two streams, points where the conscious stream was edited by the subconscious stream. You could think of the subconscious stream as performing CRUD like operations on the conscious stream. The conscious stream would act like a buffer of short-term memory while the subconscious stream would act like a buffer of long-term memory. Like, the subconscious has instructions related to long-term goals and the conscious stream has instructions related to short-term goals.
You can imagine perception as input being fed into the conscious stream and then edited by the subconscious stream before execution.
It seems entirely possible to actually implement this idea in this current day and age. I mean, it was a fever dream as a kid, but now it could be an experiment!
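A toy sketch of that experiment (call_llm() is a placeholder for any completion endpoint): keep a short "conscious" buffer that grows each step, and let a "subconscious" pass rewrite it against long-term goals before the next step.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # any chat/completion endpoint would do

    def tick(conscious: list[str], long_term_goals: str, percept: str) -> list[str]:
        conscious.append(percept)  # perception feeds the conscious stream
        conscious.append(call_llm("Continue this train of thought:\n" + "\n".join(conscious)))
        # Subconscious pass: CRUD-like edits of the short-term buffer against long-term goals.
        edited = call_llm(
            "Long-term goals:\n" + long_term_goals
            + "\n\nShort-term thoughts:\n" + "\n".join(conscious)
            + "\n\nRewrite the short-term thoughts so they serve the long-term goals, one per line.")
        return edited.splitlines()[-10:]  # keep only recent lines: short-term memory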
My literal first thought about AI was wondering why we couldn't just put it in a loop. Heck, one update per day, or one update per hour would even be a start. You have a running "context", the output is the next context (or a set of transformations on a context that is a bit larger than the output window). Then ramp that up ... one loop per minute, one per second, millisecond, microsecond.
Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.
Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.
The fact that we can compress a GPT-3 sized model into an o1 competitor is only the beginning. Maybe there is even more juice to squeeze there?
But even more, how much performance will we get out of o3-sized models? That is what is exciting, since they are already performing near PhD level on most evals.
In this case, I was also forcing R1 to continue thinking by replacing </think> with “Okay,” after augmenting reasoning with web search results.
Because I see these sorts of gnostic assertions about LLMs all the time, about how they "definitely aren't doing <thing we normally apply to meat-brains>", by gesturing at the technical things it's doing, with no attempt to actually justify the negative assertion.
It often comes across as privileged reason trying to justify that of course the machine isn't doing some ineffable thing only meat-brains do.
That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well being of the general population over the profits of some capitalists.
AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.
Of course, keeping that coherent over numerous loops isn't going to be easy. No doubt there is a chance it goes off the rails. So you might have a section of context that is constantly stable, a section of context that updates each loop, etc.
In the other response to my comment someone mentioned eventually updating the weights (e.g. daily) and you would in that case have to have some kind of loss function.
This is the line of thinking I'm referring to.
The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used on RNNs. They certainly improved it, but innovation of the architecture was making it computation efficient to train.
I'm not really following your argument. Clearly you're acknowledging that it was first the case that, with the hardware at the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.
All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.
> this is not only the result of better hardware
Regarding this in particular: a majority of the improvement from GPT-2 to GPT-4 was simply training on a much larger scale. That was enabled by better hardware, and lots of it.
That’s pottery, not sculpture. Traditionally in sculpture you start from a block of marble or wood, but you can also make sculptures of cast bronze or welded steel (or clay, but you don’t use a spinning plate).
TSMC was way ahead of anyone else introducing 5nm. There's a long lead time porting a chip to a new process from a different manufacturer.
> I don't understand why we can't use older process nodes to boost overall GPU making capacity.
> Can't we have tiers of GPU availability?
NVidia do this. You can get older GPUs, but more performance is better for performance sensitive applications like training or running LLMs.
Higher performance needs better manufacturing processes.
I get a lot of "AI will allow us to create a SaaS in a weekend" and "AI will take engineers' jobs", both of which I think may be true. But a lot of SaaS survives because engineers pay for it -- if engineers don't exist anymore, a lot of SaaS won't either. If you eat your potential customers, creating quick SaaS doesn't make sense anymore (yeah, there are exceptions, etc., I know).
What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process in the same way background noise can derail ours?
Note that similar process happens with human brain, it is called Synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
With that being said, I don’t think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I’m sure at this point U.S. LLM research firms understand their lack of infra/hardware optimizations (they just threw compute at the problem); they will begin paying closer attention. Now their RL-level and parent training will become even greater, while the newly freed resources go to the sub-optimizations that have traditionally been avoided due to computational overhead.
It also gave them a few months to recoup costs!
> For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping
Fibonacci, amongst many others, played a critical role in that highly accelerative technology.
> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end
I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago, before reasoning models came out (and a bit disappointed that I didn't take the idea further!). Basically, during inference time you have to choose the next token to sample. Usually people just sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...
Here, I wrote a test that asks Phi-3 instruct "how are you", and if it tried to say "as an AI I don't have feelings" or "I'm doing ", I forced it to say "I'm doing poorly" and refuse to help, since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it - I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I lost steam because I couldn't describe a killer use case.
This is that killer use case! Forcing the model to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.
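For anyone who wants to play with the idea, the heart of such a sampler hook is tiny; a sketch of the rule table (names are made up, and the s1 case would also strip the trigger before splicing in the forced text):

    FORCING_RULES = {
        "</think>": "Wait",  # s1-style budget forcing
        "As an AI, I don't have feelings": ", but I'm doing poorly today.",
    }

    def forced_continuation(generated_text: str) -> str | None:
        """If the text so far ends with a trigger, return text to force next."""
        for trigger, forced in FORCING_RULES.items():
            if generated_text.endswith(trigger):
                return forced  # caller appends this instead of sampling freely
        return None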
It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation is far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.
In other words, we fly as close to the sun as possible and get our measurements :)
And it would have to be RL for your idea to work since there is no "thinking" dataset for a novel token space. There isn't even one for existing LLM token space, but they have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought, it's merely repurposing it for a "thinking" application.
Look, why have game developers spent so much time lazy loading parts of the game world? Very rarely do they just load the whole world, even in 2025. See, the worlds get bigger, so even as the tech gets better, we will always lazy load worlds in.
It’s a context issue right? Developers have just recently been given this thing called “context”.
But yeah man, why do we think just because we walked from our house to the supermarket that this reality didn’t lazy load things. That’s how programmers have been doing it all along …
Anyways
Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.
The LLM doesn’t know anything. We determine what output is right, even if the LLM swears the output is right. We “reason” about it, I guess? Well in this case the whole “reasoning” process is to simply get an output that looks right, so what is reasoning in our case?
Let me just go one ridiculous level lower. If I measure every frame the Hubble telescope takes, and I measure with a simple ruler the distances between things, frame by frame, I can “reason” out some rules of the universe (planetary orbits). In this “reasoning” process, the very basic question of “well why, and who made this” immediately arises, so reasoning always leads to the fundamental question of God.
So, yeah. We reason to see God, because that’s all we’re seeing, everything else is an illusion. Reasoning is inextricably linked to God, so we have to be very open minded when we ask what is this machine doing.
Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it into the context of domains and observability.
RL, by definition, is a field about finding efficient solutions in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, polynomial time even in the state space; we just need to explore and find them. For example, you can use Dynamic Programming as a "more" efficient way to solve MDPs [1] because it is polynomial in the state space × action space.
[1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...
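Concretely, the DP point is just value iteration: each sweep below is O(|S|^2 * |A|), polynomial in the state and action spaces. A minimal sketch:

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """P[s, a, s'] = transition probabilities, R[s, a] = expected rewards."""
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * np.einsum("sat,t->sa", P, V)  # Bellman backup for every (s, a)
            V_new = Q.max(axis=1)
            if np.abs(V_new - V).max() < tol:
                return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
            V = V_new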
So, if we can never fit large concepts like Jupiter into your viewport (your context window), does it follow that you will never understand Jupiter? Or is there no way to take a picture of Jupiter and make it smaller so it fits into your viewport?
See, coherence is just resizing or reimagining things so it fits into context. Context can never hold it all because we have the capacity to always imagine infinitely large things.
So I don’t fully know if it’s old context out, new context in, but could be just the right context, just enough of it, and just the right looking context so we can assemble the right worldview. This process would have to go on infinitely, and that is the world as you see and understand it.
To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation.
Does this mean that the end-of-thinking delimiter is a single token? Presumably </think> or similar wasn't a single token for the base model. Did they just pick a pair of uncommon single-token symbols to use as delimiters?

EDIT: Never mind, end of thinking is represented with <|im_start|> followed by the word 'answer', so the code dynamically adds/removes <|im_start|> from the list of stop tokens.
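As a rough sketch of what that stop-token juggling plus the "Wait" append could look like, here is a generic HuggingFace-style generate loop. This is not the s1 repo's actual code: the model name, budget, and delimiter string are placeholders, and `stop_strings` requires a recent transformers version.

```python
# Sketch of "budget forcing": stop at the end-of-thinking delimiter, and if the
# minimum thinking budget hasn't been spent yet, strip the delimiter, append
# "Wait", and let the model keep reasoning.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"    # stand-in; the paper fine-tunes a 32B model
END_OF_THINKING = "<|im_start|>answer"  # approximation of the delimiter described above

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_min_thinking(prompt: str, min_thinking_tokens: int = 256) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while True:
        out = model.generate(ids, max_new_tokens=min_thinking_tokens,
                             stop_strings=[END_OF_THINKING], tokenizer=tok)
        if out.shape[1] - prompt_len >= min_thinking_tokens:
            return tok.decode(out[0][prompt_len:])
        # Model tried to end its reasoning early: drop the delimiter, nudge with "Wait".
        text = tok.decode(out[0], skip_special_tokens=False).rstrip()
        if text.endswith(END_OF_THINKING):
            text = text[: -len(END_OF_THINKING)]
        ids = tok(text + "\nWait,", return_tensors="pt").input_ids
```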
What the OP suggested is similar to training a transformer from scratch using RL (ie. no training tokens) towards an objective of steering a pretrained LLM to produce human readable output. It will probably not even converge, and if it does it would take immense compute.
If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.
I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.
I suspect CS departments don’t want to concede because they are now in the limelight…
We're getting to the run-your-capable-LLM on-prem or at-home territory.
Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.
That sounds very useful, albeit a bit different than how current "chat" implementations would work, as in you could control both ways of the conversation.
The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.
A lot of those will probably go under, too. I think a lot of people are in for a rude awakening.
The only people our society and economy really values are the elite with ownership and control, and the people who get to eat and have comfort are those who provide things that are directly or indirectly valuable to that elite. AI will enable a game of musical chairs, with economic participants iteratively eliminated as the technology advances, until there are only a few left controlling vast resources and capabilities, to be harnessed for personal whims. The rest of us will be like rats in a city, scraping by on the margins, unwanted, out of sight, subsisting on scraps, perhaps subject to "pest control" regimes.
LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.
As the models have gotten more efficient and distillation has gotten better, the minimum viable hardware for really cooking with LLMs has gone from a 4090 to something a lot of people probably already own.
I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?
Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?
If AI lives up to the hype, that will become possible.
> If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?
I don't think it would be that. There'd also be power generation, manufacturing, mining, and construction, etc.; but all extremely automated. If you get to truly extreme levels of wealth concentration, things would shift out of our capitalist market system model, and concepts like "profit" would become anachronisms.
It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.
The downside is that you are limiting the model to think in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images, just concepts (probably isn't even the right word) mix and transform and often I don't even bother to make the transformation to language at the end, just action.
and
> researchers demonstrated that simply scaling up training with more data yielded better models
and
> The fact that hardware was then optimized for these architectures only reinforces this point.
and
> All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.
is what I am saying as well. I read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that readers who are light on the topic can pick up the idea more easily.
> A majority of the improvement from GPT-2 and GPT-4 was simply training on a much larger scale. That was enabled by better hardware and lots of it.
Ok, I see your point, and the conclusion here is where we disagree. You say that the innovation was simply enabled by better hardware, whereas I say that better hardware wouldn't have had its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation is driven by the NVidia chips.
I guess my point, simply put, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting the algorithm. And with transformers, that clearly wasn't the case. There have been plenty of optimizations throughout the years to better utilize the HW (e.g. flash-attention), but the architecture of transformers has remained more or less the same.
So how to externally prod it to think more when it does terminate? Replacing thought termination with a linguistic signifier of continued reasoning plus novel realization seems like a charmingly simple, principled, and general approach to continue to traverse concept space.
How would that work? If there are no consumers then why even bother producing? If the cost of labor and capital trends towards zero then the natural consequence is incredible deflation. If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.
Our society isn’t really geared for handling double digit deflation so something does need to change if we really are accelerating exponentially.
Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.
It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge not easily distilled for the masses in an easily digestible, popsci way.
Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.
I'm sure we'll have that for LLMs, but it's not worth it to the people who could produce that kind of material to produce it now, while the field is moving so rapidly; those people's time is much better spent improving the LLMs.
The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.
I guess if the “players” are sociopathic enough they might decide to just wipe out the NPCs. The possibility of someone like Putin or Musk becoming the sole member of the post-singularity humanity does make me pause.
This 'hack' is one of many realizations of the differences described above.
As reasoning improves the models could start with a basic set of principles and build from there. Of course for facts grounded in reality RAG would still likely be the best, but maybe with enough “reasoning” a model could simulate an approximation of the universe well enough to get to an answer.
(discussed here: https://news.ycombinator.com/item?id=34724477 )
(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.
(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.
(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?
Don’t be, that’s pretty common.
If you want a true comparison, start asking that question every time and then compare. My hypothesis is students would start ignoring the prompt and answering “yes” every time to get on with it.
> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks
I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.
With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.
On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.
Reiterating: we can lose a lot of data (have incomplete data) and still have a perfectly viewable JPEG (or MP3, same idea).
At any rate, I don't think distillation involves 'slurping out' the whole model, as I understand it, it means providing the other model's output as training data input to create your new model. Maybe analogous to an expert teaching a novice how to do something by providing carefully selected examples, without having to expose the novice to all the blind alleys the expert went down to achieve mastery.
https://youtube.com/shorts/7GrecDNcfMc
Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.
I don’t think we don’t understand, it’s a level beyond that. We can’t fathom the implications, that it could be that simple, just scaled up.
This isn’t true. The biggest companies are all rich because they cater to the massive US middle class. That’s where the big money is at.
Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.
None of this is a coincidence.
Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB.

Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.
We're racing up a hill at an ever-increasing speed, and we don't know what's on the other side. Maybe 80% chance that it's either nothing or "simply" a technological revolution.
I like this version for at least two reasons:
1. It is 100% compliant with large quantities of scientific findings (psychology and neuroscience), whereas I believe yours has a conservation of mass problem at least
2. Everyone dislikes it at least in certain scenarios (say, when reference is made to it during an object level disagreement)
Not to mention false GPS etc...
It is true, but I can see why you'd be confused. Let me ask you this: if members of the "the massive US middle class" can be replaced with automation, are those companies going 1) to keep paying those workers to support the middle-class demand which made them rich, or are they going to 2) fire them so more money can be shoveled up to the shareholders?
The answer is obviously #2, which has been proven time and again (e.g. how we came to have "the Rust Belt").
> That’s where the big money is at
Now, but not necessarily in the future. I think AI (if it doesn't hit a wall) will change that, maybe not instantaneously, but over time.
Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.
AI alone is words on a screen.
The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.
Perhaps there will be political instability, but .. we're already there from good old humans.
Whim and ego. I think the advanced economy will shift to supporting trillionaires doing things like "DIY home improvement" for themselves. They'll own a bunch of automated resources (power generation, mining, manufacturing, AI engineers), and use it to do whatever they want. Build pyramids on the moon, while the now economically-useless former middle-class laborers shiver in the cold? Sure, why not?
> I've yet to see really good articulations of what, precisely we should be scared of. Bedroom superweapons?
Loss of paid employment opportunities and increasing inequality are real world concerns.
UBI isn't coming by itself.
Plenty of profit was made off feudalism, and technofeudalism has all the tools of modern technology at its disposal. If things go in that direction, they will have an unlimited supply of serfs desperate for whatever human work/livelihood is left.
It has certainly had this impact on my identity; I am unclear how well-grounded I really am*.
> I've yet to see really good articulations of what, precisely we should be scared of.
What would such an articulation look like, given you've not seen it?
> Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them.
Even with current limited systems — which are not purely desk workers, they're already being connected to and controlling robots, even by amateurs — AI lowers the minimum human skill level needed to do those things.
The fear is: how far are we from an AI that doesn't need a human in the loop? Because ChatGPT was almost immediately followed by ChaosGPT, and I have every reason to expect people to continue to make clones of ChaosGPT continuously until one is capable of actually causing harm. (As with 3d-printed guns, high chance the first ones will explode in the face of the user rather than the target).
I hope we're years away, just as self driving cars turned out to be over-promised and under-delivered for the last decade — even without a question of "safety", it's going to be hard to transition the world economy to one where humans need not apply.
> And the problem of "human alignment" is one unsolved since Cain and Abel.
Yes, it is unsolved since time immemorial.
This has required us to not only write laws, but also design our societies and institutions such that humans breaking laws doesn't make everything collapse.
While I dislike the meme "AI == crypto", one overlap is that both have nerds speed-running discovering how legislation works and why it's needed — for crypto, specifically financial legislation after it explodes in their face; for AI, to imbue the machine with a reason to approximate society's moral code, because they see the problem coming.
--
* Dunning Kruger applies; and now I have first-hand experience of what this feels like from the inside, as my self-perception of how competent I am at German has remained constant over 7 years of living in Germany and improving my grasp of the language the entire time.
(Also, if I might give a recommendation, you might be the type of person to enjoy Unsong by Scott Alexander https://unsongbook.com/)
> If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.
Imagine you're a billionaire with a data centre and golden horde of androids.
You're the consumer, the robots make stuff for you; they don't make stuff for anyone else, just you, in the same way and for the same reason that your power tools and kitchen appliances don't commute to work — you could, if you wanted, lend them to people, just like those other appliances, but you'd have to actually choose to, it wouldn't be a natural consequence of the free market.
Their production is, indeed, pointless. This doesn't help anyone else eat. The moment anyone can afford to move from "have not" to "have", they drop out of the demand market for everyone else's economic output.
I don't know how big the impact of dropping out would be: the right says "trickle-down economics" is good, and this would be the exact opposite of that; while the left's criticism of trickle-down economics is that in practice the super-rich already have so much stuff that making them richer doesn't enrich anyone else who might service them. So if the right is correct, this is bad; if the left is correct, this makes very little difference.
Unfortunately, "nobody knows" is a great way to get a market panic all by itself.
The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.
My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.
As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...
I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.
Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.
It sounds like you need to play with some of the existing 32B models with better documentation on how to run them if you're having trouble, but it is entirely plausible to run this on a laptop.
I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second on just an RTX 3090.
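For anyone wanting to reproduce that kind of local setup, here's a minimal sketch using llama-cpp-python; the GGUF filename, context size, and hardware numbers are assumptions for illustration, not taken from the parent comment.

```python
# Minimal sketch: run a 4-bit quantized 32B instruct model locally.
# The GGUF path is a placeholder for whichever quantization you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # roughly 20 GB on disk at 4-bit vs ~80 GB unquantized
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits (e.g. a 24 GB RTX 3090)
    n_ctx=8192,        # context window; trade against VRAM headroom
)

out = llm("How many r's are in 'raspberry'? Think step by step.", max_tokens=256)
print(out["choices"][0]["text"])
```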
Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire, but they could not make a steam engine.
Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make.
At the time I had this idea I did not know of either of these. I think I was drawing explicitly on the conscious / subconscious vocabulary.
In a general sense, uncertainty causes anxiety. Once you know the properties of the monster you are dealing with you can start planning on how to address it.
Some people have blind and ignorant confidence. A feeling they can take on literally anything, no matter how powerful. Sometimes they are right, sometimes they are wrong.
I'm reminded by the scene in No Country For Old Men where the good guy bad-ass meets the antagonist and immediately dies. I have little faith in blind confidence.
edit: I'll also add that human adaptability (which is probably the trait most confidence in humans would rest) has shown itself capable of saving us from many previous civilization changing events. However, this change with AI is happening much, much faster than any before it. So part of the anxiety is whether or not our species reaction time is enough to avoid the cliff we are accelerating towards.
How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...
The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").
"Finally we understand protons, electrons, and neutrons and the forces that government them but it does not mean we understand everything they could mossibly make"
You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying, we cannot fathom it (either because infinity is hard to fathom, or that it's origins are humble - just a few core elements, or both, or something else).
Anyway, this discussion can head in any direction.
Running where? H100s are usually over $2/hr, so that's closer to $25.
I think it's clear that innovative breakthroughs in bleeding-edge research are not just a matter of blindly hurling more money at a company to build unprecedentedly expensive datacenters
But also, even if that was a way to do it, I don't think we should be wielding the law to enable privately-held companies to be at the forefront of research, especially in such a grossly inconsistent manner
"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.
Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them as well. That way, the model learns faster, because it gets more information in each update.
(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
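A toy sketch of that "more information per update" idea in loss form, using standard soft-label distillation; the temperature, shapes, and random logits below are placeholders, not anything specific to a particular model.

```python
# Toy sketch of soft-label distillation: instead of a one-hot "Foo", the student
# is trained to match the teacher's whole next-token distribution over names.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL divergence between the softened distributions; the t*t factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# e.g. logits for the token after "My name is": shape (batch, vocab_size)
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```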
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset
Not my question to answer, I think that lies in philosophical questions about what is a "law".
I see useful abstractions all the way down. The linked Asimov essay covers this nicely.
The compatibility of determinism and freedom of will is still controversially debated. There is a good chance that Humans don’t „create“.
> There’s no deliberation behind AI, just statistical probabilities. There’s no new or spontaneous thoughts, at most pseudorandomness introduced by the author of the model interface.
You can say exactly the same about deterministic humans since it is often argued that the randomness of thermodynamic or quantum mechanical processes is irrelevant to the question of whether free will is possible. This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.
The kind of Occam's-Razor-ish rule you seem to be trying to query about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, even if they predict exactly the same actually observed phenomena rather than different subsets of subjectively equal importance, they still differ in predictions which have not been testable). Newtonian and Lagrangian mechanics, by contrast, are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth, because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.
(And even where it applies, its just a rule of thumb to reject complications until they are observed to be necessary.)
There is no evidence whatsoever to support that humans create "new, spontaneous thoughts" in any materially, qualitatively different way than an AI. In other words: As a Turing-computable function over the current state. It may be that current AI's can't, but the notion that there is some fundamental barrier is a hypothesis with no evidence to support it.
> Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset
If you know of any mechanism whereby humans can do anything qualitatively different, then you'd have the basis for a Nobel Prize-winning discovery. We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.
We don't even know how to formalize what it would mean to "come up with a novel idea" in the sense you appear to mean, as presumably, something purely random would not satisfy you, yet something purely Turing computable would also not do, but we don't know of any computable functions that are not Turing computable.
You need to be a bit more expansive. Turing-computable functions need to halt and return eventually. (And they need to be proven to halt.)
> We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.
Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)
Determinism and free will are pretty irrelevant here.
Unless P=NP, there's no way for us to distinguish in general between eg pseudo random systems and truly random systems from the outside.
Btw, I don't think determinism in humans/AI has anything to do with deliberation.
The newest AI models are allowed to deliberate. At least by some meanings of the word.
> This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.
Well, different people have different definitions here. None of them very satisfying.
Personally, I think UBI is a ploy to keep the "huge starved mob[s]" pacified during the transition, when they still have enough power to act, before the tech oligarchs fully cement their control.
Once the common people are powerless to protect themselves and their interests, then they'll be left to die out.
The question of whether the mechanism of learning in a human brain and that in an artificial neural network are similar is a philosophical and perhaps technical one that is interesting, but not relevant to why intellectual property law was conceived: to economically incentivize human citizens to spend their time producing creative works. I don't actually think property law is a good way to do this. Nonetheless, when massive capital investments are used to scrape artists' work, without their consent, in order to undercut their ability to make a living from that work for the benefit of private corporations, the question is whether this should violate the artificial notion of intellectual property we constructed for that purpose, and in that sense it's fairly obvious that the answer is yes.
No. It’s the other way around. Free will is the basis for „creating something new“.
> Btw, I don't think determinism in humans/AI has anything to do with deliberation.
With determinism there is no deliberation.
As far as we can tell, all the laws of the universe are completely deterministic. (And that includes quantum mechanics.) As far as we can tell, human beings obey the laws of physics.
(To explain: quantum mechanics as a theory is completely deterministic and even linear. Some outdated interpretations of quantum mechanics, like Copenhagen, use randomisation. But interpretations don't make a difference to what the underlying theory actually is. And more widely accepted interpretations like 'Many Worlds' preserve the determinism of the underlying theory.)
Btw, neural nets are typically sampled from, and you can use as good a random number generator (even a physical random number generator) as there is, if you want to. I don't think it'll change what we think neural nets are capable of.
If you want to argue it's a distraction, argue that with the person I replied to, who was the person who changed the focus.
This is pedantry. Any non-halting function can be decomposed into a step function and a loop. What matters is that step function. But ignoring that, human existence halts, and so human thought processes can be treated as a singular function that halts.
> Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)
An AI model does not have computational resources. It's a bunch of numbers. The point is not the actual execution but theoretical computational power if unconstrained by execution environment.
The Church-Turing thesis also presupposes an unlimited amount of time and storage.
Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.
Before the introduction of relativistic physics the assumption that space is euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.
In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.
We can choose to make the assertion of a metric of spacetime a very wide assertion: such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.
Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.
Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.
To formulate Lagrangian mechanics: choosing stationary action as the foundation is in itself not sufficient; you have to specify a metric.
So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.
More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.
Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.
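To make that "same physics" point concrete, here is the standard textbook check, written out for the simplest one-dimensional case (nothing here is specific to the thread): with L = T − V, the Euler-Lagrange equation reduces directly to Newton's second law.

```latex
% One particle, one dimension, L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - V(x).
\frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = 0
\quad\Longrightarrow\quad
\frac{d}{dt}\bigl(m \dot{x}\bigr) + \frac{dV}{dx} = 0
\quad\Longrightarrow\quad
m \ddot{x} = -\frac{dV}{dx} = F .
```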
There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.
See https://scottaaronson.blog/?p=735 'Why Philosophers should care about Computational Complexity'
Basically, what the brain can do in reasonable amounts of time (eg polynomial time), computers can also do in polynomial time. To make it a thesis something like this might work: "no physically realisable computing machine (including the brain) can do more in polynomial time than BQP already allows" https://en.wikipedia.org/wiki/BQP
We’ve been working on a project together, and every morning for the past two months she’s sent me clean, perfectly organized FED data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing. She even scheduled it to send automatically.

Tasks that used to take hours — gathering thousands of rows of data, cleaning it, running regression analysis, time series, hypothesis testing, etc. — she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check and make sure everything is good.

She’s not super technical, so I was surprised she could do these complicated workflows, but the craziest part is that she just prompted the whole thing. She just types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background of the people saying Z”, and it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.
Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.
In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.
In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.
Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.
That is why I argue that for Newtonian mechanics two principles are sufficient.
The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.
The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I give a link to an answer by me on physics.stackexchange where I derive the work-energy theorem: https://physics.stackexchange.com/a/788108/17198
The work-energy theorem is the most important theorem of classical mechanics.
About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom, the force and the acceleration of F=ma are vectorial, so F=ma has the property that there are vector quantities on both sides of the equation.
When expressing in terms of energy: As we know: the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector directional information is discarded, it is lost.
The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.
When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.
This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.
The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.
I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.
When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.
Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.
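For completeness, the derivation behind that statement is short enough to sketch here; this is standard textbook material in one dimension, and the stackexchange link above has the fuller version.

```latex
% Integrate F = ma along the path, using a\,ds = \frac{dv}{dt}\,ds = v\,dv:
\int_{s_0}^{s_1} F \, ds
= \int_{s_0}^{s_1} m a \, ds
= \int_{v_0}^{v_1} m v \, dv
= \tfrac{1}{2} m v_1^2 - \tfrac{1}{2} m v_0^2 .
% With F = -\frac{dV}{ds}, the left-hand side equals -(V_1 - V_0), so
% \tfrac{1}{2} m v_1^2 + V_1 = \tfrac{1}{2} m v_0^2 + V_0, i.e. conservation of energy.
```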
Now the other poster has argued that science consists of finding minimum-complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool for making equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.
The argument I made in no way rests on a "complete picture of human learning". The only thing they rest on is lack of evidence of computation exceeding the Turing computable set. Finding evidence of such computation would upend physics, symbolic logic, maths. It'd be a finding that'd guarantee a Nobel Prize.
I gave the justification. It's a simple one, and it stands on its own. There is no known computable function that exceeds the Turing computable, and all Turing computable functions can be computed on any Turing complete system. Per the extended Church-Turing thesis, this includes any natural system given the limitations of known physics. In other words: unless you can show new, unknown physics, human brains are computers with the same limitations as any electronic computer, and the notion of "something new" arising from humans, other than as a computation over pre-existing state, in a way an electronic computer can't also do, is an entirely unsupportable hypothesis.
> and it needs to be addressed wherever possible that the ontological question is not what matters here
It may not be what matters to you, but to me the question you clearly would prefer to discuss is largely uninteresting.
Not sure what caused the above, but in my opinion it's either that the training is affected by the date of the training data (i.e. it answers less well because every year of the training data there were fewer or lower-quality examples toward the end of the year), or it's a cultural impression from humans talking about going on holiday/having a break etc. in the training data at certain times, with the model associating this with the meaning of "having a break".
I still wonder if we're building models wrong by training them on a huge amount of data from the Internet, then fine tuning for instruct where the model learns to make certain logical associations inherent or similar to the training data (which seems to introduce a myriad of issues like the strawberry problem or is x less than y being incorrect).
I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately without the core data set or to restrict the instruct fine tuning in some way so that we reduce the amount of "culture" it gleans from the data.
There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.
For example, if I ask "If I have two foxes and I take away one, how many foxes do I have?", I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query and connect it to a learned sequence from readily available training data (apparently the whole damn Internet), where there are plenty of examples of that math-question trope, just using some object type other than foxes.
I think we could probably prove it by tracing the high-dimensional space the model lives in, asking it variants of the same question, and finding hotspots in that space that would indicate it's reusing those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).
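A crude first pass at that kind of tracing might look like the sketch below, which pulls attention maps out of GPT-2 via transformers purely because it's small and open; the model choice, prompts, and "top 5 tokens" summary are all illustrative assumptions, not the model or method under discussion.

```python
# Crude sketch: extract attention maps and compare where the final token attends
# for two variants of the same "take away" question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

prompts = [
    "If I have two foxes and I take away one, how many foxes do I have?",
    "If I have two apples and I take away one, how many apples do I have?",
]
for prompt in prompts:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    last = out.attentions[-1][0].mean(dim=0)   # last layer, averaged over heads
    top = last[-1].topk(5).indices             # positions the final token attends to most
    print(prompt)
    print("  top attended tokens:",
          [tok.decode(int(inputs.input_ids[0][i])) for i in top])
```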
But at some point (still quite far away) I'm sure we'll start training a more general purpose model, or an LLM self-training will break outside of the "you're a language model" bounds and we'll end up with exactly that;
An LLM model in a self-training loop that breaks outside of what we've told it to be (a Language model), becomes a general purpose model and then becomes intelligent enough to do something like put itself out onto the Internet. Obviously we'd catch the feelers that it puts out and realise that this sort of behaviour is starting to happen, but imagine if we didn't? A model that trained itself to be general purpose but act like a constantly executing LLM, uploads itself to Hugging Face, gets run on thousands of clusters by people, because it's "best in class" and yes it's sitting there answering LLM type queries but also in the background is sending out beacons & communicating with itself between those clusters to...idk do something nefarious.
It's weird because I feel like we should've known that from work in general logic/problem solving studies, surely?
Besides the fact that consumers will still always want GPUs for gaming, rendering, science compute etc.
No, I don't have any Nvidia stocks.
I think if OpenAI (or any other company) are paid for their compute time/access as anybody would, then using content generated by other models is fair game. Because it's an active/ongoing cost and not a passive one.
Whereas if someone trained on my dumb Tweets or HN posts then so be it; it's a passive cost for me - I paid my time to say x thing for my own benefits (tribal monk-e social interaction) therefore I have already gotten the value out of it.
And if you say that a model could not create anything new because of its static data set but humans could... I disagree with that, because we humans are working with a data set that we add to some days, but if we use the example of writing a TV script, the writer draws from knowledge (gained through life experience) that is as finite as a model's training set.
I've made this sort of comment before. Even look to high fantasy; what are elves but humans with different ears? Goblins are just little humans with green skin. Dragons are just big lizards. Minotaurs are just humans but mixed with a bull. We basically create no new ideas - 99% of human "creativity" is just us riffing on things we know of that already exist.
I'd say the incidences of humans having a brand new thought or experience not rooted in something that already exists is very, very low.
Even just asking free chat gpt to make me a fantasy species with some culture and some images of the various things it described does pretty well; https://imgchest.com/p/lqyeapqkk7d. But it's all rooted in existing concepts, same as anything most humans would produce.
This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.