S1: A $6 R1 competitor?

(timkellogg.me)
851 points tkellogg | 123 comments
1. mtrovo ◴[] No.42951263[source]
I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how many low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science boil down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?
replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #
2. nyoomboom ◴[] No.42951704[source]
I think a skill here is learning a bias for experimentation and accepting the results one finds. Also, the book "Why Greatness Cannot Be Planned" showcases the kind of open-ended play that results in people discovering stuff like this.
3. cubefox ◴[] No.42951764[source]
Now imagine where we'll be 12 months from now. This article from February 5, 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursive self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems set to arrive within the next few years. Terrifying.
replies(2): >>42952687 #>>42955196 #
4. koala_man ◴[] No.42951829[source]
It feels like we're back in 1900 when anyone's clever idea (and implementation) can give huge performance improvements, such as Ford's assembly line and Taylor's scientific management of optimizing shovel sizes for coal.
replies(1): >>42955744 #
5. gom_jabbar ◴[] No.42952687[source]
Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

  [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.
> reasoning models which do AI research

In the introduction to my research project on Accelerationism [0], I write:

  Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]
At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

[0] https://retrochronic.com/

[1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.

replies(2): >>42957256 #>>42959107 #
6. xg15 ◴[] No.42953577[source]
I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with less parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #
7. ascorbic ◴[] No.42954518[source]
I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.
replies(2): >>42954757 #>>42959520 #
8. lionkor ◴[] No.42954757[source]
Semantically, wait is a bit of a stop-and-breathe point.

Consider the text:

I think I'll go swimming today. Wait, ___

what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

Wait, I need to help my dad.

replies(1): >>42960020 #
9. zoogeny ◴[] No.42955196[source]
This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

I went from accepting I wouldn't see a true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

Just 6 months ago people were wondering if pre-training was stalling out and if we'd hit a wall. Then DeepSeek drops with RL'd inference-time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck, and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs means that there may be several open source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.

replies(2): >>42956212 #>>42963164 #
10. teruakohatu ◴[] No.42955228[source]
> still have no real comprehensive understanding how the models work.

We do understand how they work, we just have not optimised their usage.

For example, someone who has a good general understanding of how an ICE or EV car works can figure out how to drive any car within a couple of minutes, even if the user interface is very unfamiliar.

But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

replies(3): >>42955842 #>>42955941 #>>42962716 #
11. andrewfromx ◴[] No.42955744[source]
yes, it also feels like we are going to lose our just-in-time global shipments of anything to anywhere any day now. It will soon feel like 1900 in other ways.
replies(2): >>42958710 #>>42962278 #
12. spiorf ◴[] No.42955842{3}[source]
We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.
replies(2): >>42958701 #>>43000550 #
13. gessha ◴[] No.42955941{3}[source]
Your example is somewhat inadequate. We _fundamentally_ don't understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing "Hmm" to "Wait" and seeing what happens.

Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.

replies(2): >>42959322 #>>42960342 #
14. palmotea ◴[] No.42956212{3}[source]
> The intelligence that will be available to the average technically literate individual will be frightening.

That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well being of the general population over the profits of some capitalists.

AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

replies(3): >>42956628 #>>42960326 #>>42963042 #
15. cyanydeez ◴[] No.42956436[source]
It's fascinating how certain political movements avoid that "Wait" moment...
16. kevin009 ◴[] No.42956535[source]
There are more than 10 different ways that I know for sure will improve LLMs just like `wait`. It is part of the CoT. I assume most researchers know this. CoT is as old as 2019.
replies(2): >>42967063 #>>42967235 #
17. 101008 ◴[] No.42956628{4}[source]
I agree with you and I am scared. My problem is: if most people can't work, who is going to pay for the products/services created with AI?

I get a lot of "AI will allow us to create SaaS in a weekend" and "AI will take engineers' jobs", which I think may both be true. But a lot of SaaS survives because engineers pay for it -- if engineers don't exist anymore, a lot of SaaS won't either. If you eat your potential customers, creating quick SaaS doesn't make sense anymore (yeah, there are exceptions, etc., I know).

replies(2): >>42957011 #>>42959261 #
18. lostmsu ◴[] No.42956674[source]
Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, if I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.

What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process in the same way background noise can derail ours?
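For what it's worth, a minimal sketch of that padding trick, assuming an OpenAI-compatible chat endpoint; the base URL and model id are hypothetical placeholders, not the actual prototype described above:

  from openai import OpenAI

  # Sketch only: any OpenAI-compatible server will do; the model id is made up.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  PAD_INSTRUCTION = (
      "Before answering, output roughly 300 whitespace characters "
      "(spaces and newlines only), then give your answer. "
      "The whitespace will never be read aloud."
  )

  resp = client.chat.completions.create(
      model="local-8b-instruct",  # hypothetical model id
      messages=[
          {"role": "system", "content": PAD_INSTRUCTION},
          {"role": "user", "content": "Is free will compatible with determinism?"},
      ],
  )
  print(resp.choices[0].message.content.lstrip())  # strip the padding before TTS

The whitespace buys the model extra forward passes before the part of the answer that actually gets spoken.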

19. pertymcpert ◴[] No.42956999[source]
For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
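As a toy illustration of that point, a sketch of symmetric per-tensor int8 quantization (illustrative only, not any particular library's scheme):

  import numpy as np

  def quantize_int8(w):
      # Symmetric per-tensor quantization: store one fp32 scale plus int8 weights.
      scale = np.abs(w).max() / 127.0
      q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return q.astype(np.float32) * scale

  w = np.random.randn(4096, 4096).astype(np.float32)  # an fp32 weight matrix
  q, scale = quantize_int8(w)                          # ~4x smaller in memory
  err = np.abs(w - dequantize(q, scale)).mean()        # small reconstruction error
  print(f"mean abs error: {err:.5f}")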
20. ZeljkoS ◴[] No.42957002[source]
We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.

Note that a similar process happens with the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
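Roughly, the procedure from that paper looks like this (a simplified sketch of iterative magnitude pruning, where `train` is a stand-in for a full training loop, not the authors' exact code):

  import numpy as np

  def find_winning_ticket(init_weights, train, prune_fraction=0.2, rounds=5):
      # Train, drop the smallest surviving weights, rewind the rest to their
      # original initialization, and repeat.
      mask = {name: np.ones_like(w) for name, w in init_weights.items()}
      for _ in range(rounds):
          trained = train(init_weights, mask)  # `train` is a placeholder
          for name, w in trained.items():
              alive = np.abs(w)[mask[name] == 1]
              cutoff = np.quantile(alive, prune_fraction)
              mask[name] = np.where((np.abs(w) < cutoff) & (mask[name] == 1),
                                    0.0, mask[name])
      return mask  # the "winning ticket": init_weights restricted to this mask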

replies(2): >>42959347 #>>42965862 #
21. immibis ◴[] No.42957011{5}[source]
Those people will simply be surplus to requirements. They'll be left alone as long as they don't get in the way of the ruling class, and disposed of if they do. As usual in history.
replies(1): >>42959650 #
22. versteegen ◴[] No.42957256{3}[source]
Nice. Though I couldn't understand those "opening lines" until I read in your Introduction:

> For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping

Fibonacci, amongst many others, played a critical role in that highly accelerative technology.

23. deadbabe ◴[] No.42957820[source]
I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.
24. rgovostes ◴[] No.42957909[source]
> a branch of computer science

It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation are far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.

replies(2): >>42958732 #>>42966331 #
25. BobbyTables2 ◴[] No.42958693[source]
May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest to not seek+publish such findings.

If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.

replies(1): >>42958891 #
26. Valgrim ◴[] No.42958701{4}[source]
It feels less like a word-prediction algorithm and more like a world-model compression algorithm. Maybe we tried to create one and accidentally created the other?
replies(2): >>42960470 #>>42962374 #
27. BobbyTables2 ◴[] No.42958710{3}[source]
We’ll have to raise our own chickens too…
28. BobbyTables2 ◴[] No.42958732[source]
Almost seems more like computer engineering. Is it really that different than signal/image processing?

I suspect CS departments don’t want to concede because they are now in the limelight…

29. philipswood ◴[] No.42958891[source]
I think NVIDIAs future is pretty bright.

We're getting to the run-your-capable-LLM on-prem or at-home territory.

Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.

https://www.nvidia.com/en-us/project-digits/

replies(1): >>42959619 #
30. pizza ◴[] No.42959107{3}[source]
Hope we get the Nick Land the younger, and not Nick Land the elder, set of outcomes. Somewhere, sometime along the way, it seems like everything from the CCRU and Duginism leapt out of the page into the real. Maybe it's just the beginning of the Baudrillardian millennium.
31. MR4D ◴[] No.42959159[source]
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.

replies(4): >>42959654 #>>42963668 #>>42966553 #>>43000430 #
32. palmotea ◴[] No.42959261{5}[source]
> My problem is: if most people can't work, who is going to pay for the product/services created with IA?

A lot of those will probably go under, too. I think a lot of people are in for a rude awakening.

The only people our society and economy really values are the elite with ownership and control, and the people who get to eat and have comfort are those who provide things that are directly or indirectly valuable to that elite. AI will enable a game of musical chairs, with economic participants iteratively eliminated as the technology advances, until there are only a few left controlling vast resources and capabilities, to be harnessed for personal whims. The rest of us will be like rats in a city, scraping by on the margins, unwanted, out of sight, subsisting on scraps, perhaps subject to "pest control" regimes.

replies(2): >>42960870 #>>42962124 #
33. brookst ◴[] No.42959322{4}[source]
Isn't that just scale? Even small LLMs have more parts than any car.

LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.

replies(2): >>42959929 #>>42961952 #
34. 3abiton ◴[] No.42959347{3}[source]
So more 'mature' models might arise in the near future with less params and better benchmarks?
replies(3): >>42960280 #>>42960288 #>>42961518 #
35. katzenversteher ◴[] No.42959520[source]
I bet a token like "sht!", "f*" or "damn!" would have the same or an even stronger effect, but the LLM creators would not like users to read them.
replies(3): >>42959617 #>>42960035 #>>42960519 #
36. lodovic ◴[] No.42959617{3}[source]
I think you're onto something. However, as the training is done on text and not actual thoughts, it may take some experimentation to find these stronger words.
37. Arn_Thor ◴[] No.42959619{3}[source]
Except I can run R1 1.5b on a GPU-less and NPU-less Intel NUC from four or five years ago using half its cores, and the reply speed is…functional.

As the models have gotten more efficient and distillation has gotten better, the minimum viable hardware for really cooking with LLMs has gone from a 4090 to something a lot of people probably already own.

I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.
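For anyone wanting to try the same, a minimal CPU-only sketch with plain transformers; the model id is assumed to be the 1.5B R1 distill as published on the Hub, and the settings are just an example, not a recommendation:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed Hub id
  tok = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id)   # fp32 on CPU by default

  messages = [{"role": "user", "content": "Why is the sky blue?"}]
  inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                   return_tensors="pt")
  out = model.generate(inputs, max_new_tokens=256)
  print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))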

replies(2): >>42963040 #>>43000736 #
38. lodovic ◴[] No.42959650{6}[source]
That's a fallacy. You can't have an advanced economy with most people sitting on the side. Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?
replies(2): >>42959792 #>>42963522 #
39. umeshunni ◴[] No.42959654{3}[source]
> in that a distilled model of an LLM is like a JPEG of a photo

That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLMs as a compressed version of the training data.

replies(3): >>42960472 #>>42961599 #>>42962196 #
40. palmotea ◴[] No.42959792{7}[source]
> You can't have an advanced economy with most people sitting on the side.

If AI lives up to the hype, that will become possible.

> If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

I don't think it would be that. There'd also be power generation, manufacturing, mining, and construction, etc.; but all extremely automated. If you get to truly extreme levels of wealth concentration, things would shift out of our capitalist market system model, and concepts like "profit" would become anachronisms.

It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.

replies(1): >>42960350 #
41. ChymeraXYZ ◴[] No.42959929{5}[source]
Could be, but it does not change the fact that we do not understand them as of now.
42. ascorbic ◴[] No.42960020{3}[source]
Yes, R1 seems to mostly use it like that. It's either to signal a problem with its previous reasoning, or if it's thought of a better approach. In coding it's often something like "this API won't work here" or "there's a simpler way to do this".
replies(1): >>43000689 #
43. ascorbic ◴[] No.42960035{3}[source]
Maybe, but it doesn't just use it to signify that it's made a mistake. It also uses it in a positive way, such as it's had a lightbulb moment. Of course some people use expletives in the same way, but that would be less common than for mistakes.
44. raducu ◴[] No.42960280{4}[source]
"Better", but not better than the model they were distilled from, at least that's how I understand it.
replies(1): >>42962035 #
45. andreasmetsala ◴[] No.42960288{4}[source]
They might also be more biased and less able to adapt to new technology. Interesting times.
46. andreasmetsala ◴[] No.42960326{4}[source]
> AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

How would that work? If there are no consumers then why even bother producing? If the cost of labor and capital trends towards zero then the natural consequence is incredible deflation. If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Our society isn’t really geared for handling double digit deflation so something does need to change if we really are accelerating exponentially.

replies(2): >>42963290 #>>42964429 #
47. raducu ◴[] No.42960342{4}[source]
> _fundamentally_ don’t understand how deep learning systems works.

It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge not easily distilled for the masses in an easily digestible popsci way.

Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.

I'm sure we'll have that for LLMs, but it's not worth it to the people who can produce that kind of material to produce it now when the field is moving so rapidly; those people's time is much better spent improving the LLMs.

The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.

replies(2): >>42961916 #>>42965302 #
48. andreasmetsala ◴[] No.42960350{8}[source]
> It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.

I guess if the “players” are sociopathic enough they might decide to just wipe out the NPCs. The possibility of someone like Putin or Musk becoming the sole member of the post-singularity humanity does make me pause.

replies(1): >>42960579 #
49. tomaskafka ◴[] No.42960400[source]
One thing to realize is that we as humans have thinking steps (internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

The above 'hack' is one of many realizations of the above differences.

50. codeulike ◴[] No.42960464[source]
Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?
replies(3): >>42961113 #>>42962970 #>>42963406 #
51. codeulike ◴[] No.42960470{5}[source]
Its almost like a Model of Language, but very Large
52. kedarkhand ◴[] No.42960472{4}[source]
Well, a JPEG can be thought of as a compression of the natural world of which the photograph was taken
replies(1): >>42962058 #
53. raducu ◴[] No.42960519{3}[source]
It's literally in the article: they measured it, and "Wait" was the best token.
54. cubefox ◴[] No.42960579{9}[source]
That's assuming people like Altman can keep artificial superintelligence under human control. It very well may escape control and humanity would be disempowered forever. Or worse, wiped out.
55. nazgul17 ◴[] No.42960870{6}[source]
This is the same conclusion I can't help but reach. I would love nothing more than to be convinced that there is a chance that that is not going to happen.
56. gield ◴[] No.42961113[source]
Yes, that's explicitly mentioned in the blog post:

>In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
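A rough sketch of that forcing loop, assuming a HuggingFace-style generate API and glossing over chat-template and special-token details (the tag strings come from the quote above; the loop structure is my guess at it, not the s1 code):

  def generate_with_budget_forcing(model, tok, prompt, min_think_tokens=1024):
      # Whenever the model tries to close its thinking block too early,
      # swap the closing tag for "Wait" and let it keep going.
      text = prompt + "<think>"
      while True:
          ids = tok(text, return_tensors="pt")
          out = model.generate(**ids, max_new_tokens=256)
          text = tok.decode(out[0], skip_special_tokens=False)
          thought = text.split("<think>", 1)[1]
          n_thought = len(tok(thought).input_ids)
          if "</think>" in thought and n_thought < min_think_tokens:
              text = text.replace("</think>", "Wait", 1)   # force more thinking
          elif "</think>" in thought or n_thought >= min_think_tokens:
              return text                                   # budget satisfied
          # otherwise the model is still thinking; loop and keep generating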

57. coder543 ◴[] No.42961518{4}[source]
That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.
58. homarp ◴[] No.42961599{4}[source]
hence https://www.newyorker.com/tech/annals-of-technology/chatgpt-... (by Ted Chiang)

(discussed here: https://news.ycombinator.com/item?id=34724477 )

59. ozgune ◴[] No.42961717[source]
Agreed. Here are three things that I find surreal about the s1 paper.

(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?

replies(1): >>42964022 #
60. gessha ◴[] No.42961916{5}[source]
As a person who has trained a number of computer vision deep networks, I can tell you that we have some cool-looking visualizations on how lower layers work but no idea how later layers work. The intuition is built over training numerous networks and trying different hyperparameters, data shuffling, activations, etc. it’s absolutely brutal over here. If the theory was there, people like Karpathy who have great teacher vibes would’ve explained it for the mortal grad students or enthusiast tinkerers.

> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks

I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.

replies(1): >>42962093 #
61. gessha ◴[] No.42961952{5}[source]
You can make a bigger ICE engine (like a container ship engine) and still understand how the whole thing works. Maybe there’s more parts moving but it still has the structure of an ICE engine.

With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.

On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.

replies(1): >>42962060 #
62. salemba ◴[] No.42962035{5}[source]
I think this is how the "child brain" works too. The better the parents and the environment are, the better the child's development is :)
replies(1): >>43015969 #
63. bloomingkales ◴[] No.42962058{5}[source]
And we can answer the question of why quantization works with the lossy-format analogy: quantization just trades accuracy for space but still gives us a good-enough output, just like a lossy JPEG.

To reiterate: we can lose a lot of data (have incomplete data) and still have a perfectly viewable JPEG (or MP3, same thing).

64. ijk ◴[] No.42962060{6}[source]
One way I've heard it summarized: Computer Science as a field is used to things being like physics or chemistry, but we've suddenly encountered something that behaves more like biology.
replies(1): >>42962185 #
65. bloomingkales ◴[] No.42962093{6}[source]
It may be as simple as this:

https://youtube.com/shorts/7GrecDNcfMc

Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.

I don’t think we don’t understand, it’s a level beyond that. We can’t fathom the implications, that it could be that simple, just scaled up.

replies(1): >>42965342 #
66. kortilla ◴[] No.42962124{6}[source]
> The only people our society and economy really values are the elite with ownership and control

This isn’t true. The biggest companies are all rich because they cater to the massive US middle class. That’s where the big money is at.

replies(1): >>42963146 #
67. timschmidt ◴[] No.42962196{4}[source]
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
replies(1): >>42964657 #
68. eru ◴[] No.42962278{3}[source]
Hope we don't get 1914 again, too.
69. bloomingkales ◴[] No.42962374{5}[source]
Why would asking a question about ice cream trigger a consideration about all possible topics? As in, to formulate the answer, the LLM will consider the origin of Elephants even. It won’t be significant, but it will be factored in.

Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.

None of this is a coincidence.

70. adamc ◴[] No.42962716{3}[source]
The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.
71. luc4sdreyer ◴[] No.42962970[source]
Yes, that's one of the tricks.
72. red1reaper ◴[] No.42963026{8}[source]
"God" as a concept in unproven to exist, it is also impossible to prove, so for all intents and porpouses it doesn't exist.
replies(1): >>42963535 #
73. nickthegreek ◴[] No.42963040{4}[source]
R1 1.5b won’t do what most people want at all.
replies(1): >>42965746 #
74. luc4sdreyer ◴[] No.42963042{4}[source]
That is assuming the accelerating AI stays under human control.

We're racing up a hill at an ever-increasing speed, and we don't know what's on the other side. Maybe 80% chance that it's either nothing or "simply" a technological revolution.

75. palmotea ◴[] No.42963146{7}[source]
> This isn't true. The biggest companies are all rich because they cater to the massive US middle class.

It is true, but I can see why you'd be confused. Let me ask you this: if members of "the massive US middle class" can be replaced with automation, are those companies going 1) to keep paying those workers to support the middle-class demand which made them rich, or 2) to fire them so more money can be shoveled up to the shareholders?

The answer is obviously #2, which has been proven time and again (e.g. how we came to have "the Rust Belt").

> That’s where the big money is at

Now, but not necessarily in the future. I think AI (if it doesn't hit a wall) will change that, maybe not instantaneously, but over time.

replies(2): >>42963733 #>>43042399 #
76. pjc50 ◴[] No.42963164{3}[source]
This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world. I've yet to see really good articulations of what, precisely we should be scared of.

Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.

AI alone is words on a screen.

The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.

Perhaps there will be political instability, but .. we're already there from good old humans.

replies(4): >>42963468 #>>42964183 #>>42965461 #>>43000641 #
77. palmotea ◴[] No.42963290{5}[source]
> How would that work? If there are no consumers then why even bother producing?

Whim and ego. I think the advanced economy will shift to supporting trillionaires doing things like "DIY home improvement" for themselves. They'll own a bunch of automated resources (power generation, mining, manufacturing, AI engineers), and use it to do whatever they want. Build pyramids on the moon, while the now economically-useless former middle-class laborers shiver in the cold? Sure, why not?

78. danans ◴[] No.42963406[source]
Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.
79. danans ◴[] No.42963468{4}[source]
> without grounding in the real world.

> I've yet to see really good articulations of what, precisely we should be scared of. Bedroom superweapons?

Loss of paid employment opportunities and increasing inequality are real world concerns.

UBI isn't coming by itself.

replies(2): >>42963487 #>>42965543 #
80. pjc50 ◴[] No.42963487{5}[source]
Sure, but those are also real world concerns in the non-AI alternate timeline. As is the unlikelihood of UBI.
replies(1): >>42963556 #
81. danans ◴[] No.42963522{7}[source]
> Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

Plenty of profit was made off feudalism, and technofeudalism has all the tools of modern technology at its disposal. If things go in that direction, they will have an unlimited supply of serfs desperate for whatever human work/livelihood is left.

replies(1): >>42963757 #
82. ◴[] No.42963535{9}[source]
83. danans ◴[] No.42963556{6}[source]
Yes, but they are likely dramatically accelerated in the AI timeline.
84. ziofill ◴[] No.42963668{3}[source]
What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.
85. soco ◴[] No.42963733{8}[source]
So you end up with a huge starved mob trying to come for your mansions and islands. I somehow think Musk is totally capable of nuking those mobs, or unleashing the (future) AI dogs on them, because the mob cannot produce anymore (because of AI) and cannot pay anymore (because there are no jobs, because of AI). So the mob will be totally worthless to this style of "capitalism". Really, why would they bother with UBI when they can let the mob just die out?
replies(1): >>42969220 #
86. soco ◴[] No.42963757{8}[source]
Unlimited supply yes, but highly limited usage for them. So even if a few will work for free, the rest will be starving, and angry.
87. pradn ◴[] No.42964022[source]
Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.
88. pradn ◴[] No.42964057[source]
I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?
89. ben_w ◴[] No.42964183{4}[source]
> This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world.

It has certainly had this impact on my identity; I am unclear how well-grounded I really am*.

> I've yet to see really good articulations of what, precisely we should be scared of.

What would such an articulation look like, given you've not seen it?

> Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them.

Even with current limited systems — which are not purely desk workers, they're already being connected to and controlling robots, even by amateurs — AI lowers the minimum human skill level needed to do those things.

The fear is: how far are we from an AI that doesn't need a human in the loop? Because ChatGPT was almost immediately followed by ChaosGPT, and I have every reason to expect people to continue to make clones of ChaosGPT continuously until one is capable of actually causing harm. (As with 3d-printed guns, high chance the first ones will explode in the face of the user rather than the target).

I hope we're years away, just as self driving cars turned out to be over-promised and under-delivered for the last decade — even without a question of "safety", it's going to be hard to transition the world economy to one where humans need not apply.

> And the problem of "human alignment" is one unsolved since Cain and Abel.

Yes, it is unsolved since time immemorial.

This has required us to not only write laws, but also design our societies and institutions such that humans breaking laws doesn't make everything collapse.

While I dislike the meme "AI == crypto", one overlap is that both have nerds speed-running discovering how legislation works and why it's needed — for crypto, specifically financial legislation after it explodes in their face; for AI, to imbue the machine with a reason to approximate society's moral code, because they see the problem coming.

--

* Dunning Kruger applies; and now I have first-hand experience of what this feels like from the inside, as my self-perception of how competent I am at German has remained constant over 7 years of living in Germany and improving my grasp of the language the entire time.

90. ben_w ◴[] No.42964429{5}[source]
> If there are no consumers then why even bother producing?

> If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Imagine you're a billionaire with a data centre and golden horde of androids.

You're the consumer, the robots make stuff for you; they don't make stuff for anyone else, just you, in the same way and for the same reason that your power tools and kitchen appliances don't commute to work — you could, if you wanted, lend them to people, just like those other appliances, but you'd have to actually choose to, it wouldn't be a natural consequence of the free market.

Their production is, indeed, pointless. This doesn't help anyone else eat. The moment anyone can afford to move from "have not" to "have", they drop out of the demand market for everyone else's economic output.

I don't know how big the impact of dropping out would be: the right says "trickle down economics" is good and this would be the exact opposite of that; while the left criticism's of trickle-down economics is that in practice the super-rich already have so much stuff that making them richer doesn't enrich anyone else who might service them, so if the right is correct then this is bad but if the left is correct then this makes very little difference.

Unfortunately, "nobody knows" is a great way to get a market panic all by itself.

91. t_mann ◴[] No.42964657{5}[source]
Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.
replies(1): >>42965278 #
92. timschmidt ◴[] No.42965278{6}[source]
Maybe I'm just a filthy computationalist, but the way I see it, the most accurate model of the universe is the one which makes the most accurate predictions with the fewest parameters.

The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.

My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.

As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...

I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.

replies(1): >>42966955 #
93. ClumsyPilot ◴[] No.42965302{5}[source]
> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work

Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.

94. ClumsyPilot ◴[] No.42965342{7}[source]
> Many many layers of that. It’s not a profound mechanism

Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire, but they could not make a steam engine.

Finally, we understand protons, electrons, and neutrons and the forces that govern them, but that does not mean we understand everything they could possibly make.

replies(1): >>42965612 #
95. zoogeny ◴[] No.42965461{4}[source]
Some of the scariest horror movies are the ones where the monster isn't shown. Often once the monster is shown, it is less terrifying.

In a general sense, uncertainty causes anxiety. Once you know the properties of the monster you are dealing with you can start planning on how to address it.

Some people have blind and ignorant confidence. A feeling they can take on literally anything, no matter how powerful. Sometimes they are right, sometimes they are wrong.

I'm reminded by the scene in No Country For Old Men where the good guy bad-ass meets the antagonist and immediately dies. I have little faith in blind confidence.

edit: I'll also add that human adaptability (which is probably the trait most confidence in humans would rest) has shown itself capable of saving us from many previous civilization changing events. However, this change with AI is happening much, much faster than any before it. So part of the anxiety is whether or not our species reaction time is enough to avoid the cliff we are accelerating towards.

96. mvieira38 ◴[] No.42965543{5}[source]
Worst-case scenario, humans mostly go back to manual labor, which would fix a lot of modern-day ailments such as obesity and (some) mental health struggles, with the added bonus of enormous engineering advancements based on automated research.
replies(1): >>43033687 #
97. bloomingkales ◴[] No.42965612{8}[source]
"Cavemen understood stone"

How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...

The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").

"Finally we understand protons, electrons, and neutrons and the forces that government them but it does not mean we understand everything they could mossibly make"

You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying, we cannot fathom it (either because infinity is hard to fathom, or that it's origins are humble - just a few core elements, or both, or something else).

Anyway, this discussion can head in any direction.

98. Arn_Thor ◴[] No.42965746{5}[source]
No, it won't. But that's not the point I was making
99. Arthur_ODC ◴[] No.42965862{3}[source]
So, can a distilled 8B model (say, the Deepseek-R1-Distil-Llama-8B or whatever) be "trained up" to a higher parameter 16B Parameter model after distillation from a superior model, or is it forever stuck at the 8B parameters that can just be fine tuned?
100. maginx ◴[] No.42966331[source]
I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?". If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>", it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.
101. cztomsik ◴[] No.42966394[source]
Nope, it's quite obvious why distillation works. If you just predict next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use (typically few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
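In code, the difference being described is roughly the classic soft-target distillation loss (a sketch of that general idea, not any particular lab's recipe):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, target_ids, T=2.0, alpha=0.5):
      # Hard CE only rewards the single "correct" next token from the dataset;
      # the KL term also teaches the student the teacher's weights over the
      # other plausible tokens (all the other names "Foo" could have been).
      hard = F.cross_entropy(student_logits, target_ids)
      soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
      return alpha * hard + (1.0 - alpha) * soft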

102. cmgriffing ◴[] No.42966553{3}[source]
This brings up an interesting thought too. A photo is just a lossy representation of the real world.

So it's lossy all the way down with LLMs, too.

Reality > Data created by a human > LLM > Distilled LLM

103. t_mann ◴[] No.42966955{7}[source]
The thing is, Lagrangian mechanics makes exactly the same predictions as Newtonian, and it starts from a foundation of just one principle (least action) instead of three laws, so it's arguably a sparser theory. It just makes calculations easier, especially for more complex systems; that's its raison d'être. So in a world where we don't know about relativity yet, both make the best predictions we know (and they always agree), but Newton's laws were discovered earlier. Do they suddenly stop being natural laws once Lagrangian mechanics is discovered? Standard physics curricula would not agree with you btw, they practically always teach Newtonian mechanics first and Lagrangian later, also because the latter is mathematically more involved.
replies(3): >>42967070 #>>42967186 #>>42986201 #
104. Melatonic ◴[] No.42967063[source]
Mind elaborating ?
105. timschmidt ◴[] No.42967070{8}[source]
> Do they suddenly stop being natural laws once Lagrangian mechanics is discovered?

Not my question to answer, I think that lies in philosophical questions about what is a "law".

I see useful abstractions all the way down. The linked Asimov essay covers this nicely.

106. dragonwriter ◴[] No.42967186{8}[source]
Laws (in science, not government) are just relationships that are consistently observed, so Newton's laws remain laws until contradictions are observed, regardless of the existence of one or more alternative models which would predict them to hold.

The kind of Occam's Razor-ish rule you seem to be querying about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, they predict exactly the same actually observed phenomena but still differ in predictions which have not been testable), whereas Newtonian and Lagrangian mechanics are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth, because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.

(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)

replies(1): >>42979920 #
107. kristianp ◴[] No.42967235[source]
Chain of thought (CoT)?
108. palmotea ◴[] No.42969220{9}[source]
> Really why would they bother with UBI when they can let the mob just die out?

Personally, I think UBI is a ploy to keep the "huge starved mob[s]" pacified during the transition, when they still have enough power to act, before the tech oligarchs fully cement their control.

Once the common people are powerless to protect themselves and their interests, then they'll be left to die out.

109. t_mann ◴[] No.42979920{9}[source]
Newtonian and Lagrangian mechanics are equivalent only in their predictions, not in their complexity - one requires three assumptions, the other just one. Now you say the fact that they have the same predictions makes them equivalent, and I agree. But it's clearly not compatible with what the other poster said about looking for the simplest possible way to explain a phenomenon. If you believe that that's how science should work, you'd need to discard theories as soon as simpler ones that make the same predictions are found (as in the case of Newtonian mechanics). It's a valid philosophical standpoint imho, but it's in opposition to how scientists generally approach Occam's razor, as evidenced eg by common physics curricula. That's what I was pointing out. Having to exclude Newtonian mechanics from what can be considered science is just one prominent consequence of the other poster's philosophical stance, one that could warrant reconsidering whether that's how you want to define it.
110. Cleonis ◴[] No.42986201{8}[source]
I will argue that 'has least action as foundation' does not in itself imply that Lagrangian mechanics is a sparser theory:

Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.

Before the introduction of relativistic physics the assumption that space is euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.

In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.

We can choose to make the assertion of a metric of spacetime a very wide assertion: such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.

Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.

Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.

To formulate Lagrangian mechanics: choosing stationary action as the foundation is in itself not sufficient; you have to specify a metric.

So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.

More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.

Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.

replies(1): >>42987402 #
111. t_mann ◴[] No.42987402{9}[source]
You seem to know more about this than me, but it seems to me that the first law does more than just induce a metric; I've always thought of it as positing inertia as an axiom.

There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.

replies(1): >>42993136 #
112. Cleonis ◴[] No.42993136{10}[source]
Indeed inertia. Theory of motion consists of describing the properties of Inertia.

In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.

In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.

Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.

That is why I argue that for Newtonian mechanics two principles are sufficient.

The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.

The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I give a link to an answer of mine on physics.stackexchange where I derive the work-energy theorem: https://physics.stackexchange.com/a/788108/17198

The work-energy theorem is the most important theorem of classical mechanics.

About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom, the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.

When expressing in terms of energy: As we know: the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector directional information is discarded, it is lost.

The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.

When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.

This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.

The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.

I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.

When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.

Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.
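For a single coordinate, the equivalence is the standard textbook computation; with L = T - V, the Euler-Lagrange equation just reproduces F = ma:

  L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - V(x)

  \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x}
    = m\ddot{x} + \frac{dV}{dx} = 0
    \quad\Longleftrightarrow\quad
    m\ddot{x} = -\frac{dV}{dx} = F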

replies(1): >>42997618 #
113. t_mann ◴[] No.42997618{11}[source]
Nicely said, but I think then we are in agreement that Newtonian mechanics has a bit of redundancy that can be removed by switching to a Lagrangian framework, no? I think that's a situation where Occam's razor can be applied very cleanly: if we can make the exact same predictions with a sparser model.

Now the other poster has argued that science consists of finding minimum-complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool for making equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.

114. fennecfoxy ◴[] No.43000399[source]
In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

Not sure what caused the above, but in my opinion it's either that the training is affected by the date of the training data (i.e. the model answers worse because in every year of the training data there were fewer or lower-quality examples toward the end of the year), or that it picked up a cultural impression from humans talking about going on holiday/having a break etc. in the training data at certain times, and associated that with the meaning of "having a break".

I still wonder if we're building models wrong by training them on a huge amount of data from the Internet, then fine-tuning for instruct, where the model learns to make certain logical associations inherent in or similar to the training data (which seems to introduce a myriad of issues, like the strawberry problem or getting "is x less than y" wrong).

I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately without the core data set or to restrict the instruct fine tuning in some way so that we reduce the amount of "culture" it gleans from the data.

There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.

115. fennecfoxy ◴[] No.43000430{3}[source]
Yeah, but they do seem to be getting high accuracy numbers for the distilled models measured against the larger model. If the smaller model is 90% as accurate as the larger one but uses much less than 90% of the parameters, then surely that counts as a win.
116. fennecfoxy ◴[] No.43000550{4}[source]
Eh, I feel like that's mostly just down to this: yes, transformers are a "next token predictor", but during fine-tuning for instruct, the attention-related wagon slapped on the back is partially hijacked as a bridge from input tokens to sequences of connections in the weights.

For example if I ask "If I have two foxes and I take away one, how many foxes do I have?" I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query to connect to a learned sequence from readily available training data (apparently the whole damn Internet) where there are plenty of examples of said math question trope, just using some other object type than foxes.

I think we could probably prove it by tracing the hyperdimensional space the model exists in and ask it variants of the same question/find hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).

117. fennecfoxy ◴[] No.43000641{4}[source]
Depends on the model I suppose. Atm everything is being heavily trained as LLMs without much capability outside of input text->output text aside from non-modelised calls out to the Internet/RAG system etc.

But at some point (still quite far away) I'm sure we'll start training a more general purpose model, or an LLM self-training will break outside of the "you're a language model" bounds and we'll end up with exactly that;

An LLM model in a self-training loop that breaks outside of what we've told it to be (a Language model), becomes a general purpose model and then becomes intelligent enough to do something like put itself out onto the Internet. Obviously we'd catch the feelers that it puts out and realise that this sort of behaviour is starting to happen, but imagine if we didn't? A model that trained itself to be general purpose but act like a constantly executing LLM, uploads itself to Hugging Face, gets run on thousands of clusters by people, because it's "best in class" and yes it's sitting there answering LLM type queries but also in the background is sending out beacons & communicating with itself between those clusters to...idk do something nefarious.

118. fennecfoxy ◴[] No.43000689{4}[source]
I guess it goes to show how important reiteration is for general logic problems. And tbf when finding a solution to something myself I'll consider each part, and/or consider parts in relation to each other and/or consider all parts in relation to each other (on a higher level) before coming to a final solution.

It's weird because I feel like we should've known that from work in general logic/problem solving studies, surely?

119. fennecfoxy ◴[] No.43000736{4}[source]
Yeah but what was R1 trained with? 50k GPUs as far as I've heard as well as distillation from OpenAI's models (basically leaning on their GPUs/GPU time).

Besides the fact that consumers will still always want GPUs for gaming, rendering, science compute etc.

No, I don't have any Nvidia stocks.

120. cristiancavalli ◴[] No.43015969{6}[source]
Not at all — how many people were geniuses and their parents not? I can name several and I’m sure with a quick search you can too.
replies(1): >>43039510 #
121. n4r9 ◴[] No.43033687{6}[source]
Manual labour jobs are not magically going to appear.
122. iFreilicht ◴[] No.43039510{7}[source]
How is that relevant? A few examples do not disprove anything. It's pretty common knowledge that the more successful/rich etc. your parents were, the more likely you'll be successful/rich etc.

This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.

123. kortilla ◴[] No.43042399{8}[source]
It’s true, but I can see why you’d be confused. You conflated what the economy rewards (which is what caters to the large middle class pool of money) with what individual companies try to optimize for (eliminating labor costs).