
577 points simonw | 2 comments
bob1029 ◴[] No.44725490[source]
> still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.

I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.

It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.
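For concreteness, here is a minimal sketch of the kind of post-training precision cut being described: symmetric per-tensor int8 quantization of one weight matrix. The matrix, its size, and the scheme are illustrative assumptions, not anything from the post; real schemes (per-channel scales, 4-bit group quantization) are fancier but follow the same idea.

    # Illustrative sketch (assumed sizes/scheme): quantize a stand-in weight
    # matrix from fp32 to int8 with one scale per tensor, then measure how
    # little is lost when it is dequantized again.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # stand-in for trained weights

    scale = np.abs(w).max() / 127.0                                # one scale for the whole tensor
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 4x smaller in RAM / on disk

    w_hat = w_q.astype(np.float32) * scale                         # dequantize for use in a matmul
    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)

    print(f"fp32: {w.nbytes / 1e6:.0f} MB  ->  int8: {w_q.nbytes / 1e6:.0f} MB")
    print(f"relative reconstruction error: {rel_err:.4f}")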

replies(2): >>44725533 #>>44758837 #
jonas21 ◴[] No.44725533[source]
Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).
replies(2): >>44725644 #>>44725671 #
1. itsalotoffun ◴[] No.44725671[source]
I think GP means that if you internalize the bitter lesson (more data and more compute win), you stop imagining how to squeeze SOTA-minus-one performance out of constrained compute environments.
replies(1): >>44730409 #
2. reactordev ◴[] No.44730409[source]
This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here. The more we work with these models (trained on 22T tokens), with quants, and with decimated precision, the more we learn and the more novel ways we find to do things.
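For a sense of why that matters on something like a 64GB laptop, here is the back-of-the-envelope arithmetic for weight memory at different precisions, using an assumed 30B-parameter model purely for illustration (not the model from the post) and ignoring KV cache and runtime overhead.

    # Assumed parameter count, for illustration only.
    params = 30e9

    for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
        gb = params * bits / 8 / 1e9
        print(f"{name:>5}: ~{gb:.0f} GB of weights")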