
262 points by rain1 | 4 comments
ljoshua:
Less a technical comment and more just a mind-blown comment, but I still can’t get over just how much data is compressed into and available in these downloadable models. Yesterday I was on a plane with no WiFi, but had gemma3:12b downloaded through Ollama. Was playing around with it and showing my kids, and we fired history questions at it, questions about recent video games, and some animal fact questions. It wasn’t perfect, but holy cow the breadth of information that is embedded in an 8.1 GB file is incredible! Lossy, sure, but a pretty amazing way of compressing all of human knowledge into something incredibly contained.
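Rough back-of-envelope on where that 8.1 GB comes from (assuming gemma3:12b is roughly 12B parameters and the Ollama download is the usual ~4-bit quantized build; both figures are approximate):

    # Implied storage per parameter for the downloaded file.
    # Assumed: ~12.2e9 parameters (gemma3:12b) and the 8.1 GB size quoted above.
    params = 12.2e9
    file_bytes = 8.1e9
    print(f"{file_bytes * 8 / params:.1f} bits per parameter")  # ~5.3
    # Consistent with ~4-bit quantized weights plus higher-precision
    # embeddings and metadata, versus roughly 24 GB at full 16-bit precision.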
Workaccount2:
I don't like the term "compression" used with transformers because it gives the wrong idea about how they function. Like that they are a search tool glued onto a .zip file, your prompts are just fancy search queries, and hallucinations are just bugs in the recall algo.

Although strictly speaking they have lots of information in a small package, they are F-tier compression algorithms because the loss is bad, unpredictable, and undetectable (i.e. a human has to check it). You would almost never use a transformer in place of any other compression algorithm for typical data compression uses.

angusturner:
There is an excellent talk by Jack Rae called “Compression for AGI”, where he shows (what I believe to be) a little-known connection between transformers and compression:

On one view, LLMs are SOTA lossless compression algorithms, where the model weights don’t count towards the description length. Sounds crazy, but it’s true.
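To make the connection concrete: any model that assigns a probability to the next symbol can drive an arithmetic coder, and the message then costs about -log2 p bits per symbol. If sender and receiver both update an identical model from the data already transmitted, the weights never have to be sent, which is why they don't count toward the description length. A toy sketch of the idea (an adaptive byte-level unigram model standing in for the LLM; my own illustration, not the construction from the talk):

    import math
    from collections import Counter

    def ideal_code_length_bits(text: str) -> float:
        """Code length of `text` under an adaptive, Laplace-smoothed unigram
        model over bytes; encoder and decoder can both maintain this model
        deterministically from the already-decoded prefix."""
        counts = Counter()
        seen = 0
        total_bits = 0.0
        alphabet = 256  # byte-level alphabet
        for b in text.encode("utf-8"):
            # Probability the shared model assigns to the next byte before seeing it.
            p = (counts[b] + 1) / (seen + alphabet)
            total_bits += -math.log2(p)  # what an arithmetic coder would spend
            counts[b] += 1
            seen += 1
        return total_bits

    sample = "the cat sat on the mat. the cat sat on the mat again. " * 20
    bits = ideal_code_length_bits(sample)
    print(f"raw: {len(sample) * 8} bits, model-coded: {bits:.0f} bits "
          f"({bits / len(sample):.2f} bits/char)")

A stronger predictor (e.g. an LLM's next-token distribution) drives the bits-per-character far lower; that is the sense in which LLMs are SOTA lossless compressors.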

Workaccount2:
A transformer that doesn't hallucinate (or that knows when it is hallucinating) would be the ultimate compression algorithm. But right now that isn't a solved problem, and it leaves LLM output too untrustworthy to use in place of what are colloquially known as compression algorithms.
Nevermark:
It is still task related.

Compressing a comprehensive command-line reference into a model might introduce errors and drop some options.

But for many people, especially new users, referencing commands and getting examples via a model would deliver many times the value.

Lossy vs. lossless are fundamentally different, but so are use cases.

swyx:
his talk here https://www.youtube.com/watch?v=dO4TPJkeaaU

and his last before departing for Meta Superintelligence https://www.youtube.com/live/U-fMsbY-kHY?si=_giVEZEF2NH3lgxI...