The RAM requirements at 200k seem to assume 16-bit precision? With 4-bit quantization it's more like ~12GB of KV cache for a 200k sequence length. Unless I'm missing something?
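For anyone sanity-checking, the arithmetic is simple; here's a back-of-the-envelope sketch. The config values (80 layers, 8 KV heads, head dim 128) are my assumptions for a Llama-3-70B-style model with grouped-query attention, so the number shifts with the actual architecture.

```python
# KV cache holds 2 tensors (K and V) per layer, per token.
# Config values are assumed, Llama-3-70B-ish; substitute your model's.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=0.5):       # 0.5 bytes = 4-bit cache
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for name, b in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = kv_cache_bytes(200_000, bytes_per_elem=b) / 1e9
    print(f"200k tokens, {name} cache: ~{gb:.1f} GB")
# 200k tokens, fp16 cache: ~65.5 GB
# 200k tokens, 8-bit cache: ~32.8 GB
# 200k tokens, 4-bit cache: ~16.4 GB
```

Whether 4-bit lands nearer 12GB or 16GB depends on the layer count and KV-head layout of the specific model.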
How much will far larger context lengths unlock a more active intelligence from LLMs?
Would virtually unlimited context move us much closer to an LLM where training is continuous? Is that a bit of a holy grail? I assume not, but would love to know why.
I guess part of what I really don't understand is how context tokens compare to trained weights, as far as their value to the final response. Would a giant context window muddle the value of the weights?
(Maybe what I am missing is the human-feedback on the training weights? If the giant system prompt I am imagining is garbage, then that would be bad.)
Remarkably, foundation models can learn new tasks from just a few examples (called few-shot learning). LLM answers also improve significantly when given relevant supplemental information. Boosting context length: grows the model's "working memory"; provides richer knowledge to inform its reasoning; and expands its capacity for new tasks, given germane examples.
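To make "few-shot learning" concrete, here's a toy prompt. The task and every string in it are invented for illustration, not taken from any real dataset.

```python
# A toy few-shot prompt: the model infers the task (English -> French)
# from the examples alone; no weights are updated.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""

# Send `prompt` to any completion endpoint; the expected continuation
# is "eau", a mapping the model picks up purely in context.
print(prompt)
```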
> Would virtually unlimited context move us much closer to an LLM where training is continuous?
No. You can already continually train a context-limited LLM, and virtually unlimited context window schemes also exist. Training is a separate concept from context length. In pre-training, we work backward from a model's incorrect answer, tweaking its parameters so it's more likely to say the correct thing next time. Fine-tuning is the same, but focused on specific tasks important to the user. After training, when running the model (called inference), you can change the context length to suit your needs and tradeoffs.
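To make "tweaking its parameters" concrete, here's a minimal sketch of the update rule on a toy next-token model. This is not any real LLM's training loop, just gradient descent on a cross-entropy loss in miniature.

```python
import numpy as np

# Toy next-token "model": one weight matrix mapping a context token
# directly to logits over a 5-word vocabulary.
rng = np.random.default_rng(0)
vocab = 5
W = rng.normal(size=(vocab, vocab)) * 0.01

def train_step(ctx_token, target_token, lr=0.1):
    logits = W[ctx_token]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax
    # Cross-entropy gradient: nudge the weights so the correct next
    # token becomes more likely (work backward from the wrong answer).
    grad = probs.copy()
    grad[target_token] -= 1.0
    W[ctx_token] -= lr * grad
    return -np.log(probs[target_token])       # loss for this step

for _ in range(100):
    loss = train_step(ctx_token=2, target_token=3)
print(f"loss after 100 steps: {loss:.4f}")    # falls toward zero
```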
> tweaking its parameters to more likely say the correct thing next time.
Is this done entirely, or just partially, via human feedback on models like GPT-4 and Llama-3, for example?
Don't know how to work with my test system? No problem, here is an 800 page reference manual. Now you know.
You’ve got an apparent contradiction: SGD (AdamW at 1e-6, give or take) works. So we’ve got extremely abundant local maxima up to epsilon_0, but it always lands in the same place, so there are abundant “well-studied” minima, likewise symmetrical up to epsilon_1, both of which are roughly: “start debugging if above, as far as we can tell”.
The maxima have meaningful curvature tensors at or adjacent to them: AdamW works.
But joker in the deck: control vectors work. So you’re in a quasi-Euclidean region.
In fact all the useful regions are about exactly the same; the weights are actually complex-valued, everyone knows this part….
The conserved quantity up to let’s call it phi is compression ratio.
Maybe in a year or two, when Altman is in jail and Mme. Su gives George cards that work, we’ll crunch numbers more interesting than what a googol FMA units cost, and Emmy Noether gets some damned credit for knowing this a century ago.
Make a dissenting case or leave it the fuck alone. I’ve been pretty laid back about the OpenAI bot thing but I’m over it.
The way we train them involves finding steepness and chasing it, and it almost always works for a bit, often for quite a while. But the flat places it ends up are both really flat, and zillions of them are nearly identical.
Those two sets of nearly identical “places”, and in particular their difference in usefulness via selection bias, are called (together or separately) a “gauge symmetry”, which basically means things remain true as you vary things a lot. The things that remain true are usually “conserved quantities”, and in the case of OpenAI 100% compressing the New York Times, the conserved quantity is compression ratio, up to some parameter of lossiness.
This is by no means the worst example this month. You run the best moderation team on the Internet, but no one at OpenAI (including Fidji) will flat deny they’re doing it, and it’s just obvious.
I know you’re doing yeoman’s work like always. Have someone let @sama know that at least one person is going to start making charts. Not here.
LLMs are very effective at few-shot learning via the context window, so for all practical purposes, yes, large context windows do allow for continuous learning.
Note that the context needs to be loaded and processed on every request to the LLM though - so all that additional information has to be "retaught" each time.
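A sketch of what that looks like from the client side; `call_llm` is a hypothetical stub standing in for whatever API you actually use.

```python
# The supplemental context rides along with EVERY request; nothing
# persists server-side between calls.
def call_llm(prompt: str) -> str:            # hypothetical stub
    return f"(answer; prompt was {len(prompt):,} chars)"

MANUAL = "...800 pages of reference manual text..."   # re-sent each time
history = []

def ask(question: str) -> str:
    # Manual + conversation are concatenated and re-processed from
    # scratch on every call: the model is "retaught" each time.
    prompt = MANUAL + "\n" + "\n".join(history) + "\nUser: " + question
    answer = call_llm(prompt)
    history.append(f"User: {question}\nAssistant: {answer}")
    return answer

print(ask("How do I reset the test system?"))
```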
[1] https://openreview.net/pdf?id=992eLydH8G "These results indicate that the equivalence between ICL (ed: in-context learning) and GD (ed: gradient descent) is an open hypothesis, requires nuanced considerations, and calls for further studies."
This isn't correct (for most definitions of "unlock intelligence"). In-Context Learning (ICL) can do everything off-line training can do.
> We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases and can learn high-dimensional functions with numerical inputs.
e.g. https://ollama.com/library/dolphin-llama3:8b-256k-v2.9-q6_K
Also, your link appears to exactly address my question. It's late here, but I am very excited to do my best at understanding that paper in the morning.
The single-minded, singular goal of seeing Sam Altman answer to a criminal jury for his crimes?
Serious as a guy happy to talk to journalists. I won’t sleep a full night until he hears a verdict carried by a bailiff.
Any remaining confusion?
For a reader who doesn't understand the math (or its implications and connections to the shared link here, especially since the link is sparse on details unless you know where to look), I could see leaning toward the latter and downvoting without comment.
“Jail Altman” is basically my sig now. And no one wonders why someone would be passionate about that. Not in good faith.
The uniformity (gauge symmetry up to a bound) of such regions is a way of thinking about the apparent contradiction between the properties of a billion-dimensional space before and after a scalar loss pushes a gradient around in it.
What are the hardware requirements to run the model I gave you?
Based on the provided document, the hardware requirements for running the llama3-gradient model depend on the desired context window size:
To use a 256k context window, at least 64GB of memory is required. To use a 1M+ context window, significantly more memory (100GB+) is needed.
The exact requirements are not specified, but it's clear that running this model with very large context windows requires a machine with a substantial amount of RAM, likely a high-end workstation or server rather than a typical consumer PC. The computational demands also likely necessitate a powerful CPU and/or GPU.

What type of CPU, what type of GPUs, and how many would it take?

The document does not provide specific details on the CPU or GPU requirements for running the llama3-gradient model. However, based on the memory requirements mentioned and the nature of the model, we can make some educated guesses:
CPU:
For optimal performance, a high-end, multi-core server-grade CPU would likely be necessary, such as an Intel Xeon or AMD EPYC processor with a high core count (e.g., 16+ cores).
GPU:
Running large language models often benefits greatly from the use of one or more powerful GPUs. High-end consumer GPUs like the Nvidia GeForce RTX 3090 (24GB VRAM) or data center GPUs like the Nvidia A100 (40GB-80GB VRAM) would likely be suitable.
The number of GPUs required would depend on the specific implementation and the desired performance. A single high-end GPU might be sufficient, but using multiple GPUs could speed up processing.
Keep in mind that these are estimates based on the provided information. The actual hardware requirements may vary depending on the specific implementation, optimizations, and performance requirements. Running such a model with very large context windows would likely require a powerful, expensive system with ample RAM, fast storage, and robust cooling. Detailed benchmarks or recommendations from the model's developers would provide more definitive guidance on the exact hardware needed.
These days we have KL-divergence and information gain and countless other ways to be rigorous, but you still have to be kind of careful with “macro” vs “micro” states, it’s just a slippery concept.
Whether or not some 7B-parameter NN that was, like, Xavier- or He-initialized or whatever the Fortress of Solitude people are doing these days is more or less unique than after you push an exabyte of Harry Potter fan fiction through it?
I think that’s an interesting question even if I (we) haven’t yet posed it in a rigorous way.
I think it’s extremely thought provoking why they would be symmetrically, locally Euclidean in such abundance.
I’m paying attention, but we are talking about a giant neural network trained by my friends and former colleagues at FAIR based mostly out of FBNY where I used to go every day, so, I’ll contend there’s some math involved: this is a topic for people who make a serious priority out of it these days.
The controversial piece no one is coming right out and saying, I think it’s my “fuck @sama” refrain.
Though how something that’s a meme on YouTube channels about TypeScript is a bigger topic than finally giving Emmy Noether her props (if she’d been a man she’d be far more famous than e.g. Heisenberg) eludes me.
I’m saying that an iconic mathematician and physicist deprived of her rightful place in history had it right, and once crooks like “Fired for Fraud Repeatedly” Altman and Madame Su are out of the picture, we might re-learn what she taught us.
On reflection? Fuck you, you’re annoying, ignorant, and a shill if your comments are anything to go by.
The conclusions you draw are by no means conclusive, and at best seem only vaguely related to the unclear moral, ethical, or legal stance you appear to hold.
E.g., this extended-context Llama 3 70B requires 64GB at 256K context and over 100GB at 1M.
https://chat.openai.com/share/d228e04e-ae36-4468-ac45-fdb035...
I wish there was something like "can you run it" but for LLMs.
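A crude version is just arithmetic: quantized weights plus KV cache versus available RAM. This is a hedged sketch; the formulas are rules of thumb and the default dimensions are assumed, Llama-3-70B-ish, not measured from any real runtime.

```python
# Rough "can you run it" check: weights + KV cache + ~10% overhead.
def can_you_run_it(params_b, ctx_len, ram_gb, weight_bits=4, kv_bits=16,
                   n_layers=80, n_kv_heads=8, head_dim=128):
    weights_gb = params_b * weight_bits / 8               # params in billions
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8 / 1e9
    need = (weights_gb + kv_gb) * 1.1                     # overhead guess
    print(f"need ~{need:.0f} GB, have {ram_gb} GB -> "
          f"{'yes' if need <= ram_gb else 'no'}")

can_you_run_it(params_b=70, ctx_len=256_000, ram_gb=64)
# need ~131 GB, have 64 GB -> no (a 4-bit KV cache would change this)
```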
Instead we all wasted memory remembering that twice.
I plan to forget your username. I hope I never have cause to remember it.
Ronin, masterless. There’s no one to call me to heel if I take a dislike.