Zamba2-7B

(www.zyphra.com)

282 points dataminer | 4 comments | 14 Oct 24 22:45 UTC

SubiculumCode ◴[14 Oct 24 23:31 UTC] No.41843327[source]▶

When they say that they use two attention heads, is each attention head directed at a different aspect of the data?

In memory research there is the idea that every event has a dual representation: a more verbatim representation and a more context-weighted one. Over early childhood, our verbatim memory representations increase in fidelity and in resistance to interference, but this peaks around 6 to 10 years of age, depending on the specifics. As verbatim memory matures, another aspect of memory representation improves: some have called it gist memory, or semantic context. Memory performance continues to improve into adolescence, primarily because of a growing ability to use context and gist (broad representations that capture the details of an event by inference) to increase overall accuracy. This also brings a greater likelihood of false alarms to lures primed by semantically related material during learning, precisely because recall relies more heavily on context.

So I could imagine such a system in an LLM, where one head directs attention to exact representations and another keeps its attention on a coarser grain of information that anchors it. However, I am not familiar enough with LLMs to know whether that is just silly analogizing.

replies(1): >>41844225 #

kla-s ◴[15 Oct 24 02:05 UTC] No.41844225[source]▶

>>41843327 #

Someone please correct me if I'm wrong, but my understanding of ML/LLMs is that this kind of hand-crafting has been tried, and that it is easier and less finicky to let behavior like this emerge from more data; see [1] "Bitter Lesson" and [2] "Scaling Laws".

Mamba as an architecture claims some significant performance gains, but to my knowledge no really large models (>~100B params) with a Mamba architecture have been disclosed, whether as open weights or leaks; this one is 7B.

As other comments have mentioned, another dimension not to forget is the training data. What we are learning more and more with LLMs is that quality matters, not just quantity.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html [2] see e.g. https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZyB... for a well-made, easily digestible intro

replies(1): >>41845350 #

1. sanxiyn ◴[15 Oct 24 05:54 UTC] No.41845350[source]▶

>>41844225 #

Jamba 1.5 Large is 398B params (94B active) and weights are available.

https://arxiv.org/abs/2408.12570
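To put those two numbers side by side, here is a rough back-of-the-envelope sketch. It only uses the parameter counts quoted in this thread (398B total, 94B active, and 7B for Zamba2); the constant and function names are made up for illustration, and the figures are not independently verified here.

```python
# Parameter counts in billions, as quoted in this thread (not verified here).
JAMBA_TOTAL_B = 398.0   # Jamba 1.5 Large, total parameters
JAMBA_ACTIVE_B = 94.0   # Jamba 1.5 Large, parameters active per token
ZAMBA2_B = 7.0          # Zamba2-7B


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a model's parameters that are active for a single token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


frac = active_fraction(JAMBA_ACTIVE_B, JAMBA_TOTAL_B)
size_ratio = JAMBA_TOTAL_B / ZAMBA2_B

print(f"active share per token: {frac:.1%}")        # 23.6%
print(f"total size vs Zamba2-7B: {size_ratio:.1f}x")  # 56.9x
```

So roughly a quarter of the weights are used per token, while the full model is a bit under 57 times the size of the 7B model discussed here.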

replies(2): >>41845686 #>>41847006 #

2. kla-s ◴[15 Oct 24 06:54 UTC] No.41845686[source]▶

>>41845350 (TP) #

Thanks, missed that one.

For context, GPT-4 is supposedly at 1.8T params.

3. littlestymaar ◴[15 Oct 24 10:18 UTC] No.41847006[source]▶

>>41845350 (TP) #

Thanks for the link. The benchmark results aren't too impressive for its size, but it likely hasn't been trained as thoroughly as Llama (I couldn't find the training size in the paper, but I doubt they have access to as much compute as Meta), so it still feels encouraging that it doesn't look ridiculous either.

replies(1): >>41847333 #

4. x_may ◴[15 Oct 24 11:15 UTC] No.41847333[source]▶

>>41847006 #

Not as much as Meta, no. But AI21 Labs is partnered with Amazon and did a ~$200M funding round last year IIRC, so they still have plenty of funds for training big models.
