Qwen2.5-VL-32B: Smarter and Lighter

(qwenlm.github.io)


simonw ◴[24 Mar 25 18:53 UTC] No.43464243[source]▶

32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to March 2023 GPT-4 level performance, which is when LLMs first got really useful) but small enough that you can run them on a single GPU or a reasonably well-specced Mac laptop (32GB or more).

replies(9): >>43464289 #>>43464380 #>>43464443 #>>43464588 #>>43464688 #>>43467991 #>>43468940 #>>43469099 #>>43470619 #

1. YetAnotherNick ◴[24 Mar 25 19:14 UTC] No.43464443[source]▶

>>43464243 #

I don't think these models are GPT-4 level. Yes, they seem to be on benchmarks, but it is well known that model developers increasingly use A/B testing in dataset curation and synthesis (using GPT-4 level models) to optimize not just for the benchmarks but for things that could be benchmarked, like academic tasks.

replies(2): >>43464533 #>>43468989 #

2. simonw ◴[24 Mar 25 19:24 UTC] No.43464533[source]▶

>>43464443 #

I'm not talking about GPT-4o here - every benchmark I've seen has had the new models from the past ~12 months outperform the March 2023 GPT-4 model.

To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.

replies(1): >>43465368 #

3. th0ma5 ◴[24 Mar 25 21:01 UTC] No.43465368[source]▶

>>43464533 #

How have you been able to tie benchmark results to better results?

replies(1): >>43465877 #

4. simonw ◴[24 Mar 25 22:02 UTC] No.43465877{3}[source]▶

>>43465368 #

Vibes and intuition. Not much more than that.

replies(1): >>43474204 #

5. tosh ◴[25 Mar 25 08:13 UTC] No.43468989[source]▶

>>43464443 #

Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".

Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. In those cases, a model with good capability for the use case will do fine (e.g. as well as GPT-4).

That same model might hallucinate when you ask about the plot of a movie, while a larger model like GPT-4 might be better able to recall what the movie is about.

6. th0ma5 ◴[25 Mar 25 18:16 UTC] No.43474204{4}[source]▶

>>43465877 #

Don't you think that presenting this as learning or knowledge is unethical?
