
544 points tosh | 2 comments
Arcuru ◴[] No.43464463[source]
Does anyone know how making the models multimodal impacts their text capabilities? The article is claiming this achieves good performance on pure text as well, but I'm curious if there is any analysis on how much impact it usually has.

I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.

replies(2): >>43467109 #>>43471257 #
1. kmacdough ◴[] No.43471257[source]
I'm having a hard time finding controlled testing, but the premise is straightforward: different modalities encourage different skills. Text builds up more formal tokenization of ideas and strengthens logic and reasoning, while images require the model to learn a more robust geometric intuition. Since both modalities are learned in the same latent space, the strengths can be cross-applied.
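To make the premise concrete, here's a toy sketch (my own illustration, not from any real model): two modality-specific encoders project their inputs into one shared latent space, and a single downstream head consumes latents from either. The encoders and head here are hypothetical stand-ins; the point is only that shared downstream weights are what let skills learned from one modality transfer to the other.

```python
def text_encoder(tokens):
    # Hypothetical: pool character codes into a 4-dim shared latent.
    latent = [0.0] * 4
    for tok in tokens:
        for i, ch in enumerate(tok):
            latent[i % 4] += ord(ch) / 1000.0
    return latent

def image_encoder(pixels):
    # Hypothetical: pool a flat pixel list into the *same* 4-dim latent space.
    latent = [0.0] * 4
    for i, p in enumerate(pixels):
        latent[i % 4] += p / 255.0
    return latent

def shared_head(latent):
    # One set of weights applied to latents from either modality;
    # improving these weights via one modality benefits the other.
    weights = [0.5, -0.25, 1.0, 0.1]
    return sum(w * x for w, x in zip(weights, latent))

# Both modalities land in the same space, so the head is reused:
score_from_text = shared_head(text_encoder(["cat"]))
score_from_image = shared_head(image_encoder([10, 200, 30, 40]))
```

In real multimodal models the "head" is the bulk of the transformer trunk, which is why gradients from image data can reshape representations that text tasks also use.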

The same applies to humans. Imagine a human whose whole life involved reading books in a dark room, versus one who could also see images, versus one who could actually interact with the world.

replies(1): >>43538865 #
2. ItDoBeWimdyTho ◴[] No.43538865[source]
That comparison actually makes human reasoning abilities more impressive.

Helen Keller still learned robust generalizations.