(qwenlm.github.io)

544 points tosh | 1 comments | 24 Mar 25 18:35 UTC

Arcuru ◴[24 Mar 25 19:17 UTC] No.43464463[source]▶

Does anyone know how making the models multimodal impacts their text capabilities? The article is claiming this achieves good performance on pure text as well, but I'm curious if there is any analysis on how much impact it usually has.

I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.

replies(2): >>43467109 #>>43471257 #

1. netdur ◴[25 Mar 25 01:06 UTC] No.43467109[source]▶

>>43464463 #

My understanding is that in multimodal models, text and image vectors are aligned to the same semantic space. This alignment seems to be the main difference from text-only models.
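
To make the "same semantic space" idea concrete, here is a minimal sketch of one common pattern: a projection layer maps vision-encoder features to the language model's embedding width so image patches and text tokens can sit in one sequence. This is an illustration under assumptions, not a description of how Qwen2.5-VL is built. The dimensions, the `project` helper, and the random weights are all hypothetical stand-ins for learned parameters, and the semantic alignment itself would only come from training, which is not shown.

```python
import random

random.seed(0)

TEXT_DIM = 4    # hypothetical embedding width of the language model
VISION_DIM = 6  # hypothetical feature width of the vision encoder


def random_matrix(rows, cols):
    """Random values standing in for learned weights or encoder outputs."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Multiply each feature vector by a (len(vec) x out_dim) weight matrix."""
    return [
        [sum(f * w for f, w in zip(vec, column)) for column in zip(*weights)]
        for vec in features
    ]


text_embeddings = random_matrix(3, TEXT_DIM)     # 3 text tokens
image_features = random_matrix(2, VISION_DIM)    # 2 image patches
projector = random_matrix(VISION_DIM, TEXT_DIM)  # vision space -> text space

# After projection, image patches have the same width as text tokens.
image_embeddings = project(image_features, projector)

# Both modalities now form a single input sequence for the language model.
sequence = image_embeddings + text_embeddings
```

Because every vector in `sequence` has width `TEXT_DIM`, the language model can attend over image and text positions uniformly. Whether that shared input helps or hurts pure-text performance is the open question raised above, and this sketch says nothing about it.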

Qwen2.5-VL-32B: Smarter and Lighter