godelski:
As an ML researcher and a degree-holding physicist, I'm really hesitant to use the words "understanding" and "describing" (much less hesitant about the latter) around these models. I don't find the language helpful and think it's mostly harmful, tbh.

The reason we use math in physics is its specificity, which is the same reason coding is so hard [0,1]. I think people aren't giving themselves enough credit for how much they (you) understand about things. It's the nuances that really matter. There's so much detail here, and we often forget how important it is because it's just normal to us. It's like forgetting about the ground you walk on.

I think something everyone should read is Asimov's "Relativity of Wrong"[2]. That is what we want to see in these systems before we start claiming they understand things. We want to see them do deduction and abduction. To be able to refine concepts and ideas. To be able to discover things that are more than just a combination of things they've ingested. What makes this really difficult is that we train these things on all human knowledge, and just reciting that knowledge back doesn't demonstrate intelligence. It's very unlikely that they losslessly compress all that knowledge into these model sizes, but without a very deep investigation into the training data and a lot of probing, it's very hard to tell what a model knows and what it merely memorizes. Really, this is a very poor way to go about making intelligence[3], or at least making intelligence and knowing that you've made it.
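
To put rough numbers on the compression point, here's a back-of-envelope sketch in Python. The corpus size, parameter count, and compression ratio are all assumptions for illustration, not measurements of any real model:

  # Can a model's weights losslessly store its training text?
  # All numbers below are illustrative assumptions.
  corpus_bytes = 10e12          # assume ~10 TB of training text
  params = 7e9                  # assume a 7B-parameter model
  bytes_per_param = 2           # assume fp16/bf16 weights
  weight_bytes = params * bytes_per_param  # 14 GB of weights

  # Even generously assuming the text compresses 4:1, the weights are
  # orders of magnitude too small to hold the corpus verbatim, so
  # whatever is stored must be heavily lossy.
  compressed_corpus = corpus_bytes / 4     # 2.5 TB
  print(f"weights: {weight_bytes/1e9:.0f} GB vs corpus: {compressed_corpus/1e12:.1f} TB")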

To really "understand" things we need to be able to propose counterfactuals[4]. Every physics statement is a counterfactual statement. Take F=ma as a trivial example. We can modify the mass or the acceleration to our heart's content and still determine the force. We can observe a specific mass moving at a specific acceleration and then ask the counterfactual "what if it was twice as heavy?" (twice the mass). *We can answer that!* In fact, your mental model of the world does this too! You may not be describing it with math (maybe you are ;) but you are able to propose counterfactuals and do a pretty good job a lot of the time. That doesn't mean you're always right, though. But this is how our heads work: you daydream these scenarios, you imagine them while you play, and so on. This, I can say with high confidence, is not something modern ML (AI) systems do.
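
As a minimal sketch of that counterfactual machinery (the function and numbers here are just for illustration):

  # Counterfactual reasoning with F = ma: observe one event, then
  # answer "what if?" by intervening on the causal model's inputs.
  def force(mass, acceleration):
      return mass * acceleration

  m_obs, a_obs = 2.0, 3.0             # observed: 2 kg at 3 m/s^2
  f_actual = force(m_obs, a_obs)      # 6.0 N

  # "What if it was twice as heavy?" We never observed this event,
  # but the causal model answers it anyway.
  f_what_if = force(2 * m_obs, a_obs)  # 12.0 N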

  == Edit ==
A good example of the lack of understanding is the image OP uses. Not only does the right hand have the wrong number of fingers, but look at the keys on the keyboard. It does not take much understanding to recognize that you shouldn't have repeated keys... the configuration is all wonky too, like one of those dreams you can immediately tell is a dream[5]. I'd also be willing to bet that the number of keys doesn't align with the number of markers, and the sizing definitely looks off. The more you look at it the worse it gets, and that's really common with these systems: nice at a quick glance, but DEEP in the uncanny valley at more than a glance, and deeper the more you look.

[0] https://youtube.com/watch?v=cDA3_5982h8

[1] Code is math. There's an isomorphism between Turing-complete languages and computable mathematics (a toy sketch follows these notes). You can look more into my namesake, Church, and Turing if you want to get more formal, or wait for the comment that corrects a nuanced mistake here (yes, it exists). Also, note that physics and math are not the same thing, but mathematics is unreasonably effective (yes, this is a reference).

[2] https://hermiene.net/essays-trans/relativity_of_wrong.html

[3] This is a very different statement from "making something useful." Without a doubt these systems are useful. Do not conflate the two.

[4] https://en.wikipedia.org/wiki/Counterfactual_thinking

[5] Yes, you can read in dreams. I do it frequently. Though on occasion I have lucid dreamed because I read something and noticed that it changed when I looked away and looked back.
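
To make the "code is math" point in [1] concrete, here is a toy sketch of Church numerals (a standard lambda-calculus construction, nothing specific to this thread), where arithmetic is nothing but function application:

  # Church numerals: the number n is encoded as "apply f, n times".
  zero = lambda f: lambda x: x
  succ = lambda n: lambda f: lambda x: f(n(f)(x))
  add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

  def to_int(n):                  # decode back to a Python int
      return n(lambda k: k + 1)(0)

  two, three = succ(succ(zero)), succ(succ(succ(zero)))
  assert to_int(add(two)(three)) == 5   # 2 + 3, computed purely by composition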

BoorishBears:
As a person who builds stuff, I'm tired of these strawmen.

It is helpful that they chose words that are widely understood to represent input vs output.

They even used scare quotes to signal that they're not making some overly grand claim in terms of the long-tail implications of the terms.

-

A person reading the release would learn that Qwen previously had a VLM that could understand/see/perceive/whatever-word-you-want-to-use, and that now it can generate images, which you could call depicting/drawing/portraying/whatever-other-word-you-want-to-use.

We don't have to invent a crisis past that.

godelski:

  > As a person who builds stuff, I'm tired of these strawmen.
Who says I don't build stuff?[0]

Allow me to quote Knuth; I think we can agree he has built a lot of stuff:

  | If you find that you're spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you're spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice.
This is important. I don't know you or your beliefs, but some people truly believe theory is useless. Yet it's the foundation of everything we do.

  > We don't have to invent a crisis past that.
You're right, but I'm not inventing one. Qwen isn't the only party in this larger conversation. Look around the comments and see who can't tell the difference. Look at the announcements companies make. PhD-level intelligence? lol. So I suggest taking your own advice. I've made no strawman...

[0] In my undergrad I did experimental physics, not theory. I then worked as an aerospace engineer for years. I built a literal rocket engine. I built advanced radiation shielding that NASA uses. Then I came back to school, and my PhD is in CS. I build things. Don't assume that my wanting to understand things interferes with that. The truth is I'm good at building things because I spend time on theory. See Knuth.

BoorishBears:
I didn't say you don't build stuff: that diatribe is just very clearly written by someone speaking as an academic.

You're presumably intelligent enough to realize the writer here wasn't trying to define "understanding" from first principles.

And from a more practical mindset you'd hopefully realize it's not a useful expenditure of energy for them or the reader to enter the tarpit in the first place.

-

So far, if I extract the one practice-minded point you've touched on, it's much narrower: how the lack of generalization intersects with parties making claims about "PhD levels of intelligence" based on narrow benchmarks.

That's the conversation that can be had without resorting to strawmen or declaring an impasse on the language used to describe these systems until we've found terms that satisfy all other disciplines in addition to this one.

Maybe you've spent your life absorbing Knuth's essence and know better than me, but he strikes me as pragmatic enough to not fall for that trap either.

He even refers to LLMs as "X% intelligent" machines, after he decided that having someone else use ChatGPT on his behalf was the best way to evaluate it, right?

godelski:

  > I didn't say you don't build stuff
You're right.

  > You're presumably intelligent enough to realize
There are two types of people: those that can extrapolate from incomplete data