Qwen VLo: From “Understanding” the World to “Depicting” It

(qwenlm.github.io)

221 points lnyan | 2 comments | 27 Jun 25 14:35 UTC


rushingcreek ◴[27 Jun 25 14:51 UTC] No.44397235[source]▶

It doesn't seem to have open weights, which is unfortunate. One of Qwen's strengths historically has been their open-weights strategy, and it would have been great to have a true open-weights competitor to 4o's autoregressive image gen. There are so many interesting research directions that are only possible if we can get access to the weights.

If Qwen is concerned about recouping its development costs, I suggest looking at BFL's Flux Kontext Dev release from the other day as a model: let researchers and individuals get the weights for free and let startups pay for a reasonably-priced license for commercial use.

replies(4): >>44397843 #>>44397858 #>>44397893 #>>44398602 #

Jackson__ ◴[27 Jun 25 16:01 UTC] No.44397843[source]▶

>>44397235 #

It's also very clearly trained on OAI outputs, which you can tell from the orange tint to the images[0]. Did they even attempt to come up with their own data?

So it is trained off OAI, as closed off as OAI and most importantly: worse than OAI. What a bizarre strategy to gate-keep this behind an API.

[0]

https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...

replies(5): >>44397961 #>>44398084 #>>44398731 #>>44401456 #>>44418786 #

1. VladVladikoff ◴[27 Jun 25 17:40 UTC] No.44398731[source]▶

>>44397843 #

What would be the approximate cost of doing this? How many million API requests must be made? How many tokens in total?

replies(1): >>44399468 #

2. refulgentis ◴[27 Jun 25 19:17 UTC] No.44399468[source]▶

>>44398731 (TP) #

The most pedantically correct answer is "mu", because both answers are derived quantitatively from "How many images do you want to train on?", which in turn depends on a qualitative question that doesn't admit numbers ("How high quality do you want it to be?").

Let's say it's 100 images because you're doing a quick LoRA. That'd be about $5 at medium quality (~$0.05/image) or $1 at low (~$0.01/image).

Let's say you're training a standalone image model. The order of magnitude (OOM) for input images is ~1B, so $10M at low and $50M at medium.

Output is ~250 tokens/image for low and ~1000 for medium, which gets us to:

Fastest LoRA? $1-$5, 25,000 - 100,000 tokens out. All the training data for a new image model? $10M-$50M, 250B - 1T tokens out.
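For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic in Python. The per-image prices and token counts are just the approximate figures quoted in this comment (~$0.01 and ~250 tokens at low, ~$0.05 and ~1000 tokens at medium), not official pricing, and the names `Tier`, `TIERS`, and `estimate` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    usd_per_image: float   # assumed approximate price per generated image
    tokens_per_image: int  # assumed approximate output tokens per image


# Ballpark figures from the comment above; treat them as assumptions.
TIERS = {
    "low": Tier(usd_per_image=0.01, tokens_per_image=250),
    "medium": Tier(usd_per_image=0.05, tokens_per_image=1000),
}


def estimate(num_images: int, tier_name: str) -> tuple[float, int]:
    """Return (total cost in USD, total output tokens) for a dataset size."""
    tier = TIERS[tier_name]
    cost = num_images * tier.usd_per_image
    tokens = num_images * tier.tokens_per_image
    return cost, tokens


# Two scenarios: a quick LoRA (100 images) and a full model (~1B images).
for label, n in [("quick LoRA", 100), ("standalone model", 1_000_000_000)]:
    for tier_name in TIERS:
        cost, tokens = estimate(n, tier_name)
        print(f"{label:>16} | {tier_name:<6} | ${cost:,.0f} | {tokens:,} tokens")
```

The dataset size is still the input that drives everything, so the hard part remains the qualitative question of how many images you actually need.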
