LLaVA-O1: Let Vision Language Models Reason Step-by-Step

(arxiv.org)

177 points | lnyan | 1 comment | 18 Nov 24 09:44 UTC

Wilsoniumite ◴[18 Nov 24 13:56 UTC] No.42172299

>>42171043 (OP) #

That first page graph has a very interesting choice of x-axis.

replies(3): >>42172325 #>>42173922 #>>42174109 #

llm_nerd ◴[18 Nov 24 16:41 UTC] No.42174109

>>42172299 #

What's wrong with it? Among the graphed cohort, the average benchmark scores fell between 56 and 66, so they scaled the axis to 55-67. Truncating the axis to differentiate close values is completely normal, and it's weird how often it gets called out as deceptive.
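
To make the effect of that scaling concrete, here is a minimal sketch comparing how long two bars are drawn on a 55-67 axis versus a zero-based one. The two scores are simply the endpoints of the range quoted above, used as illustrative values rather than figures from the paper, and `drawn_fraction` is a hypothetical helper name.

```python
def drawn_fraction(score, axis_min, axis_max):
    """Fraction of the plotted axis width that a bar for `score` occupies."""
    return (score - axis_min) / (axis_max - axis_min)


# Illustrative scores only: the endpoints of the 56-66 range mentioned above.
low, high = 56.0, 66.0

# Ratio of drawn bar lengths on the truncated 55-67 axis.
truncated_ratio = drawn_fraction(high, 55, 67) / drawn_fraction(low, 55, 67)

# Ratio of drawn bar lengths when the axis starts at zero.
zero_based_ratio = drawn_fraction(high, 0, 67) / drawn_fraction(low, 0, 67)

print(f"truncated axis: {truncated_ratio:.2f}x, zero-based axis: {zero_based_ratio:.2f}x")
```

On the truncated axis the taller bar is drawn about 11 times as long as the shorter one, while the underlying scores differ by a factor of roughly 1.18. The truncation makes small gaps legible, and the axis labels carry the true scale.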

Further, this is a paper on arXiv, so the idea held by some that it's meant to deceive -- as if the target audience won't immediately look at the axis labels and, for more detail, dig into what the benchmarks even were -- is not convincing.

I'd reserve more criticism for the fact that their lead graphic specifically excludes models that beat it (e.g. GPT-4o, Sonnet), though these details can be found in the chart below.

Still interesting. And this "structuring AI" approach is how the next evolution in AI is happening.

replies(1): >>42176784 #

1. mdp2021 ◴[18 Nov 24 20:44 UTC] No.42176784

>>42174109 #

> What's wrong with it

Unfortunately, the practice of showing only the final slice of the bars coexists with that of showing the whole bars, so a better convention to distinguish the two would be beneficial.

For example, one could "break" the bars on the left side, similar to how bars that run too far on the right side are sometimes broken. I.e.:

  | ==//====|
  | ==//========|
  | ==//===|
  +----------------

...which is not uncommon practice already.
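
As a rough sketch of that convention, the following renders text bars whose left part is cut and marked with `//`. The function names, the `width` and `stub` parameters, and the three scores are all hypothetical, chosen only for illustration; the 55-67 range is the one mentioned earlier in the thread.

```python
def broken_bar(value, axis_min, axis_max, width=12, stub=2):
    """Render one horizontal bar with a '//' break marking the cut-off left part."""
    if not axis_min <= value <= axis_max:
        raise ValueError("value lies outside the plotted range")
    # Only the part of the bar above axis_min is drawn to scale.
    filled = round((value - axis_min) / (axis_max - axis_min) * width)
    # A short stub plus '//' signals that the bar does not start at zero.
    return "| " + "=" * stub + "//" + "=" * filled + "|"


def broken_chart(values, axis_min, axis_max, width=12, stub=2):
    """Render several broken bars above a simple axis line."""
    lines = [broken_bar(v, axis_min, axis_max, width, stub) for v in values]
    lines.append("+" + "-" * (width + stub + 4))
    return "\n".join(lines)


# Hypothetical scores on the 55-67 axis discussed in the thread.
print(broken_chart([59, 63, 58], axis_min=55, axis_max=67))
```

With these made-up scores the output has the same three bar shapes as the sketch above. The stub before the `//` has a fixed length, so only the portion after the break is proportional to the data.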