LLaVA-O1: Let Vision Language Models Reason Step-by-Step

(arxiv.org)

177 points lnyan | 2 comments | 18 Nov 24 09:44 UTC

Wilsoniumite ◴[18 Nov 24 13:56 UTC] No.42172299[source]▶

>>42171043 (OP) #

That first page graph has a very interesting choice of x-axis.

replies(3): >>42172325 #>>42173922 #>>42174109 #

jerpint ◴[18 Nov 24 14:00 UTC] No.42172325[source]▶

>>42172299 #

Sadly this is seen at so many prestigious ML conferences: a trimmed x-axis that makes a performance gain look significant when it's sometimes only incremental.

replies(1): >>42172545 #

exe34 ◴[18 Nov 24 14:28 UTC] No.42172545[source]▶

>>42172325 #

I think it's acceptable if you're trying to show subtle differences, but I would probably show the whole plot first, then the zoomed version, clearly labeled as "zoomed in for highlighting <.....>".

replies(1): >>42172869 #

nerdponx ◴[18 Nov 24 14:58 UTC] No.42172869[source]▶

>>42172545 #

You don't need to include 0 on every axis.

In this case they really made the axis numbers smaller than they should be, so it's hard to see that the scale is on the order of single digits. It looks like this is about a 3-4% improvement over GPT-4o-mini and Gemini Pro 1.5.

The bigger problem here is not the axis baseline, but the fact that I have no idea (as a non-AI-researcher) what benchmark this is, or if 0 is even the natural minimum. The caption should at least mention what the natural range of the x-axis is.
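
To make the baseline point concrete, here is a minimal sketch of how much a trimmed axis stretches the visual gap between two bars. The scores and the `bar_length_ratio` helper are made up for illustration; they are not taken from the paper.

```python
def bar_length_ratio(a, b, axis_min=0.0):
    """Ratio of drawn bar lengths for values a and b when the axis starts at axis_min."""
    if min(a, b) <= axis_min:
        raise ValueError("axis_min must be below both values")
    return (a - axis_min) / (b - axis_min)


# Hypothetical scores (assumed, not from the paper).
baseline, improved = 60.0, 64.0

# Axis starting at 0: bar lengths are proportional to the scores themselves.
full = bar_length_ratio(improved, baseline, axis_min=0.0)

# Axis trimmed to start at 56: the same 4-point gap now doubles the bar length.
trimmed = bar_length_ratio(improved, baseline, axis_min=56.0)

print(f"full axis:    improved bar is {full:.2f}x the baseline bar")
print(f"trimmed axis: improved bar is {trimmed:.2f}x the baseline bar")
```

Whether that stretch is misleading depends on the natural range of the metric, which is why stating the range in the caption would help.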

replies(1): >>42173053 #

1. Ukv ◴[18 Nov 24 15:15 UTC] No.42173053[source]▶

>>42172869 #

> the fact that I have no idea (as a non-AI-researcher) what benchmark this is

The figure labels it as "average score [on] 6 multimodal reasoning benchmarks", and the caption notes that the full results are in table 7, which lists those benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, Hallusion.

I think it's mostly fine as a lead diagram giving an overview before going into detail.

replies(1): >>42173140 #

2. nerdponx ◴[18 Nov 24 15:21 UTC] No.42173140[source]▶

>>42173053 (TP) #

Right, I don't need to know what they are, I just need to know what "64" means. Is the baseline actually 0? That detail is enough to avoid actually drawing 0 on the axis.
