
176 points lnyan | 9 comments
1. Wilsoniumite ◴[] No.42172299[source]
That first page graph has a very interesting choice of x-axis.
replies(3): >>42172325 #>>42173922 #>>42174109 #
2. jerpint ◴[] No.42172325[source]
Sadly this is seen at so many prestigious ML conferences: a trimmed x-axis that makes performance gains seem significant when they're sometimes merely incremental.
replies(1): >>42172545 #
3. exe34 ◴[] No.42172545[source]
I think it's acceptable if you're trying to show subtle differences - but I would probably show the whole plot and then the zoomed version, clearly labeled as "zoomed in for highlighting <.....>"
replies(1): >>42172869 #
4. nerdponx ◴[] No.42172869{3}[source]
You don't need to include 0 on every axis.

In this case they made the axis numbers smaller than they should be, so it's hard to see that the scale is on the order of single digits. It looks like this is about a 3-4% improvement over GPT-4o-mini and Gemini Pro 1.5.

The bigger problem here is not the axis baseline, but the fact that I have no idea (as a non-AI-researcher) what benchmark this is, or if 0 is even the natural minimum. The caption should at least mention what the natural range of the x-axis is.

replies(1): >>42173053 #
5. Ukv ◴[] No.42173053{4}[source]
> the fact that I have no idea (as a non-AI-researcher) what benchmark this is

The figure labels it as "average score [on] 6 multimodal reasoning benchmarks", and the caption notes that the full results are in table 7 - which lists those benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, Hallusion.

I think it's mostly fine as a lead diagram giving an overview before going into detail.

replies(1): >>42173140 #
6. nerdponx ◴[] No.42173140{5}[source]
Right, I don't need to know what they are, I just need to know what "64" means. Is the baseline actually 0? Knowing that detail would be enough to avoid actually drawing 0 on the axis.
7. jdonaldson ◴[] No.42173922[source]
"Convincing you is more important than informing you"

Always a pass from me, gets things off on the wrong foot right away.

8. llm_nerd ◴[] No.42174109[source]
What's wrong with it? Among the graphed cohort the average benchmark score was between 56 and 66, so they scaled the axis to 55-67. Such a strategy to differentiate the cohort is completely normal, and it's weird how often this is called out as being deceptive.
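That scaling is just a small padding around the data range; a sketch of the idea (the helper and values here are hypothetical, not from the paper):

```python
def padded_limits(values, pad=1):
    """Frame the data range with a small margin instead of anchoring
    the axis at zero -- the kind of scaling described above."""
    return (min(values) - pad, max(values) + pad)

# Hypothetical benchmark averages spanning the 56-66 range mentioned above:
print(padded_limits([56, 58, 60, 63, 66]))  # -> (55, 67)
```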

Further, this is a paper on arXiv, so the idea, held by some, that it's meant to deceive -- as if the target audience isn't going to immediately look at the axis labels, and dig further into what the benchmarks even were -- is not convincing.

I'd reserve more criticism for the fact that their lead graphic specifically excludes options that beat it (e.g. GPT-4o, Sonnet), though these details can be found in the chart below.

Still interesting. And this "structuring AI" approach is how the next evolution in AI is happening.

replies(1): >>42176784 #
9. mdp2021 ◴[] No.42176784[source]
> What's wrong with it

Unfortunately the practice of showing only a slice of the axis coexists with that of showing the whole bars, so a better convention to distinguish the two would be beneficial.

For example, "breaking" the bars on the left side, similarly to how bars that run too far off the right side get broken. I.e.:

  | ==//====|
  | ==//========|
  | ==//===|
  +----------------
...which is not uncommon practice already.
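That broken-bar rendering can be sketched in a few lines (purely illustrative; the function name, scores, and axis minimum are all made up):

```python
def broken_bar(value, axis_min):
    """Render one text bar whose clipped baseline is marked with '//',
    making it explicit that the axis does not start at zero."""
    # One '=' per unit above the (non-zero) axis minimum.
    return "| ==//" + "=" * (value - axis_min) + "|"

# Hypothetical scores on an axis clipped at 55:
for score in (59, 63, 58):
    print(broken_bar(score, axis_min=55))
```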