In this case they really made the numbers smaller than they should be, so it's hard to see that the scale is on the order of single digits. It looks like this is about a 3-4% improvement over GPT-4o-mini and Gemini Pro 1.5.
The bigger problem here is not the axis baseline, but the fact that I have no idea (as a non-AI-researcher) what benchmark this is, or if 0 is even the natural minimum. The caption should at least mention what the natural range of the x-axis is.
The figure labels it as "average score [on] 6 multimodal reasoning benchmarks", and the caption notes that the full results are in table 7 - which lists those benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, Hallusion
I think it's mostly fine as a lead diagram giving an overview before going into detail.
Consider their Proposed Method:
"Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.
These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively. Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.
As with OpenAI o1 [63], all stages are completed by the model in a single inference pass."
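To make the tag scheme concrete, here is a minimal Python sketch of how one might split such a response into its four stages. The tag names come from the quote above; the `split_stages` helper and the sample response are made up for illustration and are not the paper's code.

```python
import re

# Tag names as listed in the quoted passage.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def split_stages(response: str) -> dict:
    """Return {stage: text} for each tag pair found in the response.

    Hypothetical helper, not from the paper. Stages the model skipped
    are simply absent from the result.
    """
    found = {}
    for stage in STAGES:
        # Non-greedy match so each tag pair captures only its own body.
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        if match:
            found[stage] = match.group(1).strip()
    return found

# Made-up response text, for illustration only.
sample = (
    "<SUMMARY>Count the objects, then subtract.</SUMMARY>"
    "<CAPTION>The image shows three balls and one cube.</CAPTION>"
    "<REASONING>3 balls + 1 cube = 4 objects; removing the cube leaves 3.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
stages = split_stages(sample)
print(stages["CONCLUSION"])
```

Since the quote says the model picks stages "as needed", the sketch tolerates missing tags rather than requiring all four.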
Always a pass from me, gets things off on the wrong foot right away.
Further, this is a paper on arXiv, so the idea by some that it's meant to deceive -- as if the target audience isn't going to immediately look at the axis labels and, for more, dig into what the benchmarks even were -- is not convincing.
I'd hold more criticism for the fact that their lead graphic specifically excludes options which beat it (e.g. GPT-4o, Sonnet), though these details can be found in the chart below.
Still interesting. And this "structuring AI" approach is how the next evolution in AI is happening.
Quote from this paper: “Moreover, they (VLM) frequently deviate from a logical reasoning toward conclusions, instead of presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
I care about whether these VLMs can accurately _see_ and _describe_ things in a picture. Meanwhile, the vision part of these benchmarks is mostly extremely basic OCR that any VLM of the past year can do. The gains in score come from the LM improving its logic skills, not from the actual vision ability improving.
For instance, if I have a CAD model of a screw fastened to a wall, can I teach it that it's a screw fastened to a wall?
I have years' worth of potential training data.
Consider this a multi-million dollar problem.
I watched a professor lecture on the likely candidates for what the open source LLM community thinks is going on in o1[0], and I'm not convinced it's still simple pattern matching. [0] https://youtu.be/6PEJ96k1kiw
I'm not so confident that humans reason in a fundamentally different way than pattern matching. Perhaps paradigms focused on predicting the next token are too limiting. Reasoning plausibly involves pattern matching relevant schema representations, then executing along that schema. The ability to intuit that an existing schema is applicable to a certain situation is a good measure of intelligence, IMO. Could even make a good LLM metric.
And let's also be fair, it would take a lot of effort for a human to generalize to a previously unseen pattern as well, so I always wonder just how useful it is to try to make such binary statements as "models don't reason" or they're "stochastic parrots". But maybe it's to counterweigh the statements that they are sentient, AGI is here, etc?
Unfortunately the practice of showing the latter slice runs along that of showing the whole bars, so a better convention to distinguish the two would be beneficial.
For example, "breaking" the bars (on the left side), similarly to when some bars run too far on the right side. I.e.:
| ==//====|
| ==//========|
| ==//===|
+----------------
...which is not uncommon practice already.
After having formulated an idea, do you put it on your intellectual bench and re-examine it, purposefully, analytically? Well, that is more than plain pattern matching over intellectual keys - it is procedural.
And what about those intellectual keys or «schemas», how are they generated? Through a verification, consolidation that is further to the original (pattern matching) intuition.
Can you show conclusively that LLMs can't do this or don't already do this to some degree?
I have skimmed through another relevant piece today: it seems we are not proceeding with adequate pace with the interpretation of the internals, with the gained "transparency" of the architecture...
The extent to which LLM "reasoning" really is reasoning similar to humans', or something of a strictly weaker class entirely, is a subject of active research.
Personally I'm of the opinion human reasoning is really just "pattern matching", but we're also still waiting for the cognitive scientists to give us an answer on that one.
There are more interpretations of "pattern matching".
Of course it seems a fundamental component of generating ideas, but then those ideas are put - by intellectuals - on a bench and criticized actively. The two activities have important differences. First you look and go "they seem four", but then you count to be sure.
The second part is absolutely critical to a well-working reasoner.
Llama can be trusted to summarize and format information, and some of the other models can be OK coding assistants, but when I was showing Ollama off to a friend I struggled to think of anything useful other than a party trick of "yup that's what is in the picture".
Obviously it would be useful to blind people, but the hard part is finding a use where the person couldn't just look at the picture. Possibly it could be used on a security camera and combined with a basic keyword alert, but I imagine there are a lot of false positives and false negatives.
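A rough Python sketch of the keyword-alert idea. It assumes some VLM has already produced a caption string per frame; the captions, the keyword list, and the `keyword_alerts` helper are all invented for illustration. It also shows how easily a false positive happens: plain keyword matching can't tell "a person" from "no person".

```python
import re

def keyword_alerts(caption: str, keywords: set) -> set:
    """Return the keywords that appear as whole words in the caption.

    Hypothetical helper: the caption is assumed to come from a VLM
    describing a single camera frame.
    """
    words = set(re.findall(r"[a-z']+", caption.lower()))
    return keywords & words

# Made-up alert list and captions, for illustration only.
ALERT_WORDS = {"person", "car", "package"}

hits = keyword_alerts("A person is leaving a package at the door.", ALERT_WORDS)
# Negation is invisible to keyword matching, so this one still fires.
false_hit = keyword_alerts("An empty porch, no person in view.", ALERT_WORDS)
print(hits, false_hit)
```

False negatives go the other way: a caption that says "someone" or "a man" instead of "person" would slip past this list entirely.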