
LLaVA-O1: Let Vision Language Models Reason Step-by-Step

(arxiv.org)

1. Wilsoniumite ◴[18 Nov 24 13:56 UTC] No.42172299[source]▶

>>42171043 (OP) #

That first page graph has a very interesting choice of x-axis.

replies(3): >>42172325 #>>42173922 #>>42174109 #

2. jerpint ◴[18 Nov 24 14:00 UTC] No.42172325[source]▶

>>42172299 #

Sadly, this shows up at so many prestigious ML conferences: a trimmed x-axis that makes performance gains look significant when they're sometimes only incremental.

replies(1): >>42172545 #

3. exe34 ◴[18 Nov 24 14:28 UTC] No.42172545{3}[source]▶

>>42172325 #

I think it's acceptable if you're trying to show subtle differences - but I would probably put the whole plot and then the zoomed version and clearly label it as "zoomed in for highlighting <.....>"

replies(1): >>42172869 #

4. nerdponx ◴[18 Nov 24 14:58 UTC] No.42172869{4}[source]▶

>>42172545 #

You don't need to include 0 on every axis.

In this case they really made the numbers smaller than they should be, so it's hard to see that the scale is on the order of single digits. It looks like this is about a 3-4% improvement over GPT-4o-mini and Gemini Pro 1.5.

The bigger problem here is not the axis baseline, but the fact that I have no idea (as a non-AI-researcher) what benchmark this is, or if 0 is even the natural minimum. The caption should at least mention what the natural range of the x-axis is.

replies(1): >>42173053 #

5. Ukv ◴[18 Nov 24 15:15 UTC] No.42173053{5}[source]▶

>>42172869 #

> the fact that I have no idea (as a non-AI-researcher) what benchmark this is

The figure labels it as "average score [on] 6 multimodal reasoning benchmarks", and the caption notes that the full results are in table 7, which lists those benchmarks: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, Hallusion.

I think it's mostly fine as a lead diagram giving an overview before going into detail.

replies(1): >>42173140 #

6. nerdponx ◴[18 Nov 24 15:21 UTC] No.42173140{6}[source]▶

>>42173053 #

Right, I don't need to know what they are, I just need to know what "64" means. Is the baseline actually 0? That detail is enough to avoid actually drawing 0 on the axis.

7. tucnak ◴[18 Nov 24 15:56 UTC] No.42173578[source]▶

>>42171043 (OP) #

The o1 connection is made through "Evaluation of openai o1: Opportunities and challenges of AGI" [63], a paper mill product with 50 or so authors. They created that 280-page monstrosity within two weeks of the o1 release. Did I miss something? AFAIK, there's no published literature from OpenAI on o1, and nobody knows exactly what o1 is doing, but it seems the Chinese have figured it out in a matter of days... They say their model performs well on visual benchmarks, but I suspect that is probably down to overfitting on these benchmarks in the first place.

Consider their Proposed Method:

"Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.

These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively. Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.

As with OpenAI o1 [63], all stages are completed by the model in a single inference pass."

[63]: https://arxiv.org/pdf/2409.18486
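
A minimal sketch of how a response using those four tags could be split into its stages. Only the tag names come from the quoted passage; the function name `parse_stages` and the sample response are made up for illustration and are not taken from the paper or its code.

```python
import re

# Tag names are taken from the quoted passage; everything else is illustrative.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(response: str) -> dict[str, str]:
    """Return the text inside each stage tag that appears in the response."""
    stages = {}
    for name in STAGES:
        # DOTALL lets a stage span multiple lines; .*? stops at the first closing tag.
        match = re.search(rf"<{name}>(.*?)</{name}>", response, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


# Hypothetical model output, invented for this example.
example = (
    "<SUMMARY>Count the screws in the image.</SUMMARY>"
    "<CAPTION>A wall panel with three screws.</CAPTION>"
    "<REASONING>Each screw head is visible; there are three.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
parsed = parse_stages(example)
print(parsed["CONCLUSION"])
```

Stages the model chose not to emit are simply absent from the returned dict, which matches the claim that the model selects tags "as needed".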

8. jdonaldson ◴[18 Nov 24 16:23 UTC] No.42173922[source]▶

>>42172299 #

"Convincing you is more important than informing you"

Always a pass from me, gets things off on the wrong foot right away.

9. llm_nerd ◴[18 Nov 24 16:41 UTC] No.42174109[source]▶

>>42172299 #

What's wrong with it? Among the graphed cohort the average benchmark score was between 56 - 66, so they scaled to 55-67. Such a strategy to differentiate is completely normal, and it's weird how often this is called out as being deceptive.

Further, this is a paper on arXiv, so the idea held by some that it's meant to deceive -- as if the target audience isn't going to immediately look at the axis labels and, for more detail, dig into what the benchmarks even were -- is not convincing.

I'd hold more criticism for the fact that their lead graphic specifically excludes options which beat it (e.g. GPT-4o, Sonnet), though these details can be found in the chart below.

Still interesting. And this "structuring AI" approach is how the next evolution in AI is happening.
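
A rough way to put numbers on the disagreement, using only the endpoints mentioned above (scores of 56 and 66, axis starting at 55). The helper `drawn_ratio` is a made-up name, and the two values are the ends of the range, not any particular model's score.

```python
def drawn_ratio(a: float, b: float, axis_min: float) -> float:
    """Ratio of bar lengths for values a and b when the axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)


low, high = 56.0, 66.0  # ends of the score range quoted in the comment

true_ratio = drawn_ratio(low, high, axis_min=0.0)        # 66 / 56
truncated_ratio = drawn_ratio(low, high, axis_min=55.0)  # 11 / 1

print(f"axis from 0:  longest bar is {true_ratio:.2f}x the shortest")
print(f"axis from 55: longest bar is {truncated_ratio:.2f}x the shortest")
```

The underlying scores differ by about 18%, while the bars drawn on the 55-67 axis differ by a factor of 11. Whether that counts as differentiation or exaggeration depends on whether readers look at the bar lengths or at the axis labels.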

replies(1): >>42176784 #

10. startupsfail ◴[18 Nov 24 17:36 UTC] No.42174762[source]▶

>>42171043 (OP) #

Generating data with an OpenAI model AND copying the approach from an OpenAI model. This is a bit unsatisfactory; it's like saying you wrote some working code when in fact you decompiled the binary and then compiled it again.

replies(1): >>42175872 #

11. yalok ◴[18 Nov 24 17:47 UTC] No.42174863[source]▶

>>42171043 (OP) #

To me, this quote sums up the main secret sauce: once the model generates a wrong token or phrase, the whole answer goes south. It also basically explains why the whole CoT approach works. It keeps the LLM from generating a wrong answer with two tricks: 1) explicitly asking the LLM to generate intermediate steps instead of a final answer, and 2) using beam search (filtering from several answers at each stage) to further reduce the risk of picking a wrong answer.

Quote from this paper: “ Moreover, they (VLM) frequently deviate from a logical reasoning toward conclusions, instead of presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
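
A toy sketch of the second trick as described here: propose several candidates per stage, keep the best one, and only then move on. This is not the paper's implementation. `propose` and `score` are hypothetical stand-ins for a model sampling call and a candidate-ranking step, and the toy versions below exist only so the loop runs.

```python
from typing import Callable, List


def stage_level_search(
    stages: List[str],
    propose: Callable[[str, str, int], List[str]],
    score: Callable[[str, str], float],
    n_candidates: int = 3,
) -> str:
    """Build a response stage by stage, keeping the best candidate at each stage."""
    context = ""
    for stage in stages:
        candidates = propose(context, stage, n_candidates)
        # Commit to the highest-scoring candidate before generating the next stage,
        # so an early mistake is filtered out instead of being built upon.
        best = max(candidates, key=lambda c: score(context, c))
        context += best
    return context


# Toy stand-ins (assumptions): a real system would sample from a VLM and rank
# candidates with a judge or scoring model.
def toy_propose(context: str, stage: str, n: int) -> List[str]:
    return [f"[{stage} v{i}]" for i in range(n)]


def toy_score(context: str, candidate: str) -> float:
    # Pretend that a higher version number means a better candidate.
    return float(candidate.rsplit("v", 1)[1].rstrip("]"))


result = stage_level_search(
    ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"], toy_propose, toy_score
)
print(result)
```

Keeping a single survivor per stage makes this a greedy special case; a wider beam would carry several partial responses forward instead of one.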

12. Jackson__ ◴[18 Nov 24 18:01 UTC] No.42175021[source]▶

>>42171043 (OP) #

Figure 2 in the paper shows what I really dislike about a lot of vision model benchmarks.

I care about whether these VLMs can accurately _see_ and _describe_ things in a picture. Meanwhile, the vision part of these benchmarks is a lot of extremely basic OCR that any VLM of the past year can do. The gains in score come from the LM improving its logic skills, not from the actual vision ability improving.

13. resource_waste ◴[18 Nov 24 18:55 UTC] No.42175636[source]▶

>>42171043 (OP) #

What are the options for fine-tuning?

For instance, if I have a CAD model of a screw fastened to a wall, can I teach it that it's a screw fastened to a wall?

I have years worth of potential training data.

Consider this a multi-million dollar problem.

replies(3): >>42175720 #>>42180165 #>>42183428 #

14. abdussamit ◴[18 Nov 24 19:02 UTC] No.42175720[source]▶

>>42175636 #

This is quite an interesting problem. Hope you find a solution to this, and wish I had the right knowledge to work on it

15. exe34 ◴[18 Nov 24 19:19 UTC] No.42175872[source]▶

>>42174762 #

well if you have working code at the end, you made progress. closedAI can pull any model at any time for a profit.

replies(1): >>42200516 #

16. mptest ◴[18 Nov 24 20:09 UTC] No.42176357[source]▶

>>42175961 #

Has it been shown conclusively for o1? I'd love to read the paper. I recall the Apple paper about non-reasoning, where fuzzed question data caused performance degradation, which got a lot of traction, but IIRC o1 had pretty resilient performance compared to previous models (and to be clear, I agree with your sentiment towards those). I just have yet to see definitive data showing that o1 is not fundamentally more resilient to the types of tests we use to discern "reasoning" from "pattern matching".

I watched a professor lecture on the likely candidates for what the open source llm community think is going on in o1[0] and I'm not convinced it's still simple pattern matching. [0] https://youtu.be/6PEJ96k1kiw

17. SubiculumCode ◴[18 Nov 24 20:15 UTC] No.42176413[source]▶

>>42175961 #

Can you provide an example or link?

I'm not so confident that humans reason in a fundamentally different way than pattern matching. Perhaps paradigms focused on predicting the next token are too limiting. Reasoning plausibly involves pattern matching relevant schema representations, then executing along that schema. The ability to intuit that an existing schema is applicable to a certain situation is a good measure of intelligence, IMO. Could even make a good LLM metric.

replies(1): >>42176906 #

18. blixt ◴[18 Nov 24 20:35 UTC] No.42176666[source]▶

>>42175961 #

I don't completely disagree but I believe it's a bit more fuzzy than that. From what I understand, the models learn a very compressed version of what they receive as input and produce as output. While not sufficient to generalize, you could say they memorize some very high-dimensional function to cause the expected text to be produced, and they can turn on and combine multiple of these functions (multiply by non-zero, sum, etc). So on some level an LLM can kind of perform logic on the input, even if it has a slightly novel pattern. But at the same time, no model is shown to completely generalize the way a human would.

And let's also be fair, it would take a lot of effort for a human to generalize to a previously unseen pattern as well, so I always wonder just how useful it is to try to make such binary statements as "models don't reason" or they're "stochastic parrots". But maybe it's to counterweigh the statements that they are sentient, AGI is here, etc?

19. snats ◴[18 Nov 24 20:38 UTC] No.42176707[source]▶

>>42171043 (OP) #

This paper is not comparing against MOLMO or Qwen, so I would take it with a grain of salt

20. mdp2021 ◴[18 Nov 24 20:44 UTC] No.42176784{3}[source]▶

>>42174109 #

> What's wrong with it

Unfortunately, the practice of showing only the final slice of the bars coexists with that of showing the whole bars, so a better convention to distinguish the two would be beneficial.

For example, "breaking" the bars (on the left side), similarly to when some bars run too far on the right side. I.e.:

  | ==//====|
  | ==//========|
  | ==//===|
  +----------------

...which is not uncommon practice already.
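
A small sketch of that convention as a text renderer: every bar starts with a fixed `==//` stub to signal that the axis does not start at zero, and only the part above the cutoff is drawn to scale. The function name `broken_bars` and the scores are made up for illustration.

```python
def broken_bars(values: dict, axis_min: float, width: int = 20, stub: str = "==//") -> str:
    """Render horizontal bars with a break mark signalling a truncated axis."""
    axis_max = max(values.values())
    lines = []
    for label, value in values.items():
        # Only the portion above axis_min is drawn to scale.
        n = round((value - axis_min) / (axis_max - axis_min) * width)
        lines.append(f"{label:>8} | {stub}{'=' * n}| {value}")
    return "\n".join(lines)


scores = {"model A": 56, "model B": 61, "model C": 66}  # made-up numbers
chart = broken_bars(scores, axis_min=55)
print(chart)
```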

21. mdp2021 ◴[18 Nov 24 20:55 UTC] No.42176906{3}[source]▶

>>42176413 #

> humans reason in a fundamentally different way

After having formulated an idea, do you put it on your intellectual bench and re-examine it, purposefully, analytically? Well, that is more than plain pattern matching over intellectual keys - it is procedural.

And what about those intellectual keys or «schemas», how are they generated? Through a verification, consolidation that is further to the original (pattern matching) intuition.

replies(1): >>42178329 #

22. blovescoffee ◴[18 Nov 24 21:37 UTC] No.42177368[source]▶

>>42175961 #

You’re going to PSA an opinion?

23. stevenhuang ◴[18 Nov 24 23:25 UTC] No.42178329{4}[source]▶

>>42176906 #

> After having formulated an idea, do you put it on your intellectual bench and re-examine it, purposefully, analytically?

Can you show conclusively that LLMs can't do this or don't already do this to some degree?

replies(1): >>42178373 #

24. mdp2021 ◴[18 Nov 24 23:28 UTC] No.42178373{5}[source]▶

>>42178329 #

Not "anatomically": only from the results.

I have skimmed through another relevant piece today: it seems we are not proceeding with adequate pace with the interpretation of the internals, with the gained "transparency" of the architecture...

replies(1): >>42178573 #

25. stevenhuang ◴[18 Nov 24 23:51 UTC] No.42178573{6}[source]▶

>>42178373 #

Precisely. The architecture is transparent but the latent representations within and the operations performed by LLMs are not.

The extent to which LLM "reasoning" really is reasoning similar to humans', or something of a strictly weaker class entirely, is a subject of active research.

Personally I'm of the opinion human reasoning is really just "pattern matching", but we're also still waiting for the cognitive scientists to give us an answer on that one.

replies(1): >>42181987 #

26. heyitsguay ◴[19 Nov 24 04:49 UTC] No.42180165[source]▶

>>42175636 #

What is your training data? Does it have both vision and descriptive components?

replies(1): >>42182122 #

27. mdp2021 ◴[19 Nov 24 10:40 UTC] No.42181987{7}[source]▶

>>42178573 #

> I'm of the opinion human reasoning is really just "pattern matching"

There are more interpretations of "pattern matching".

Of course it seems a fundamental component of generating ideas, but then those ideas are put - by intellectuals - on a bench and criticized actively. The two activities have important differences. First you look and go "they seem four", but then you count to be sure.

The second part is absolutely critical to determine a well working reasoner.

28. resource_waste ◴[19 Nov 24 10:59 UTC] No.42182122{3}[source]▶

>>42180165 #

yes

29. htrp ◴[19 Nov 24 13:53 UTC] No.42183428[source]▶

>>42175636 #

There are numerous companies in the construction space working on these types of problems. I'm sure some of them will reach out to you (if they haven't already).

30. Larrikin ◴[19 Nov 24 14:14 UTC] No.42183635[source]▶

>>42171043 (OP) #

Has anyone found a use for LLAVA yet?

LLAMA can be trusted to summarize and format information, and some of the other models can be OK coding assistants, but when I was showing Ollama off to a friend I struggled to think of anything useful other than a party trick of "yup that's what is in the picture".

Obviously it would be useful to blind people, but the hard part is using it for something where the person could just look at the picture. Possibly could be used on a security camera and combined with a basic keyword alert, but I imagine there's a lot of false positives and false negatives.

replies(1): >>42183818 #

31. ac1spkrbox ◴[19 Nov 24 14:31 UTC] No.42183818[source]▶

>>42183635 #

Multimodal models are useful for lots of things! They can accomplish a range of tasks, from zero-shot image classification to helping perform Retrieval-Augmented Generation on images. As with many generative models, I find the utility comes not necessarily from outperforming a human, but from scaling a task that a human wouldn't want to do (or won't do cheaply).

32. startupsfail ◴[21 Nov 24 02:35 UTC] No.42200516{3}[source]▶

>>42175872 #

Yes, agreed. And it’s not like OpenAI isn’t doing the same thing, in a sense. Data was originally sampled from human annotations.
