LLaVA-O1: Let Vision Language Models Reason Step-by-Step

The o1 connection is made through "Evaluation of OpenAI o1: Opportunities and Challenges of AGI" [63], a paper-mill product with 50 or so authors. They produced that 280-page monstrosity within two weeks of the o1 release. Did I miss something? AFAIK, OpenAI has published no literature on o1, and nobody knows exactly what o1 is doing, but it seems the Chinese have figured it out in a matter of days... They say their model performs well on visual benchmarks, but I suspect that is because they overfit to those benchmarks in the first place.

Consider their Proposed Method:

"Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.

These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively. Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.

As with OpenAI o1 [63], all stages are completed by the model in a single inference pass."
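
To make the quoted format concrete, here is a minimal sketch of how such a tagged response could be split into its stages. It assumes the model emits the four tag pairs as plain text in one response; the `parse_stages` helper and the sample string are hypothetical illustrations, not code or output from the paper.

```python
import re

# The four stage tags named in the quoted method description.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(text):
    """Return {stage_name: content} for every tag pair found in text.

    Hypothetical helper: stages the model chose not to emit are simply absent.
    """
    stages = {}
    for name in STAGES:
        # Non-greedy match so each pair is captured on its own;
        # DOTALL lets a stage span multiple lines.
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


# Made-up response for illustration only.
sample = (
    "<SUMMARY>Count the apples.</SUMMARY>"
    "<CAPTION>Three apples on a table.</CAPTION>"
    "<REASONING>Each apple is distinct; 1 + 1 + 1 = 3.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
parsed = parse_stages(sample)
```

Since everything arrives in a single pass, the only thing separating the "stages" is the markup itself; a response with no tags yields an empty dictionary.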

[63]: https://arxiv.org/pdf/2409.18486