This quote captures the main secret sauce for me: once the model generates a wrong token or phrase, the whole answer goes south. It also explains why the CoT approach works at all; it keeps the LLM from committing to a wrong answer with two tricks: 1) explicitly ask the LLM to generate intermediate reasoning steps instead of jumping straight to a final answer, and 2) use beam search (keeping only the best of several candidate continuations at each stage) to further reduce the risk of locking onto a wrong path. A rough sketch of both tricks follows the quote below.
Quote from the paper: “Moreover, they (VLMs) frequently deviate from a logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
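To make the idea concrete, here is a minimal, hypothetical sketch of how the two tricks combine into a stage-level beam search over reasoning steps. `generate_step` and `score_step` are stand-ins I made up for an LLM call and some ranking signal (e.g. a verifier or self-evaluation prompt); they are not taken from the paper's code, and the dummy implementations below just return random text/scores so the loop runs.

```python
import random
from typing import List

def generate_step(prefix: str) -> str:
    """Hypothetical LLM call: continue `prefix` with one more reasoning step."""
    return prefix + f" step-{random.randint(0, 999)};"

def score_step(candidate: str) -> float:
    """Hypothetical scorer: rate how plausible the partial reasoning looks."""
    return random.random()

def stage_beam_search(question: str, num_stages: int = 4,
                      beam_width: int = 2, samples_per_beam: int = 4) -> str:
    # Trick 1: the prompt explicitly asks for intermediate steps,
    # not a final answer.
    beams: List[str] = [f"Question: {question}\nLet's reason step by step:"]
    for _ in range(num_stages):
        # Trick 2: sample several continuations per beam and keep only the
        # highest-scoring ones, so a single wrong step is less likely to drag
        # the whole answer down a flawed path.
        candidates = [generate_step(b)
                      for b in beams
                      for _ in range(samples_per_beam)]
        candidates.sort(key=score_step, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

print(stage_beam_search("What is 17 * 24?"))
```

The point of filtering at every stage (rather than only ranking complete answers) is exactly the failure mode in the quote: since generation is token-by-token, pruning a bad intermediate step early is much cheaper than trying to recover after a premature conclusion has already been written.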