Figure 2 in the paper shows what I really dislike about a lot of vision model benchmarks.
I care about whether these VLMs can accurately _see_ and _describe_ things in a picture. Meanwhile, the vision portion of these benchmarks is mostly very basic OCR that any VLM from the past year can handle. The gains in score come from the underlying LM getting better at logic, not from any improvement in actual vision ability.