(www.sergey.fyi)

1303 points serjester | 2 comments | 05 Feb 25 18:05 UTC

1. bt3 ◴[05 Feb 25 19:05 UTC] No.42953466[source]▶

One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes for photographs, but identifying sections of text or digital graphics like icons in a presentation is still very hit-and-miss.

[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...

replies(1): >>42953840 #

2. maeil ◴[05 Feb 25 19:31 UTC] No.42953840[source]▶

>>42953466 (TP) #

Have you seen any models that perform better at this? I last looked into this a year ago, but at the time they were indeed quite bad at it across the board.

Ingesting PDFs and why Gemini 2.0 changes everything