1303 points serjester | 1 comment
bt3 ◴[] No.42953466[source]
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes for photographs, but identifying sections of text or digital graphics, like icons in a presentation, is still very hit-and-miss.

--

[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...

replies(1): >>42953840 #
1. maeil ◴[] No.42953840[source]
Have you seen any models that perform better at this? I last looked into this a year ago, and at the time they were indeed quite bad at it across the board.