1303 points serjester | 2 comments
1. bt3 ◴[] No.42953466[source]
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes in photographs, but locating sections of text or digital graphics, like icons in a presentation, is still very hit and miss.

--

[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...

replies(1): >>42953840 #
2. maeil ◴[] No.42953840[source]
Have you seen any models that perform better at this? I last looked into this a year ago, but at the time they were indeed quite bad at it across the board.