
612 points meetpateltech | 1 comments
jbarrow No.42952017
I've been very impressed by Gemini 2.0 Flash for multimodal tasks, including object detection and localization[1], as well as document tasks. But the 15-requests-per-minute limit was a severe constraint while the model was experimental. I'm really excited to be able to actually _do_ things with the model.
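As a concrete illustration of the localization use case: Gemini 2.0 Flash returns bounding boxes as `[y_min, x_min, y_max, x_max]` coordinates normalized to a 0-1000 range, which you then rescale to your image's pixel dimensions. A minimal sketch of that rescaling step, assuming the model was prompted to reply with a JSON list of `{"label", "box_2d"}` detections (the field names and mocked response here are illustrative):

```python
# Sketch: converting Gemini-style normalized bounding boxes to pixels.
# Gemini reports box_2d as [y_min, x_min, y_max, x_max] on a 0-1000
# scale; the response shape below is an assumed prompt-driven format.
import json

def to_pixel_boxes(response_text: str, width: int, height: int) -> list[dict]:
    """Parse a JSON list of {"label", "box_2d"} detections and
    rescale each normalized box to pixel coordinates."""
    detections = json.loads(response_text)
    boxes = []
    for det in detections:
        y0, x0, y1, x1 = det["box_2d"]
        boxes.append({
            "label": det["label"],
            # (x_min, y_min, x_max, y_max) in pixels
            "box": (
                int(x0 / 1000 * width),
                int(y0 / 1000 * height),
                int(x1 / 1000 * width),
                int(y1 / 1000 * height),
            ),
        })
    return boxes

# Example with a mocked model response for a 1024x768 image:
mock = '[{"label": "invoice_total", "box_2d": [100, 250, 200, 750]}]'
print(to_pixel_boxes(mock, 1024, 768))
```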

In my experience, I'd reach for Gemini 2.0 Flash over 4o in a lot of multimodal/document use cases. Especially given the differences in price ($0.10/million input tokens and $0.40/million output tokens versus $2.50/million input and $10.00/million output).
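To put that price gap in concrete terms, here is a quick cost calculation using the per-million-token prices quoted above; the workload numbers (50M input, 5M output tokens) are made up for illustration:

```python
# Cost comparison using the per-million-token prices quoted above.
PRICES = {
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},   # $/1M tokens
    "gpt-4o":           {"input": 2.50, "output": 10.00},  # $/1M tokens
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in dollars for a given token workload."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical document pipeline: 50M input tokens, 5M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 50_000_000, 5_000_000):.2f}")
# gemini-2.0-flash: $7.00
# gpt-4o: $175.00
```

At these list prices the same workload is 25x cheaper on Gemini 2.0 Flash.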

That being said, Qwen2.5 VL 72B and 7B seem even better at document image tasks and localization.

[1] https://notes.penpusher.app/Misc/Google+Gemini+101+-+Object+...

replies(1): >>42952471 #
Alifatisk No.42952471
> In my experience, I'd reach for Gemini 2.0 Flash over 4o

Why not use o1-mini?

replies(1): >>42952648 #
jbarrow No.42952648
Mostly because OpenAI's vision offerings aren't particularly compelling:

- 4o can't really do localization, and in my experience is worse than Gemini 2.0 and Qwen2.5 at document tasks

- 4o mini isn't cheaper than 4o for images, because it uses far more tokens per image (~5600 tokens/tile vs ~170/tile, where each tile is 512x512 pixels)

- o1 has support for vision but is wildly expensive and slow

- o3-mini doesn't yet have support for vision, and o1-mini never did
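The 4o mini point above comes down to simple arithmetic. A rough sketch, using the per-tile figures from the comment (~170 vs ~5600 tokens per 512x512 tile); OpenAI's actual accounting adds base tokens and image-resizing rules, so treat this as an approximation:

```python
# Rough image-token math behind the 4o vs 4o mini comparison.
# Per-tile figures are the approximate ones quoted in the comment;
# the real pricing formula also includes base tokens and resizing.
import math

TOKENS_PER_TILE = {"gpt-4o": 170, "gpt-4o-mini": 5600}

def image_tokens(model: str, width: int, height: int) -> int:
    """Approximate token count for one image, tiled into 512x512 blocks."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * TOKENS_PER_TILE[model]

# A 1024x1024 image is 4 tiles:
for model in TOKENS_PER_TILE:
    print(model, image_tokens(model, 1024, 1024))
# gpt-4o 680
# gpt-4o-mini 22400
```

So even though 4o mini's per-token price is much lower, the roughly 33x token multiplier per tile means image-heavy workloads can end up costing as much as, or more than, 4o.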