
467 points | mraniki | 3 comments
mraniki No.43534033
TL;DR

If you want to jump straight to the conclusion, I’d say go for Gemini 2.5 Pro: it’s better at coding, has a one-million-token context window compared to Claude’s 200k, and you can get it for free (a big plus). Claude 3.7 Sonnet is not far behind, but at this point there’s little reason to use it over Gemini 2.5 Pro.

diggan No.43534373
> has one million in context window

Is this the effective context window or just the absolute limit? Many models that claim to support very large context windows can't actually pass the typical "needle in a haystack" test, but I'm guessing there are published results somewhere demonstrating Gemini 2.5 Pro can actually find the needle?

llm_nerd No.43534475
Google has had almost perfect recall on the needle-in-a-haystack test since Gemini 1.5 [1], achieving close to 100% over the entire context window. I can't provide a link benchmarking 2.5 Pro in particular, but this has been a solved problem for Google's models, so I assume the same is true of their new model.

[1] https://cloud.google.com/blog/products/ai-machine-learning/t...

diggan No.43535972
Have those results been reproduced elsewhere, with benchmarks other than the ones Google seems to use?

It's hard to trust their own benchmarks at this point, and I'm not home at the moment, so I can't try it myself either.

llm_nerd No.43536690
They are testing a very straightforward needle retrieval, since LLMs were traditionally terrible at this over longer contexts.
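A minimal sketch of what such a needle-in-a-haystack probe looks like (the helper names and the trivial substring-scanning "retriever" standing in for an LLM call are illustrative, not any benchmark's actual code):

```python
# Sketch of a needle-in-a-haystack probe. In a real benchmark the
# haystack is sent to the model and its answer is graded; here a
# trivial substring scan stands in for the model call.

def build_haystack(needle: str, filler: str, n_filler: int, position: float) -> str:
    """Embed the needle sentence at a relative position inside repeated filler."""
    chunks = [filler] * n_filler
    chunks.insert(int(position * n_filler), needle)
    return " ".join(chunks)

def toy_retrieve(haystack: str, key: str) -> str:
    """Stand-in for the model: return the first sentence containing the key."""
    for sentence in haystack.split(". "):
        if key in sentence:
            return sentence
    return ""

needle = "The secret code for the vault is 7421."
haystack = build_haystack(needle, "The sky was a pale shade of blue.", 5000, 0.5)
answer = toy_retrieve(haystack, "secret code")
print(answer)
```

Real harnesses sweep both the haystack length and the needle's position, since recall often degrades toward the middle of very long contexts.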

There are some more advanced tests where the results are far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
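To illustrate the gap NoLiMa targets (the needle/question pair below follows the paper's motivating example; the filler text is made up): the needle shares no keywords with the question, so shallow lexical matching finds nothing, and the model has to bridge the association itself.

```python
# NoLiMa-style probe: the question asks which character has been to
# Dresden, but the needle never mentions Dresden. A literal keyword
# scan (a proxy for shallow lexical retrieval) comes up empty;
# answering requires knowing the Semper Opera House is in Dresden.
filler = "Some unrelated filler sentence about the weather. "
needle = "Yuki lives next to the Semper Opera House. "
haystack = filler * 200 + needle + filler * 200

question_keyword = "Dresden"
literal_hit = question_keyword in haystack
print(literal_hit)
```

This is why near-perfect scores on literal needle retrieval don't guarantee the model can actually *use* information buried deep in a long context.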