It's the same Jevons paradox logic behind why LLMs are so big despite massive diminishing returns. If we can output 4096 dims, why not use all of them?
As with LLMs, the bottleneck is still training data and the training regimen, but there's still demand for smaller embedding models due to both storage and compute concerns. EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m), released just yesterday, beats the 4096D Qwen-3 embedding benchmarks at 768D, and its 128D truncation via MRL (Matryoshka Representation Learning) beats many 768D embedding models.
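
Rough sketch of what MRL truncation looks like in practice (assuming EmbeddingGemma loads through sentence-transformers as the HF card describes; model id and dims are from that card): encode at the native 768D, keep only the leading 128 dims, and re-normalize.

    # MRL-style truncation sketch: the model is trained so that prefixes of
    # the embedding are themselves usable embeddings, so at inference time
    # you just slice and re-normalize.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("google/embeddinggemma-300m")  # 768D native

    docs = ["matryoshka embeddings", "jevons paradox"]
    full = model.encode(docs, normalize_embeddings=True)  # shape (2, 768)

    k = 128
    truncated = full[:, :k]  # keep the leading dims only
    truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

    # Cosine similarity still works on the 128D vectors, at ~1/6 the storage.
    print(truncated @ truncated.T)
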