We wanted to show ColiVara! It is a suite of services that lets you store, search, and retrieve documents based on their visual embeddings and understanding.
ColiVara has state-of-the-art retrieval performance on both *text* and visual documents, offering superior multimodal understanding and control.
It is an API-first implementation of the ColPali paper, using ColQwen2 as the vision language model (VLM). From the end-user standpoint it works exactly like RAG, but it uses vision models instead of chunking and text processing for documents. No OCR, no text extraction, no broken tables, no missing images. What you see is what you get.
On evals, it outperformed OCR + BM25 by 33%, and it beat captioning + BM25 by a similar margin.
Traditional OCR (or captioning)/chunk/embed pipelines with cosine similarity are fragile at many steps. ColiVara instead embeds documents at the page level and uses ColBERT-style MaxSim calculations. These are computationally demanding, but much better at retrieval tasks. You can read about our benchmarking here: https://blog.colivara.com/from-cosine-to-dot-benchmarking-si...
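For anyone unfamiliar with ColBERT-style MaxSim, here is a rough sketch of the scoring idea in NumPy. It is not ColiVara's actual code: the function names (`maxsim_score`, `rank_pages`), the array shapes, and the use of a plain dot product as the per-token similarity are all assumptions made for illustration. Each page is represented by many patch embeddings rather than one pooled vector, and each query token is matched against its best patch.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one page.

    query_emb: (n_query_tokens, dim) array, one row per query token.
    page_emb:  (n_patches, dim) array, one row per page patch.
    Both are hypothetical inputs; a real system would get them from the model.
    """
    # Similarity of every query token against every page patch (dot product).
    sim = query_emb @ page_emb.T  # shape: (n_query_tokens, n_patches)
    # For each query token keep only its best-matching patch, then sum.
    return float(sim.max(axis=1).sum())


def rank_pages(query_emb: np.ndarray, pages: list[np.ndarray]) -> tuple[list[int], list[float]]:
    """Score every page and return page indices sorted best-first, plus the scores."""
    scores = [maxsim_score(query_emb, page) for page in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy example with 2-dimensional embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # has a strong match for each token
page_b = np.array([[0.5, 0.5]])                          # only a weak match for each token
order, scores = rank_pages(query, [page_a, page_b])
```

The cost shows up in the `sim` matrix: every query token is compared with every patch on every candidate page, which is where the extra compute relative to a single cosine comparison per chunk comes from.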
Looking forward to hearing your feedback.