Show HN: Using VLLMs for RAG – skip the fragile OCR

(github.com)

2 points jonathan-adly | 1 comment | 30 Nov 24 15:28 UTC

Hi HN

We wanted to show you ColiVara! It is a suite of services that lets you store, search, and retrieve documents based on their visual embeddings and understanding.

ColiVara has state-of-the-art retrieval performance on both *text* and visual documents, offering superior multimodal understanding and control.

It is an API-first implementation of the ColPali paper, using ColQwen2 as the vision LLM. From the end user's standpoint it works exactly like RAG, but it uses vision models instead of chunking and text processing for documents. No OCR, no text extraction, no broken tables, no missing images. What you see is what you get.

On our evals, it outperformed OCR + BM25 by 33%. It also beat captioning + BM25 by a similar margin.

Traditional OCR (or caption)/chunk/embed pipelines with cosine similarity are fragile at many steps. ColiVara instead embeds documents at the page level and uses ColBERT-style MaxSim calculations. These are computationally demanding, but they are much better at retrieval tasks. You can read about our benchmarking here: https://blog.colivara.com/from-cosine-to-dot-benchmarking-si...
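
For readers unfamiliar with ColBERT-style MaxSim, here is a minimal sketch of the scoring idea in plain Python. It is not ColiVara's actual code: the function names, the toy 2-dimensional vectors, and the page labels are all hypothetical, and a real system would use model-produced embeddings and batched tensor operations. Each query token vector is matched against its best page patch vector by dot product, and those per-token maxima are summed to get the page score.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for every query vector take the best
    dot product against any page vector, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page and return (name, score) pairs, best first."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in pages.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): two query token vectors and two pages,
# each page represented by several patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.6, 0.0]],
}

ranking = rank_pages(query, pages)
for name, score in ranking:
    print(name, round(score, 3))
```

The cost grows with (query vectors × page vectors × pages), which is why this kind of scoring is heavier than comparing one pooled vector per chunk with cosine similarity.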

Looking forward to hearing your feedback.