
200 points by Adityav369

Hey HN, we're Adi and Arnav. A few months ago, we hit a wall trying to get LLMs to answer questions over research papers and instruction manuals. Everything worked fine until the answer lived inside an image or diagram embedded in the PDF. Even GPT-4o flubbed it (we recently tried o3 on the same task, and surprisingly it flubbed it too). Naive RAG pipelines just pulled in some text chunks and ignored the rest.

We took an invention disclosure PDF (https://drive.google.com/file/d/1ySzQgbNZkC5dPLtE3pnnVL2rW_9...) containing an IRR-vs-frequency graph and asked, "From the graph, at what frequency is the IRR maximized?" We originally tried this on GPT-4o, but while writing this post we used the newer natively multimodal o4-mini-high. After a 30-second thinking pause, it asked for clarifications, then churned out buggy code, pulled data from the wrong page, and still couldn't answer the question. We wrote up the full story with screenshots here: https://docs.morphik.ai/blogs/gpt-vs-morphik-multimodal.

We got frustrated enough to try fixing it ourselves.

We built Morphik to do multimodal retrieval over documents like PDFs, where images and diagrams matter as much as the text.

To do this, we use ColPali-style embeddings, which treat each document page as an image and generate multi-vector representations. These embeddings capture layout, typography, and visual context, so retrieval can return a whole table or schematic rather than just nearby tokens. Combined with vector search, this lets us retrieve the exact pages containing the relevant diagrams and pass them as images to the LLM. With this setup, even an 8B Llama 3.1 vision model running locally can answer the question!
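To make that concrete, here's a rough sketch of the late-interaction ("MaxSim") scoring step, assuming you already have multi-vector query and page embeddings from a ColPali-style model (the embedder itself is omitted, and the names here are illustrative, not our exact code):

    # Minimal sketch of ColPali-style late-interaction (MaxSim) retrieval.
    # Assumes each page embedding is a (num_patches, dim) tensor and the
    # query embedding is a (num_tokens, dim) tensor from a ColPali-style model.
    import torch

    def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
        # For every query token, take its best-matching page patch, then sum.
        sim = query_emb @ page_emb.T          # (num_tokens, num_patches)
        return sim.max(dim=1).values.sum().item()

    def top_pages(query_emb: torch.Tensor, pages: list[torch.Tensor], k: int = 3):
        # Rank pages by MaxSim; the winners get passed to the vision LLM as images.
        scores = [maxsim_score(query_emb, p) for p in pages]
        return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]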

Early pharma testers hit our system with queries like "Which EGFR inhibitors at 50 mg showed ≥30% tumor reduction?" We returned the right tables and plots, but still hit a bottleneck: we couldn't join the dots across multiple reports. So we built a knowledge graph: we tag entities in both text and images, normalize synonyms (Erlotinib → EGFR inhibitor), infer relations (e.g. administered_at, yields_reduction), and stitch everything into a graph. Now a single query can traverse that graph across documents and surface a coherent, cross-document answer along with the correct pages as images.
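Here's a toy sketch of that graph-building step using networkx. The entities, relations, and document references are hard-coded examples; in practice an extraction model produces them from the text and the image-derived content:

    # Toy sketch of the cross-document knowledge graph (illustrative only).
    import networkx as nx

    SYNONYMS = {"erlotinib": "EGFR inhibitor"}  # normalize aliases to one canonical node

    def canonical(entity: str) -> str:
        return SYNONYMS.get(entity.lower(), entity)

    G = nx.MultiDiGraph()

    def add_fact(subj, relation, obj, doc_id, page, chunk):
        # Add an edge and remember which document, page, and chunk support it.
        G.add_edge(canonical(subj), canonical(obj),
                   relation=relation, doc=doc_id, page=page, chunk=chunk)

    # Example facts extracted from two different reports:
    add_fact("Erlotinib", "administered_at", "50 mg", "report_A.pdf", 4, "dose table text")
    add_fact("Erlotinib", "yields_reduction", "32% tumor reduction", "report_B.pdf", 7, "efficacy plot caption")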

To illustrate that, and just for fun, we built a graph of 100 of Paul Graham's essays here: https://pggraph.streamlit.app/ You can search for nodes (e.g. startup, Sam Altman, Paul Graham) and see their connections. In our system, we store the relevant text chunks alongside the entities in the graph, so at query time we can extract the entities from the query, search the graph, and pull in the text chunks of all connected nodes, which improves cross-document queries.
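Continuing the same toy sketch, query time looks roughly like this: pull the canonical entities out of the query, walk their neighborhood in the graph, and hand the attached chunks (with their page references) back to the LLM:

    # Query-time traversal over the toy graph built above (reuses G and canonical).
    def answer_context(query_entities, hops: int = 1):
        seeds = {canonical(e) for e in query_entities if canonical(e) in G}
        nodes = set(seeds)
        for _ in range(hops):
            nodes |= {n for s in list(nodes) for n in nx.all_neighbors(G, s)}
        sub = G.subgraph(nodes)
        # Each supporting chunk comes back with its source document and page.
        return [(d["doc"], d["page"], d["chunk"]) for _, _, d in sub.edges(data=True)]

    # "Which EGFR inhibitors at 50 mg showed >=30% tumor reduction?"
    print(answer_context(["EGFR inhibitor", "50 mg"]))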

For longer or multi-turn queries, we added persistent KV caching, which stores the intermediate key-value states from the transformer's attention layers. Instead of recomputing attention over the same context from scratch on every call, we reuse those cached states, which speeds up repeated queries and lets us handle much longer effective context windows.
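Conceptually it's the standard prefix-caching pattern. Here's a minimal sketch of the idea with Hugging Face transformers (the model name is just an example, and this is not our exact implementation):

    # Sketch of reusing a prefix KV cache across queries (illustrative).
    import copy, torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

    model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Encode the long shared prefix (the retrieved pages) once and keep its KV states.
    prefix = "<long retrieved document context>"
    prefix_inputs = tok(prefix, return_tensors="pt")
    with torch.no_grad():
        prefix_cache = model(**prefix_inputs, past_key_values=DynamicCache()).past_key_values

    # Each follow-up question reuses a copy of the cached prefix instead of
    # recomputing attention over it from scratch.
    question = " Q: At what frequency is the IRR maximized? A:"
    inputs = tok(prefix + question, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=copy.deepcopy(prefix_cache), max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))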

We’re open‑source under the MIT Expat license: https://github.com/morphik-org/morphik-core

Would love to hear your RAG horror stories (what worked, what didn't) and any feedback on Morphik. We're here for it.

thot_experiment:
I'd love to have something like this, but calling a cloud is a no-go for me. I have a half-baked tool that a friend of mine and I applied to the Mozilla Builders Grant with (didn't get in); it's janky and I don't have time to work on it right now, but it does the thing. I also find myself using OpenWebUI's context RAG stuff sometimes, but I'd really like a way to dump all of my private documents into a DB and have search/RAG work against them locally, preferably in a way that's agnostic of the LLM backend.

Does such a project exist?

osigurdson:
Just curious, are you fine with running things in your own AWS / Azure / GCP account or do you really mean that the solution has to be fully on-premise?
thot_experiment:
Airgapped. It really makes threat modelling so so soooo much easier. It's temporal, so if I were being attacked by a state-level actor exfiltration would still be possible, but for this specific application I either have the data live and no internet, or internet and no data. I also have some lesser stuff that I allow on-prem with internet and just trust the firewall, but there's absolutely no way I'm doing any sensitive data storage or inference in the cloud.

Since people will be curious, one lesser thing I use this for is a diary/assistant, and it's nice to have the peace of mind that I can dump my innermost thoughts without any concern for oversharing.

ArnavAgrawal03:
Totally agree that air-gapped deployment provides unparalleled peace of mind. That's a major reason we have strong support for local deployment. Nice to know that our hypothesis is somewhat accurate :)