1303 points by serjester | 1 comment

jibuai:
I've been working on something similar the past couple months. A few thoughts:

- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism that processes overlapping groups of pages for the best accuracy.
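A minimal sketch of that sliding-window idea (the window/overlap sizes and the `pages` representation are assumptions, not from the comment): each window shares pages with its neighbor, so a chunk that straddles a page break falls entirely inside at least one window.

```python
def sliding_windows(pages, window=3, overlap=1):
    """Yield overlapping runs of consecutive pages.

    Chunk boundaries that span a page break are guaranteed to sit
    fully inside at least one window as long as no chunk is longer
    than `overlap` pages.
    """
    step = window - overlap
    i = 0
    while i < len(pages):
        yield pages[i:i + window]
        if i + window >= len(pages):
            break  # last window already covers the tail
        i += step
```

Each window would then be passed to the model independently, with duplicate chunks from the overlap region deduplicated afterwards.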

- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Output quality also degrades noticeably when you stuff too much into the context.

- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
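To make the box-linking step concrete, here is a sketch of a simpler post-hoc variant: fuzzily aligning the LLM's extracted text back onto OCR word boxes. This is an illustrative approximation, not the custom constrained sampler the comment describes, and the `(word, box)` input format is an assumption.

```python
from difflib import SequenceMatcher

def link_to_boxes(llm_text, ocr_words):
    """Map LLM-extracted text back to OCR bounding boxes.

    ocr_words: list of (word, box) pairs from a layout engine.
    Returns the boxes of OCR words that align with llm_text,
    using token-level sequence matching. A constrained sampler
    (as in the comment) would instead force the LLM to emit only
    OCR tokens, making the link exact rather than fuzzy.
    """
    ocr_tokens = [w for w, _ in ocr_words]
    sm = SequenceMatcher(a=llm_text.split(), b=ocr_tokens, autojunk=False)
    boxes = []
    for m in sm.get_matching_blocks():  # (a_start, b_start, size) triples
        boxes.extend(box for _, box in ocr_words[m.b:m.b + m.size])
    return boxes
```

The trade-off is that token-level alignment silently drops anything the LLM normalized (rewritten numbers, merged hyphenated words), which is presumably why the commenter needed sampler-level control instead.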