1303 points by serjester | 1 comment

jibuai:
I've been working on something similar the past couple months. A few thoughts:

- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism that processes overlapping groups of pages for the best accuracy.
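A minimal sketch of that sliding-window idea (the window/overlap sizes and the `pages` representation are assumptions, not from the comment): each window shares pages with its neighbor, so a chunk that straddles a page break falls entirely inside at least one window.

```python
def sliding_windows(pages, window=3, overlap=1):
    """Yield overlapping runs of consecutive pages.

    Chunk boundaries that span a page break are guaranteed to sit
    fully inside at least one window as long as no chunk is longer
    than `overlap` pages.
    """
    step = window - overlap
    i = 0
    while i < len(pages):
        yield pages[i:i + window]
        if i + window >= len(pages):
            break  # last window already covers the tail
        i += step
```

Each window would then be passed to the model independently, with duplicate chunks from the overlap region deduplicated afterwards.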

- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Output quality also degrades noticeably when you stuff too much into the context.

- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
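To make the box-linking step concrete, here is a sketch of a simpler post-hoc variant: fuzzily aligning the LLM's extracted text back onto OCR word boxes. This is an illustrative approximation, not the custom constrained sampler the comment describes, and the `(word, box)` input format is an assumption.

```python
from difflib import SequenceMatcher

def link_to_boxes(llm_text, ocr_words):
    """Map LLM-extracted text back to OCR bounding boxes.

    ocr_words: list of (word, box) pairs from a layout engine.
    Returns the boxes of OCR words that align with llm_text,
    using token-level sequence matching. A constrained sampler
    (as in the comment) would instead force the LLM to emit only
    OCR tokens, making the link exact rather than fuzzy.
    """
    ocr_tokens = [w for w, _ in ocr_words]
    sm = SequenceMatcher(a=llm_text.split(), b=ocr_tokens, autojunk=False)
    boxes = []
    for m in sm.get_matching_blocks():  # (a_start, b_start, size) triples
        boxes.extend(box for _, box in ocr_words[m.b:m.b + m.size])
    return boxes
```

The trade-off is that token-level alignment silently drops anything the LLM normalized (rewritten numbers, merged hyphenated words), which is presumably why the commenter needed sampler-level control instead.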