I wonder if you could use a less sophisticated model (maybe even something based on LSTMs) to walk over the DOM and extract just the chunks that should be emitted and collected into the browsable data structure, but doing it all locally. I feel like it'd be straightforward to generate training data for this, using an LLM-based toolchain like what the author wrote to be used directly.
replies(1):