
422 points simedw | 2 comments
insane_dreamer ◴[] No.44434388[source]
Interesting, but why round-trip through an LLM just to convert HTML to Markdown?
replies(2): >>44434463 #>>44435272 #
1. markstos ◴[] No.44434463[source]
Because the modern web isn't reliably HTML; it's "web apps" with heavy use of JavaScript and API calls. To display the HTML that you see in your browser in the first place, you need a user agent that runs JavaScript and makes all the backend calls that Chrome would make to put together some HTML.

Some websites may still return some static HTML upfront that could be usefully understood without JavaScript processing, but a lot don't.

That's not to say you need an LLM, though. Projects like Puppeteer drive a headless browser and can return the rendered HTML, which can then be sent through an HTML to Markdown filter. That would be less computationally intensive.
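
To make that last step concrete, here is a minimal sketch of an HTML to Markdown filter using only Python's standard library. It assumes the rendered HTML has already been obtained from a headless browser (that step is not shown here), and the names `MarkdownFilter` and `html_to_markdown` are hypothetical, made up for this example. It covers only a handful of tags, so treat it as an illustration rather than a real converter.

```python
import re
from html.parser import HTMLParser


class MarkdownFilter(HTMLParser):
    """Tiny HTML to Markdown filter (hypothetical helper, not a real library)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # Markdown fragments collected so far
        self.href = None  # target of the <a> currently open, if any
        self.skip = 0     # depth inside <script>/<style>, whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")  # blank line before the list
        elif tag == "li":
            self.out.append("\n- ")  # simplification: ordered lists also get "-"
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append("](%s)" % (self.href or ""))
            self.href = None

    def handle_data(self, data):
        if not self.skip:
            # Collapse runs of whitespace, as a browser would when rendering.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = MarkdownFilter()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    # Allow at most one blank line between blocks.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


if __name__ == "__main__":
    rendered = (
        "<h1>Title</h1><p>Hello <b>world</b>, see "
        "<a href='https://example.com'>this</a>.</p>"
        "<ul><li>one</li><li>two</li></ul><script>var x = 1;</script>"
    )
    print(html_to_markdown(rendered))
```

None of this involves a model: the filter is a single pass over the parsed tags, which is why it is so much cheaper than an LLM round trip.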

replies(1): >>44435180 #
2. insane_dreamer ◴[] No.44435180[source]
> That's not to say you need an LLM, ... then be sent through an HTML to Markdown filter. That would be less computationally intensive.

which was exactly my point