422 points simedw | 4 comments
1. insane_dreamer No.44434388
Interesting, but why round-trip through an LLM just to convert HTML to Markdown?
replies(2): >>44434463 #>>44435272 #
2. markstos No.44434463
Because the modern web isn't reliably HTML; it's "web apps" with heavy use of JavaScript and API calls. Just to get the HTML that you see in your browser, you need a user agent that runs JavaScript and makes all the backend calls that Chrome would make to assemble the page.

Some websites may still return some static HTML upfront that can be usefully understood without JavaScript processing, but a lot don't.

That's not to say you need an LLM, there are projects like Puppeteer that drive a headless browser and can return the rendered HTML, which can then be sent through an HTML to Markdown filter. That would be less computationally intensive.
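
For illustration, here is a minimal sketch of the second half of that pipeline, the HTML to Markdown filter. It assumes the rendered HTML has already been obtained from a headless browser and is available as a string. The names `MarkdownFilter` and `html_to_markdown` are hypothetical, not part of Puppeteer or any other library; the code uses only Python's standard `html.parser` and handles just a handful of tags (headings, paragraphs, lists, links, bold, italics), so treat it as a toy rather than a real converter.

```python
import re
from html.parser import HTMLParser


class MarkdownFilter(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical helper, a few tags only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # collected Markdown fragments
        self.href = None     # target of the <a> currently open, if any
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append("](" + (self.href or "") + ")")
            self.href = None

    def handle_data(self, data):
        # Drop script/style bodies; collapse runs of whitespace elsewhere.
        if self.skip_depth == 0:
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    """Convert an HTML string to rough Markdown using MarkdownFilter."""
    parser = MarkdownFilter()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    # Trim each line, then squeeze runs of blank lines down to one.
    text = "\n".join(line.strip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    rendered = (
        "<html><head><script>var x = 1;</script></head><body>"
        "<h1>Pancakes</h1>"
        '<p>Mix <b>flour</b> and <a href="https://example.com/milk">milk</a>.</p>'
        "<ul><li>2 eggs</li><li>1 cup flour</li></ul>"
        "</body></html>"
    )
    print(html_to_markdown(rendered))
```

A filter like this is a single linear pass over the markup with no model inference, which is where the lower computational cost comes from. The expensive part that remains is the headless browser that produces the rendered HTML in the first place.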

replies(1): >>44435180 #
3. insane_dreamer No.44435180
> That's not to say you need an LLM, ... then be sent through an HTML to Markdown filter. That would be less computationally intensive.

which was exactly my point

4. crent No.44435272
Because this isn't just converting HTML to Markdown. I'd recommend taking another look at the website, particularly the recipe example, as it demonstrates the goal of the project pretty well.