257 points ColinWright | 7 comments
1. rokkamokka ◴[] No.45774428[source]
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM
replies(2): >>45774455 #>>45779997 #
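A minimal sketch of that plaintext approach (hypothetical Python, not anything the scraper authors published). It also shows the trap that comes up later in the thread: a bare regex scan happily picks up URLs a browser would never render, such as ones inside HTML comments:

```python
import re

# Naive plaintext scan: pull URLs straight out of the raw HTML,
# no DOM construction at all.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(raw_html: str) -> list[str]:
    return URL_RE.findall(raw_html)

html = '<a href="https://example.com/a">a</a> <!-- https://example.com/hidden -->'
# The URL hidden in the comment leaks through alongside the real link.
print(extract_urls(html))
```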
2. embedding-shape ◴[] No.45774455[source]
Not "probably": searching through plaintext (which they seem to be doing) vs. iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
replies(1): >>45775540 #
3. franktankbank ◴[] No.45775540[source]
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
replies(1): >>45775601 #
4. embedding-shape ◴[] No.45775601{3}[source]
Yes: parse out HTML comments, which is kind of trivial if you've ever done any sort of parsing. Listen for "<!--", and whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't put too much pressure on them to produce high-quality software.
replies(2): >>45776771 #>>45777479 #
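The scanner described above can be sketched in a few lines (a hypothetical illustration, not production code):

```python
def strip_comments(html: str) -> str:
    # Naive scanner: on "<!--", skip everything until the next "-->".
    out = []
    i = 0
    while i < len(html):
        start = html.find("<!--", i)
        if start == -1:
            out.append(html[i:])
            break
        out.append(html[i:start])
        end = html.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: drop the rest of the document
        i = end + 3
    return "".join(out)
```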
5. stevage ◴[] No.45776771{4}[source]
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
6. jcheng ◴[] No.45777479{4}[source]
It's not quite as trivial as that; one could start the page with a <script> tag that contains "<!--" without a matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot, parsing HTML is not very expensive if you don't have to actually render it.
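That failure mode is easy to reproduce against the kind of naive comment scanner sketched earlier in the thread (hypothetical code; a real browser's HTML parser treats "<!--" inside script content differently and would still render the link):

```python
def strip_comments(html: str) -> str:
    # Same naive scanner: on "<!--", skip until "-->"; if none, drop the rest.
    out = []
    i = 0
    while i < len(html):
        start = html.find("<!--", i)
        if start == -1:
            out.append(html[i:])
            break
        out.append(html[i:start])
        end = html.find("-->", start + 4)
        if end == -1:
            break
        i = end + 3
    return "".join(out)

# A script tag containing an unmatched "<!--" swallows everything after it.
page = '<script>var s = "<!--";</script><a href="https://example.com/real">link</a>'
print(strip_comments(page))
```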

7. marginalia_nu ◴[] No.45779997[source]
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap; it's just quite fiddly to get right. You're probably gonna be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
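For scale, even Python's stdlib gives you an event-based HTML parser (not a full DOM, but the same cost profile as static parsing) that only sees URLs in real attributes and skips comments entirely. A minimal sketch:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect http(s) URLs from href attributes; comments are ignored."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value and value.startswith("http"):
                self.urls.append(value)

p = LinkExtractor()
p.feed('<a href="https://example.com/a">a</a><!-- https://example.com/x -->')
# Only the real link survives; the URL in the comment does not.
print(p.urls)
```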