
257 points ColinWright | 3 comments

rokkamokka
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.

embedding-shape
Not just probably: searching through plaintext (which they seem to be doing) and iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
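
To make the difference concrete, here is a minimal sketch of the two approaches in Python (hypothetical code, not from the thread; the regex and the HrefCollector class are illustrative stand-ins for what such a scraper might do):

    import re
    from html.parser import HTMLParser

    doc = '<a href="https://example.com">link</a> <!-- https://hidden.example -->'

    # Plaintext scan: one pass over the raw string. Fast, but it matches
    # URLs anywhere in the bytes, including inside comments.
    urls_from_text = re.findall(r'https?://[^\s"\'<>]+', doc)

    # Parser-based walk: tokenize the markup and only look at attribute
    # values, so commented-out URLs never show up.
    class HrefCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.urls.append(value)

    collector = HrefCollector()
    collector.feed(doc)

    print(urls_from_text)   # ['https://example.com', 'https://hidden.example']
    print(collector.urls)   # ['https://example.com']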

franktankbank
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?

embedding-shape
Yes: parse out HTML comments, which is also kind of trivial if you've ever done any sort of parsing. Listen for "<!--" and, whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't expect them to produce high-quality software.
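
A minimal sketch of the scan described above (hypothetical code, assuming the input is already a plain string):

    def strip_comments(html: str) -> str:
        """Scan for '<!--' and drop everything up to the next '-->'."""
        out, i = [], 0
        while i < len(html):
            start = html.find("<!--", i)
            if start == -1:
                out.append(html[i:])      # no more comments: keep the rest
                break
            out.append(html[i:start])     # keep the text before the comment
            end = html.find("-->", start + 4)
            if end == -1:                 # unterminated comment: drop the rest
                break
            i = end + 3                   # resume scanning after '-->'
        return "".join(out)

    print(strip_comments('keep <!-- hidden https://trap.example --> this'))
    # 'keep  this'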

stevage
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
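
For instance (hypothetical examples; trap.example is a made-up host), the plaintext scan from the earlier sketch collects all of these even though none are visible on the rendered page:

    import re

    page = """
    <div style="display:none"><a href="https://trap.example/1">x</a></div>
    <span data-api="https://trap.example/2"></span>
    <template><a href="https://trap.example/3">x</a></template>
    <script>var endpoint = "https://trap.example/4";</script>
    """

    # The raw-text scan happily picks up every one of them.
    print(re.findall(r'https?://[^\s"\'<>]+', page))
    # ['https://trap.example/1', ..., 'https://trap.example/4']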

jcheng
It's not quite as trivial as that: one could start the page with a <script> tag that contains "<!--" with no matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot; parsing HTML is not very expensive if you don't have to actually render it.
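
A minimal sketch of that trap, reusing the hypothetical strip_comments and HrefCollector from the earlier sketches:

    # A page that opens '<!--' inside <script> with no matching '-->':
    page = '<script>var s = "<!--";</script><a href="https://example.com">x</a>'

    # The naive stripper treats the rest of the page as one big comment,
    # so the link vanishes before URL extraction even runs...
    print(strip_comments(page))   # '<script>var s = "'

    # ...while a real tokenizer knows script bodies don't open comments
    # and still finds the link.
    collector = HrefCollector()
    collector.feed(page)
    print(collector.urls)         # ['https://example.com']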