
257 points ColinWright | 3 comments

rokkamokka
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.

embedding-shape
Not just probably: searching through plaintext (which they seem to be doing) and iterating over the DOM involve vastly different amounts of work in terms of resources used and performance, so "probably" is way underselling the difference :)
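
To make the difference concrete, here is a minimal sketch of the two approaches in Python (hypothetical code, not from the thread; the regex and the HrefCollector class are illustrative stand-ins for what such a scraper might do):

    import re
    from html.parser import HTMLParser

    doc = '<a href="https://example.com">link</a> <!-- https://hidden.example -->'

    # Plaintext scan: one pass over the raw string. Fast, but it matches
    # URLs anywhere in the bytes, including inside comments.
    urls_from_text = re.findall(r'https?://[^\s"\'<>]+', doc)

    # Parser-based walk: tokenize the markup and only look at attribute
    # values, so commented-out URLs never show up.
    class HrefCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.urls.append(value)

    collector = HrefCollector()
    collector.feed(doc)

    print(urls_from_text)   # ['https://example.com', 'https://hidden.example']
    print(collector.urls)   # ['https://example.com']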

franktankbank
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?

embedding-shape
Yes: parse out HTML comments, which is also kind of trivial if you've ever done any sort of parsing. Listen for "<!--" and, whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't expect them to produce high-quality software.
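
A minimal sketch of the scan described above (hypothetical code, assuming the input is already a plain string):

    def strip_comments(html: str) -> str:
        """Scan for '<!--' and drop everything up to the next '-->'."""
        out, i = [], 0
        while i < len(html):
            start = html.find("<!--", i)
            if start == -1:
                out.append(html[i:])      # no more comments: keep the rest
                break
            out.append(html[i:start])     # keep the text before the comment
            end = html.find("-->", start + 4)
            if end == -1:                 # unterminated comment: drop the rest
                break
            i = end + 3                   # resume scanning after '-->'
        return "".join(out)

    print(strip_comments('keep <!-- hidden https://trap.example --> this'))
    # 'keep  this'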

stevage
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
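
For instance (hypothetical examples; trap.example is a made-up host), the plaintext scan from the earlier sketch collects all of these even though none are visible on the rendered page:

    import re

    page = """
    <div style="display:none"><a href="https://trap.example/1">x</a></div>
    <span data-api="https://trap.example/2"></span>
    <template><a href="https://trap.example/3">x</a></template>
    <script>var endpoint = "https://trap.example/4";</script>
    """

    # The raw-text scan happily picks up every one of them.
    print(re.findall(r'https?://[^\s"\'<>]+', page))
    # ['https://trap.example/1', ..., 'https://trap.example/4']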

jcheng
It's not quite as trivial as that: one could start the page with a <script> tag that contains "<!--" with no matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot; parsing HTML is not very expensive if you don't have to actually render it.
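
A minimal sketch of that trap, reusing the hypothetical strip_comments and HrefCollector from the earlier sketches:

    # A page that opens '<!--' inside <script> with no matching '-->':
    page = '<script>var s = "<!--";</script><a href="https://example.com">x</a>'

    # The naive stripper treats the rest of the page as one big comment,
    # so the link vanishes before URL extraction even runs...
    print(strip_comments(page))   # '<script>var s = "'

    # ...while a real tokenizer knows script bodies don't open comments
    # and still finds the link.
    collector = HrefCollector()
    collector.feed(page)
    print(collector.urls)         # ['https://example.com']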