
OhMeadhbh No.45775001
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file full of garbage text, but put the link to it inside an HTML comment, so a real parser skips it while a regex scraper follows it.
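A minimal sketch of that idea (the markup and file names are hypothetical): the only link to the trap page sits inside an HTML comment, so a browser or real parser never surfaces it, but a naive regex scan of the raw source still picks it up.

    import re

    # Hypothetical page: the only link to the honeypot lives inside an HTML comment.
    page = """
    <html><body>
      <a href="/articles/real.html">Real article</a>
      <!-- <a href="/trap/garbage.html">bonus content</a> -->
    </body></html>
    """

    # A regex-based "parser" happily follows the commented-out trap link.
    print(re.findall(r'href="([^"]+)"', page))
    # -> ['/articles/real.html', '/trap/garbage.html']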

ericmcer No.45775617
How would you recommend doing it? If I were just trying to pull <a> tag links out, I feel like treating it as text and using a regex would be way more efficient than a full-on HTML parser like jsdom or something.
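The regex version is admittedly short; a sketch of what that approach might look like (the pattern and names are just illustrative, and it has no notion of comments, scripts, or malformed markup):

    import re

    # Grab href values from <a ...> tags by treating the HTML as plain text.
    A_HREF = re.compile(r'<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def extract_links_regex(html_text):
        return A_HREF.findall(html_text)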
singron No.45775750
You don't need JavaScript to parse HTML. Just use an HTML parser; they are very fast. HTML isn't a regular language, so you can't parse it with regular expressions.

Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
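A rough sketch of the parser route, here using Python's standard-library html.parser (the class and function names are just illustrative): real <a> elements are reported through handle_starttag, while anything inside comments or script bodies never reaches it.

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect href values from actual <a> elements."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs; value may be None.
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def extract_links(html_text):
        collector = LinkCollector()
        collector.feed(html_text)
        return collector.links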

zahlman No.45777232
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter what tags they're in, how the document is structured, or even whether they appear as link targets, in the readable text, or anywhere else.
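In that case a crude scan over the raw source is arguably exactly what you want; a minimal sketch (the pattern is only illustrative, not robust):

    import re

    # Pull anything that looks like an absolute URL out of the raw page source,
    # regardless of whether it sits in an attribute, visible text, or a comment.
    URL = re.compile(r'https?://[^\s"\'<>]+')

    def find_all_urls(source):
        return URL.findall(source)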