(cryptography.dog)

255 points ColinWright | 1 comments | 31 Oct 25 15:44 UTC


OhMeadhbh ◴[31 Oct 25 18:16 UTC] No.45775001[source]▶

I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it inside a comment.
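A minimal sketch of why the commented-out link would work as bait, using only Python's standard library. The page contents, the `/honeypot.html` path, and the helper names are made up for illustration, and the regex assumes double-quoted `href` attributes. The regex sees the link inside the comment; the tokenizing parser hands the comment to a separate callback and never reports it as a tag.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one real link, one link hidden inside an HTML comment.
PAGE = """<html><body>
<a href="/real-page.html">Real page</a>
<!-- <a href="/honeypot.html">do not follow</a> -->
</body></html>"""

# Naive scraper approach: grab anything that looks like href="...".
HREF_RE = re.compile(r'href="([^"]+)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def regex_links(html):
    """Links as a regex-based scraper would see them."""
    return HREF_RE.findall(html)


def parsed_links(html):
    """Links as a parser-based client would see them."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex :", regex_links(PAGE))
    print("parser:", parsed_links(PAGE))
```

Under these assumptions, any request for `/honeypot.html` comes from a client that read the raw text rather than the parsed document, which is the signal the honeypot relies on.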

replies(5): >>45775128 #>>45775617 #>>45776644 #>>45776976 #>>45780383 #

1. mikeiz404 ◴[31 Oct 25 21:37 UTC] No.45776976[source]▶

>>45775001 #

It’s been some time since I have dealt with web scrapers, but it takes fewer resources to run a regex than it does to parse the DOM (which may have syntactically incorrect parts anyway). This can add up when running many scraping requests in parallel. So, depending on your goals, using a regex can be much preferred.
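One way to check the cost claim on your own pages is a rough timing harness like the sketch below. The sample markup and function names are hypothetical, and `html.parser` is only a tokenizer, so a scraper that builds a full DOM would likely cost more than what is measured here. No timings are quoted because they depend on the machine and the input.

```python
import re
import timeit
from html.parser import HTMLParser

# Assumes double-quoted href attributes, as in the sample below.
HREF_RE = re.compile(r'href="([^"]+)"')


class HrefParser(HTMLParser):
    """Collects every href attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_with_regex(html):
    return HREF_RE.findall(html)


def extract_with_parser(html):
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def time_extractors(html, repeats=200):
    """Return total seconds spent by each approach over `repeats` runs."""
    return {
        "regex": timeit.timeit(lambda: extract_with_regex(html), number=repeats),
        "parser": timeit.timeit(lambda: extract_with_parser(html), number=repeats),
    }


# Hypothetical sample page: a list of 500 links.
SAMPLE = "<ul>" + "".join(
    f'<li><a href="/item/{i}">item {i}</a></li>' for i in range(500)
) + "</ul>"

if __name__ == "__main__":
    for name, seconds in time_extractors(SAMPLE).items():
        print(f"{name}: {seconds:.4f}s")
```

On clean markup like this sample both approaches return the same links, so the trade-off is speed against correctness on comments, scripts, and malformed input.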


AI scrapers request commented scripts