
255 points ColinWright | 1 comments
OhMeadhbh ◴[] No.45775001[source]
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: set up a honeypot HTML file full of garbage text, but put the only link to it inside an HTML comment. A real parser never sees that link; a regex or plain text search over the raw page does.
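To make the difference concrete, here is a rough sketch of the two approaches using only Python's standard library. The page contents, the `/honeypot.html` path, and the names `PAGE`, `LinkCollector`, `links_by_regex`, and `links_by_parser` are all made up for illustration; I have no idea what any actual bot runs.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one real link, plus a honeypot link hidden in a comment.
PAGE = """<html><body>
<a href="/articles/1">Real article</a>
<!-- <a href="/honeypot.html">do not follow</a> -->
</body></html>"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_regex(html):
    # Plain text search: knows nothing about comments, so it matches everything.
    return re.findall(r'href="([^"]+)"', html)


def links_by_parser(html):
    # Comment bodies go to handle_comment(), not handle_starttag(),
    # so the commented-out anchor is never collected.
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(links_by_regex(PAGE))   # ['/articles/1', '/honeypot.html']
print(links_by_parser(PAGE))  # ['/articles/1']
```

Any client that requests the honeypot path found it by scanning raw text, which is the signal the trap relies on.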

replies(5): >>45775128 #>>45775617 #>>45776644 #>>45776976 #>>45780383 #
1. tuwtuwtuwtuw ◴[] No.45775128[source]
Do you think that if some CS programs taught parsing, the authors of the bot would parse the HTML to properly extract links, instead of just doing plain text search?

I doubt it.