
OhMeadhbh No.45775001
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file full of garbage text, but put the link to it inside an HTML comment, so a real parser skips it while a regex scraper follows it.
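A minimal sketch of that idea (the markup and file names are hypothetical): the only link to the trap page sits inside an HTML comment, so a browser or real parser never surfaces it, but a naive regex scan of the raw source still picks it up.

    import re

    # Hypothetical page: the only link to the honeypot lives inside an HTML comment.
    page = """
    <html><body>
      <a href="/articles/real.html">Real article</a>
      <!-- <a href="/trap/garbage.html">bonus content</a> -->
    </body></html>
    """

    # A regex-based "parser" happily follows the commented-out trap link.
    print(re.findall(r'href="([^"]+)"', page))
    # -> ['/articles/real.html', '/trap/garbage.html']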

ericmcer No.45775617
How would you recommend doing it? If I were just trying to pull <a> tag links out, I feel like treating it as text and using a regex would be way more efficient than a full-on HTML parser like jsdom or something.
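The regex version is admittedly short; a sketch of what that approach might look like (the pattern and names are just illustrative, and it has no notion of comments, scripts, or malformed markup):

    import re

    # Grab href values from <a ...> tags by treating the HTML as plain text.
    A_HREF = re.compile(r'<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def extract_links_regex(html_text):
        return A_HREF.findall(html_text)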
singron No.45775750
You don't need JavaScript to parse HTML. Just use an HTML parser; they are very fast. HTML isn't a regular language, so you can't parse it with regular expressions.

Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
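A rough sketch of the parser route, here using Python's standard-library html.parser (the class and function names are just illustrative): real <a> elements are reported through handle_starttag, while anything inside comments or script bodies never reaches it.

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect href values from actual <a> elements."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs; value may be None.
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def extract_links(html_text):
        collector = LinkCollector()
        collector.feed(html_text)
        return collector.links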

zahlman No.45777232
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter what tags they're in, how the document is structured, or even whether they appear as link targets, in the readable text, or anywhere else.
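In that case a crude scan over the raw source is arguably exactly what you want; a minimal sketch (the pattern is only illustrative, not robust):

    import re

    # Pull anything that looks like an absolute URL out of the raw page source,
    # regardless of whether it sits in an attribute, visible text, or a comment.
    URL = re.compile(r'https?://[^\s"\'<>]+')

    def find_all_urls(source):
        return URL.findall(source)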