257 points ColinWright | 1 comments
latenightcoding ◴[] No.45774927[source]
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else. Commented-out URLs would have been added to my queue.
replies(1): >>45775080 #
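For anyone wondering why commented-out URLs would end up in the queue: a regex scans raw text and has no notion of HTML comments, while a parser hands comment contents to a separate callback. A rough Python sketch of the difference (not the Perl the parent used; `SAMPLE_HTML`, `URL_RE`, `regex_urls` and `parser_urls` are made-up names for illustration, and the pattern is deliberately naive):

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one live link and one commented-out link.
SAMPLE_HTML = """
<p><a href="https://example.com/live">live</a></p>
<!-- <a href="https://example.com/old">old</a> -->
"""

# Naive pattern (an assumption, not a full URL grammar): an http(s) URL
# running up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def regex_urls(html):
    """Return every URL-looking string in the raw text, comments included."""
    return URL_RE.findall(html)


class LinkParser(HTMLParser):
    """Collect href values from <a> tags; comment bodies are not parsed as tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parser_urls(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


print(regex_urls(SAMPLE_HTML))   # both URLs, including the commented-out one
print(parser_urls(SAMPLE_HTML))  # only the live link
```

Whether picking up the commented-out link is a bug or a feature depends on what the crawler is for.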
rightbyte ◴[] No.45775080[source]
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine, and it is more robust against things moving around on the page.
replies(2): >>45775143 #>>45775158 #
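A small sketch of the "things moving around" point, assuming two invented versions of the same page (`PAGE_V1`, `PAGE_V2`) and comparing a regex keyed on a marker against a purely positional path. All names here are hypothetical, and the snippets are kept well-formed so the standard library XML parser can stand in for a DOM:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page before and after a redesign: the price paragraph moves
# from the second <div> into a new <section>.
PAGE_V1 = (
    "<html><body><div><p>intro</p></div>"
    "<div><p class='price'>42 EUR</p></div></body></html>"
)
PAGE_V2 = (
    "<html><body><div><p>banner</p></div><div><p>intro</p></div>"
    "<section><div><p class='price'>42 EUR</p></div></section></body></html>"
)

# Regex keyed on the marker, not on where the element sits in the tree.
PRICE_RE = re.compile(r"<p class='price'>(.*?)</p>", re.S)


def price_by_regex(html):
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


def price_by_position(html):
    # Positional navigation: "the <p> inside the second <div> of <body>".
    root = ET.fromstring(html)
    node = root.find("./body/div[2]/p")
    return node.text if node is not None else None


for page in (PAGE_V1, PAGE_V2):
    print(price_by_regex(page), "|", price_by_position(page))
# 42 EUR | 42 EUR
# 42 EUR | intro
```

To be fair to the DOM side, selecting by class or id through a parser would survive this redesign too. The sketch only shows that a positional path breaks silently where a marker-based regex does not.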
1. horseradish7k ◴[] No.45775158[source]
But not when crawling. You don't know the page format in advance; you don't even know what the page contains!