
257 points ColinWright | 1 comment | source
OhMeadhbh ◴[] No.45775001[source]
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a link to a honeypot HTML page full of garbage text, but put the link inside an HTML comment. A real parser skips comments, so legitimate crawlers and browsers never follow it; a regex-based scraper will.
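
A quick sketch of why that trick works, with a made-up page, URLs, and regex: the regex-based scraper extracts the honeypot link from inside the HTML comment, while an actual HTML parser never even sees it.

    # Sketch: honeypot link hidden in an HTML comment. Page content and
    # URLs are invented for illustration.
    import re
    from html.parser import HTMLParser

    page = """
    <html><body>
      <a href="/articles/real-post">Real post</a>
      <!-- <a href="/honeypot/garbage.html">nothing to see here</a> -->
    </body></html>
    """

    # A regex "parser" matches every href, including the one in the comment.
    print(re.findall(r'href="([^"]+)"', page))
    # -> ['/articles/real-post', '/honeypot/garbage.html']

    # A real HTML parser hands comments to handle_comment and never treats
    # their contents as tags, so the honeypot link is invisible to it.
    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href")

    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)  # -> ['/articles/real-post']

Anything that shows up in the honeypot's access log is, with high confidence, a bot that neither parses HTML nor reads robots.txt.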

replies(5): >>45775128 >>45775617 >>45776644 >>45776976 >>45780383
1. mrweasel ◴[] No.45780383[source]
You don't need to teach parsing; that won't help much anyway. We need to teach people to be good netizens again. I'd argue it was always considered reasonable to scrape content, as long as you didn't misrepresent it as your own and you scraped responsibly: backing off if the server started to slow down, or simply not crawling too fast to begin with.
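
For what it's worth, here's a minimal sketch of what "scraping responsibly" could look like in practice, assuming a made-up bot name, URL list, and thresholds: keep a base delay between requests, and double it whenever the server slows down or answers 429/503.

    # Sketch of a polite crawler: fixed base delay, backing off when the
    # server gets slow or starts returning errors. All values are arbitrary.
    import time
    import requests

    BASE_DELAY = 1.0      # seconds between requests when the site is healthy
    SLOW_THRESHOLD = 2.0  # responses slower than this mean "back off"
    MAX_DELAY = 60.0

    def polite_fetch(urls):
        delay = BASE_DELAY
        for url in urls:
            resp = requests.get(
                url,
                timeout=30,
                headers={"User-Agent": "example-bot/0.1 (admin@example.com)"},
            )
            if resp.status_code in (429, 503) or resp.elapsed.total_seconds() > SLOW_THRESHOLD:
                # The site is struggling: double the delay instead of hammering it.
                delay = min(delay * 2, MAX_DELAY)
            else:
                # Healthy response: drift back toward the base delay.
                delay = max(BASE_DELAY, delay / 2)
            yield url, resp
            time.sleep(delay)

None of this is sophisticated; it's the crawling equivalent of holding the door.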

Currently we have at least three problems:

1) Companies have no issue with omitting sources and not linking back.

2) There are too many scrapers; even if they all behaved, some sites would struggle to handle the load.

3) Scrapers go full throttle 24/7, expecting the sites to rate-limit them if they're going too fast. They hammer a site into the ground, wait until it's back up, and hammer it again, grabbing what they can before it crashes once more (a sketch of the well-behaved alternative is below).
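
To make that third point concrete, a hedged sketch of the well-behaved alternative: honor 429/503 and the Retry-After header instead of retrying at full speed. The retry cap and fallback delay are arbitrary.

    # Sketch: back off when the server says so, instead of hammering it
    # until it falls over. Retry cap and fallback delay are arbitrary.
    import time
    import requests

    def fetch_with_respect(url, max_retries=5, fallback_delay=30.0):
        for _ in range(max_retries):
            resp = requests.get(url, timeout=30)
            if resp.status_code not in (429, 503):
                return resp
            # The server asked us to slow down; honor Retry-After if given.
            retry_after = resp.headers.get("Retry-After", "")
            wait = float(retry_after) if retry_after.isdigit() else fallback_delay
            time.sleep(wait)
        raise RuntimeError(f"{url} is still rate-limiting us; give it a rest")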

There's no longer a sense that the internet is for all of us and that we need to make room for each other. Websites and human-generated content exist as a resource to be strip-mined.