255 points ColinWright | 8 comments
1. OhMeadhbh ◴[] No.45775001[source]
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" html to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file full of garbage text, but put the link to it inside an HTML comment, so only scrapers that treat the page as plain text ever find it.
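
Something like this, as a rough sketch (Python; the page snippet, pattern, and file name are made up for illustration):

    # Hypothetical honeypot: the decoy link lives inside an HTML comment.
    # A bot doing a plain text search over the source finds it; a real
    # HTML parser treats the comment as a comment node and never reports
    # the <a> inside it as a link.
    import re

    page = """
    <a href="/real-page">real</a>
    <!-- <a href="/trap-garbage.html">honeypot, never rendered</a> -->
    """

    print(re.findall(r'href="([^"]+)"', page))
    # ['/real-page', '/trap-garbage.html']  <- the naive bot walks into the trap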

replies(5): >>45775128 #>>45775617 #>>45776644 #>>45776976 #>>45780383 #
2. tuwtuwtuwtuw ◴[] No.45775128[source]
Do you think that if some CS programs taught parsing, the authors of the bot would properly parse the HTML to extract links, instead of just doing a plain text search?

I doubt it.

3. ericmcer ◴[] No.45775617[source]
How would you recommend doing it? If I were just trying to pull <a> tag links out, I feel like treating it as text and using a regex would be way more efficient than a full-on HTML parser like JSDom or something.
replies(1): >>45775750 #
4. singron ◴[] No.45775750[source]
You don't need JavaScript to parse HTML. Just use an HTML parser; they are very fast. HTML isn't a regular language, so you can't correctly parse it with regular expressions.

Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
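
For example, a rough sketch with Python's standard-library parser (Python is just for illustration; any language with a tolerant HTML parser works the same way):

    # Extract <a href> values with a real HTML parser: no browser,
    # no JavaScript runtime, just the standard library.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    collector = LinkCollector()
    collector.feed('<p>hi <a href="/a">a</a> <!-- <a href="/trap">x</a> --></p>')
    print(collector.links)  # ['/a'] -- the commented-out link is skipped

Parsers built for real-world pages (html.parser here, or lxml/html5lib) are also designed to recover from the broken markup mentioned elsewhere in the thread.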

replies(1): >>45777232 #
5. vaylian ◴[] No.45776644[source]
The people who do this type of scraping to feed their AI are probably also using AI to write their scraper.
6. mikeiz404 ◴[] No.45776976[source]
6. mikeiz404 ◴[] No.45776976[source]
It’s been some time since I’ve dealt with web scrapers, but running a regex takes fewer resources than parsing the DOM (which may contain syntactically invalid markup anyway). That adds up when running many scraping requests in parallel, so depending on your goals a regex can be much preferable.
7. zahlman ◴[] No.45777232{3}[source]
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter what tags they're in, how the document is structured, or even whether they appear as link targets, in the readable text, or anywhere else.
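
Something like this rough sketch (Python; the pattern is illustrative only and will over- and under-match in edge cases):

    # If all you want is URL-ish strings, scan the raw source and ignore
    # the markup structure entirely.
    import re

    URL_RE = re.compile(r'''https?://[^\s"'<>)]+''')

    page = '<a href="https://example.com/x">see https://example.org/y</a>'
    print(URL_RE.findall(page))
    # ['https://example.com/x', 'https://example.org/y']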
8. mrweasel ◴[] No.45780383[source]
You don't need to teach parsing; that won't help much anyway. We need to teach people to be good netizens again. I'd argue that scraping content was always viewed as reasonable, as long as you didn't misrepresent the content as your own and you scraped responsibly, backing off if the server started to slow down, or simply not crawling too fast to begin with.

Currently we have at least three problems:

1) Companies have no issue with not providing sources and not linking back.

2) There are too many scrapers; even if they all behaved, some sites would struggle to handle them.

3) Scrapers go full throttle 24/7, expecting the sites to rate-limit them if they are going too fast. They hammer a site into the ground, wait until it's back, and hammer it again, grabbing what they can before it crashes once more.

There's no longer a sense that the internet is for all of us and that we need to make room for each other. Websites and human-generated content exist as a resource to be strip-mined.