597 points classichasclass | 2 comments
8organicbits ◴[] No.45012759[source]
I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, clear identification in the User-Agent string, single source IP address. But I've noticed some anti-bot tricks getting applied to the robots.txt file itself. The latest was a slow loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which then meant I continued to crawl that site. I had to change the code so a robots.txt timeout is treated like a "Disallow: /".
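
For illustration, a minimal sketch of that kind of change using only Python's standard library. This is not the commenter's actual code: `fetch_with_deadline`, `load_robots`, the 10 second deadline and the size cap are hypothetical names and values. A socket timeout only bounds each individual read, so a slow drip of bytes can stay under it; the sketch therefore also checks a total deadline between reads. Treating 4xx as "no rules" and everything else that fails as "Disallow: /" is one reading of RFC 9309, not the only possible policy.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

MAX_BYTES = 512 * 1024          # assumed cap on robots.txt size
DISALLOW_ALL = ["User-agent: *", "Disallow: /"]


def fetch_with_deadline(url, deadline_s=10.0):
    """Return the robots.txt text, or raise TimeoutError if the whole
    download takes longer than deadline_s (hypothetical helper)."""
    start = time.monotonic()
    chunks, size = [], 0
    # timeout= bounds each blocking socket operation, not the whole transfer.
    with urllib.request.urlopen(url, timeout=deadline_s) as resp:
        while size < MAX_BYTES:
            if time.monotonic() - start > deadline_s:
                raise TimeoutError("robots.txt exceeded total deadline")
            chunk = resp.read1(4096)   # at most one underlying read
            if not chunk:
                break
            chunks.append(chunk)
            size += len(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def load_robots(url, fetch=fetch_with_deadline):
    """Build a RobotFileParser, failing closed when robots.txt is unreachable."""
    parser = urllib.robotparser.RobotFileParser(url)
    try:
        text = fetch(url)
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            parser.parse([])            # no robots.txt: no restrictions
        else:
            parser.parse(DISALLOW_ALL)  # server error: assume Disallow: /
    except OSError:
        # Covers timeouts, slow-loris stalls and other network failures.
        parser.parse(DISALLOW_ALL)
    else:
        parser.parse(text.splitlines())
    return parser
```

The crawler would then call `load_robots(...).can_fetch(user_agent, page_url)` before every request. Passing `fetch` as a parameter keeps the fail-closed policy testable without a network.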

It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.

NegativeK ◴[] No.45015149[source]
I really appreciate you giving a shit. Not sarcastically -- it seems like you're actually doing everything right, and it makes a difference.

Gating robots.txt might be a mistake, but it also might be a quick way to deal with crawlers that mine robots.txt for the more interesting pages. It's also a page that's never visited by humans. So if you make it a tarpit, you both refuse to give the bot more information and slow it down.

It's crap that it's affecting your work, but a website owner isn't likely to care about the distinction when they're pissed off at having to deal with bad actors that they should never have to care about.

gabeio ◴[] No.45015402[source]
> It's also a page that's never visited by humans.

Never is a strong word. I have definitely visited robots.txt of various websites for a variety of random reasons.

  - remembering the format
  - seeing what they might have tried to "hide"
  - using it like a site's directory
  - testing if the website is working if their main dashboard/index is offline
sdenton4 ◴[] No.45015588[source]
Are you sure you are human?
1. gspencley ◴[] No.45018384[source]
Yes. I have checked many checkboxes that say "Verify You Are a Human" and they have always confirmed that I am.

In fairness, however, my daughters ask me that question all the time and it is possible that the verification checkboxes are lying to me as part of some grand conspiracy to make me think I am a human when I am not.

2. nullc ◴[] No.45019737[source]
https://www.youtube.com/watch?v=4VrLQXR7mKU

--- though I think passing them is more a sign that you're a robot than anything else.