    597 points classichasclass | 20 comments
    1. 8organicbits ◴[] No.45012759[source]
    I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, clear identification in the User-Agent string, a single source IP address. But I've noticed some anti-bot tricks getting applied to the robots.txt file itself. The latest was a slow loris approach where it takes forever for robots.txt to download. I accidentally treated this as a 404, which then meant I continued to crawl that site. I had to change the code so a robots.txt timeout is treated like a Disallow: /.

    It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
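
    A stdlib-Python sketch of that rule (the helper name, the ten-second budget, and the deadline loop are illustrative choices, not my actual crawler code):

      import time
      import urllib.error
      import urllib.request
      import urllib.robotparser

      DISALLOW_ALL = ["User-agent: *", "Disallow: /"]

      def fetch_robots(robots_url, budget=10.0):
          """Fetch robots.txt; any timeout or stall is treated as Disallow: /."""
          parser = urllib.robotparser.RobotFileParser(robots_url)
          try:
              deadline = time.monotonic() + budget
              chunks = []
              # A per-read socket timeout alone won't stop a slow loris that
              # drips one byte at a time, so enforce a wall-clock deadline too.
              with urllib.request.urlopen(robots_url, timeout=budget) as resp:
                  while True:
                      if time.monotonic() > deadline:
                          raise TimeoutError("robots.txt exceeded wall-clock budget")
                      chunk = resp.read(4096)
                      if not chunk:
                          break
                      chunks.append(chunk)
              body = b"".join(chunks).decode("utf-8", errors="replace")
              parser.parse(body.splitlines())
          except urllib.error.HTTPError as err:
              if err.code == 404:
                  parser.parse([])  # genuinely missing file: crawling allowed
              else:
                  parser.parse(DISALLOW_ALL)  # be conservative on other errors
          except OSError:  # timeouts (TimeoutError), stalls, connection resets
              parser.parse(DISALLOW_ALL)
          return parser

    Every crawl request then goes through parser.can_fetch(agent, url), so a stalled robots.txt download quietly becomes "crawl nothing here."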

    replies(7): >>45013175 #>>45014774 #>>45015149 #>>45018582 #>>45018859 #>>45020630 #>>45027106 #
    2. navane ◴[] No.45013175[source]
    That's like deterring burglars by hiding your doorbell
    3. brianwawok ◴[] No.45014774[source]
    I doubt that’s on purpose. The bad guys that don’t follow robots.txt don’t bother downloading it.

    Never attribute to malice what can be attributed to incompetence.

    replies(2): >>45020789 #>>45023736 #
    4. NegativeK ◴[] No.45015149[source]
    I really appreciate you giving a shit. Not sarcastically -- it seems like you're actually doing everything right, and it makes a difference.

    Gating robots.txt might be a mistake, but it also might be a quick way to deal with crawlers who mine robots.txt for pages that are more interesting. It's also a page that's never visited by humans. So if you make it a tarpit, you both refuse to give the bot more information and slow it down.

    It's crap that it's affecting your work, but a website owner isn't likely to care about the distinction when they're pissed off at having to deal with bad actors that they should never have to care about.
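
    For illustration, a tarpit like that can be tiny. A stdlib-Python sketch (the one-byte-every-two-seconds pacing and the port are made-up numbers, not anything from a real deployment):

      import time
      from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

      ROBOTS = b"User-agent: *\nDisallow: /\n"

      class TarpitHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              if self.path != "/robots.txt":
                  self.send_error(404)
                  return
              self.send_response(200)
              self.send_header("Content-Type", "text/plain")
              self.send_header("Content-Length", str(len(ROBOTS)))
              self.end_headers()
              for i in range(len(ROBOTS)):
                  # Drip one byte every two seconds: even a one-line file
                  # costs the bot nearly a minute of held-open connection.
                  self.wfile.write(ROBOTS[i:i + 1])
                  self.wfile.flush()
                  time.sleep(2)

      if __name__ == "__main__":
          # ThreadingHTTPServer so one tarpitted crawler doesn't block others.
          ThreadingHTTPServer(("", 8080), TarpitHandler).serve_forever()

    The catch is that every stalled crawler also holds a thread open on your side, which is why this sort of thing is usually done in an async server or at the load balancer in practice.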

    replies(2): >>45015402 #>>45019078 #
    5. gabeio ◴[] No.45015402[source]
    > It's also a page that's never visited by humans.

    Never is a strong word. I have definitely visited robots.txt of various websites for a variety of random reasons.

      - remembering the format
      - seeing what they might have tried to "hide"
      - using it like a site's directory
      - testing if the website is working if their main dashboard/index is offline
    replies(1): >>45015588 #
    6. sdenton4 ◴[] No.45015588{3}[source]
    Are you sure you are human?
    replies(1): >>45018384 #
    7. gspencley ◴[] No.45018384{4}[source]
    Yes. I have checked many checkboxes that say "Verify You Are a Human" and they have always confirmed that I am.

    In fairness, however, my daughters ask me that question all the time and it is possible that the verification checkboxes are lying to me as part of some grand conspiracy to make me think I am a human when I am not.

    replies(1): >>45019737 #
    8. ronsor ◴[] No.45018582[source]
    > The latest was a slow loris approach where it takes forever for robots.txt to download.

    I'd treat this in a client the same way as I do in a server application: if the peer is behaving maliciously or improperly, I silently drop the TCP connection without notifying the other party. They can waste their resources by continuing to send bytes for the next few minutes until their own TCP stack realizes what happened.

    replies(1): >>45019200 #
    9. fsckboy ◴[] No.45018859[source]
    >a slow loris approach

    does this refer to the word "loris", which was only recently, after several years, added to Wordle™?

    replies(2): >>45018961 #>>45020161 #
    10. bananananananan ◴[] No.45018961[source]
    No. A slowloris is an existing attack predating Wordle.
    11. ghxst ◴[] No.45019078[source]
    I usually hit robots.txt when I want to make fetch requests to a domain from the console without running into CORS or CSP issues. Since it's just a static file, there's no client-side code interfering, which makes it nice for testing. If you're hunting for vulnerabilities it's also worth probing (especially with crawler UAs), since it can leak hidden endpoints or framework-specific paths that devs didn't expect anyone to notice.
    12. conradludgate ◴[] No.45019200[source]
    How do you silently drop a TCP connection? Closing the socket fd usually results in a FIN packet being sent whether I want it to or not.

    Additionally, it's not going to be using that many resources before your kernel sends it a RST the next time a data packet arrives.

    replies(1): >>45019297 #
    13. ronsor ◴[] No.45019297{3}[source]
    TCP_REPAIR: https://tinselcity.github.io/TCP_Repair/
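
    A sketch of the trick from that link, as I understand it (Linux-only, needs CAP_NET_ADMIN; the constant is hardcoded because not every Python build exposes socket.TCP_REPAIR):

      import socket

      # Value of TCP_REPAIR from <linux/tcp.h>, used if socket lacks it.
      TCP_REPAIR = getattr(socket, "TCP_REPAIR", 19)

      def drop_silently(conn):
          """Close a TCP connection without sending FIN or RST."""
          # In repair mode the kernel discards local state on close() without
          # notifying the peer, which keeps retransmitting into the void.
          conn.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
          conn.close()
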
    14. nullc ◴[] No.45019737{5}[source]
    https://www.youtube.com/watch?v=4VrLQXR7mKU

    --- though I think passing them is more a sign that you're a robot than anything else.

    15. jeltz ◴[] No.45020161[source]
    No, why would it? The attack was named after the animal, slow loris, many years ago.
    16. jandrese ◴[] No.45020630[source]
    > The latest was a slow loris approach where it takes forever for robots.txt to download

    Applying penalties that exclusively hurt people who are trying to be respectful seems counterproductive.

    17. cyanydeez ◴[] No.45020789[source]
    It's likely just a shitty attempt to rate-limit bots
    18. aequitas ◴[] No.45023736[source]
    Bad guys might download the robots.txt to find the stuff the site doesn't want them to crawl.
    19. Snacklive ◴[] No.45027106[source]
    Would you be interested in writing an article about it? Sounds really interesting.
    replies(1): >>45065314 #
    20. 8organicbits ◴[] No.45065314[source]
    Yeah, seems like a good fit. It will end up here, if I get to it.

    https://alexsci.com/blog/