
597 points classichasclass | 3 comments
8organicbits ◴[] No.45012759[source]
I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, identifying itself clearly in the User-Agent string, crawling from a single source IP address. But I've noticed some anti-bot tricks being applied to the robots.txt file itself. The latest was a slow-loris approach where robots.txt takes forever to download. I accidentally treated this as a 404, which meant I continued to crawl that site. I had to change the code so that a robots.txt timeout is treated like a Disallow: /.

It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
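A minimal sketch of that fix in Python, assuming a requests-based fetcher; the function name, timeout, and deadline values are illustrative:

    import time
    import urllib.robotparser

    import requests

    DISALLOW_ALL = "User-agent: *\nDisallow: /"
    DEADLINE_SECS = 10  # wall-clock budget for the whole download

    def fetch_robots(base_url, user_agent):
        """Fetch robots.txt; timeouts (e.g. slow-loris tarpits) parse as Disallow: /."""
        rp = urllib.robotparser.RobotFileParser()
        try:
            resp = requests.get(base_url + "/robots.txt", timeout=5, stream=True,
                                headers={"User-Agent": user_agent})
            if resp.status_code == 404:
                body = ""  # genuinely missing: everything is allowed
            else:
                resp.raise_for_status()
                # requests' timeout is per-read, not total, so a server that
                # trickles one byte at a time never trips it; enforce a
                # wall-clock deadline on the whole download as well.
                start, chunks = time.monotonic(), []
                for chunk in resp.iter_content(chunk_size=1024):
                    if time.monotonic() - start > DEADLINE_SECS:
                        raise requests.Timeout("robots.txt download too slow")
                    chunks.append(chunk)
                body = b"".join(chunks).decode("utf-8", errors="replace")
        except requests.RequestException:
            body = DISALLOW_ALL  # timeout or any other failure: back off entirely
        rp.parse(body.splitlines())
        return rp

The crawler then gates every request through rp.can_fetch(user_agent, url), so a tarpitted site simply drops out of the crawl.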

replies(7): >>45013175 #>>45014774 #>>45015149 #>>45018582 #>>45018859 #>>45020630 #>>45027106 #
1. ronsor ◴[] No.45018582[source]
> The latest was a slow loris approach where it takes forever for robots.txt to download.

I'd treat this in a client the same way as I do in a server application: if the peer is behaving maliciously or improperly, I silently drop the TCP connection without notifying the other party. They can waste their own resources by continuing to send bytes for the next few minutes until their TCP stack figures out what happened.

replies(1): >>45019200 #
2. conradludgate ◴[] No.45019200[source]
How do you silently drop a TCP connection? Closing the socket fd usually results in a FIN packet being sent whether I want it to or not.

Additionally, it isn't going to waste that many resources: once the socket is closed, your kernel will send a RST the next time a data packet from the peer arrives.
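(For reference, the closest the standard sockets API gets to an impolite close is an abortive close: enabling SO_LINGER with a zero linger time makes close() send a RST immediately instead of a FIN. Still not silent, just ruder. A sketch, assuming Linux/BSD semantics:

    import socket
    import struct

    def abortive_close(sock):
        """Close with a RST instead of a FIN (SO_LINGER on, linger time 0).
        The peer is still notified, but local state is freed immediately
        and the socket skips TIME_WAIT."""
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))  # l_onoff=1, l_linger=0
        sock.close()
)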

replies(1): >>45019297 #
3. ronsor ◴[] No.45019297[source]
TCP_REPAIR: https://tinselcity.github.io/TCP_Repair/
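Roughly what the linked page describes: putting a Linux socket into repair mode before closing it suppresses the FIN (and any RST), so the peer never learns the connection is gone. A minimal sketch, assuming Linux and CAP_NET_ADMIN; the constant comes from <linux/tcp.h> since Python's socket module doesn't export it:

    import socket

    TCP_REPAIR = 19  # from <linux/tcp.h>

    def silent_close(sock):
        """Drop a connection without emitting any packet (Linux, CAP_NET_ADMIN).
        In repair mode the kernel tears down the socket locally on close and
        sends nothing, leaving the peer to discover the loss on its own."""
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        sock.close()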