
597 points | classichasclass | 1 comment
8organicbits:
I've been working on a web crawler and have been trying to make it as friendly as possible: strictly checking robots.txt, crawling slowly, identifying itself clearly in the User-Agent string, and crawling from a single source IP address. But I've noticed some anti-bot tricks getting applied to the robots.txt file itself. The latest was a slow loris approach where robots.txt takes forever to download. I accidentally treated this as a 404, which meant I continued to crawl that site. I had to change the code so that a robots.txt timeout is treated like a Disallow /.

It feels odd because I find I'm writing code to detect anti-bot tools even though I'm trying my best to follow conventions.
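
Here is a minimal sketch of that fail-closed policy in Python, assuming the third-party requests library; the name fetch_robots, the "FriendlyCrawler/1.0" User-Agent, and the 10-second budget are hypothetical, not from the parent's code. A per-read timeout alone won't catch a slow-loris server that trickles one byte every few seconds, so the sketch streams the body and enforces an overall wall-clock deadline itself.

    import time
    import urllib.robotparser

    import requests  # assumption: third-party HTTP client

    FETCH_BUDGET = 10.0     # hypothetical total wall-clock budget, seconds
    MAX_BYTES = 512 * 1024  # cap on robots.txt size

    def fetch_robots(url):
        """Fetch robots.txt, failing closed (Disallow /) on timeouts."""
        rp = urllib.robotparser.RobotFileParser(url)
        start = time.monotonic()
        try:
            resp = requests.get(
                url, stream=True, timeout=(5, 5),
                headers={"User-Agent": "FriendlyCrawler/1.0"})
            if resp.status_code == 404:
                rp.allow_all = True  # genuinely missing: no rules to obey
                return rp
            resp.raise_for_status()  # 401/403/5xx fall through to disallow
            chunks, size = [], 0
            for chunk in resp.iter_content(chunk_size=4096):
                size += len(chunk)
                # A slow loris never trips the per-read timeout, so check
                # elapsed wall-clock time on every chunk instead.
                if size > MAX_BYTES or time.monotonic() - start > FETCH_BUDGET:
                    raise TimeoutError("robots.txt exceeded fetch budget")
                chunks.append(chunk)
            rp.parse(b"".join(chunks).decode("utf-8", "replace").splitlines())
        except (requests.RequestException, TimeoutError):
            rp.disallow_all = True  # timeout or error: treat as Disallow /
        return rp

After this, rp.can_fetch("FriendlyCrawler", page_url) answers per URL. Failing closed on anything other than a clean fetch or a genuine 404 errs on the side of the site operator, which matches the fix described above.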

fsckboy:
>a slow loris approach

Does this refer to the word "loris", which was only added to Wordle™ recently, after several years?

jeltz:
No, why would it? The attack was named after the animal, the slow loris, many years ago (the Slowloris tool dates to 2009, long before Wordle existed).