
Faking a JPEG

(www.ty-penguin.org.uk)
356 points by todsacerdoti
tomsmeding No.44540147
They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
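
For comparison, keeping well-intentioned crawlers out of that tree would only take one more Disallow line in robots.txt. A sketch of the standard syntax, using the path from [2] — not the site's actual file, whose exact contents aren't shown here:

    User-agent: *
    Disallow: /spigot/
    Disallow: /~auj/cheese/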

replies(2): >>44540567 #>>44540604 #
bstsb No.44540604
Previously, the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.
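
To make that threshold concrete, a sliding-window counter per client IP is enough to tell a sustained > 15 requests/second crawler apart from ordinary visitors. A rough Python sketch — the window length is an assumption, and nothing here is taken from the site's actual setup:

    import time
    from collections import defaultdict, deque

    WINDOW = 10.0       # seconds to look back over (assumed)
    THRESHOLD = 15.0    # sustained requests per second, per the quote above

    recent = defaultdict(deque)  # client IP -> timestamps of recent requests

    def is_hammering(ip, now=None):
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        # forget requests that have fallen out of the window
        while q and q[0] < now - WINDOW:
            q.popleft()
        return len(q) / WINDOW > THRESHOLD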

replies(1): >>44541287 #
yorwba No.44541287
The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make less. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much and only poison crawlers that don't back off afterwards.
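
A rough sketch of that escalation, reusing the is_hammering() check sketched above — serve_normally, respond and poison are hypothetical stand-ins, not Spigot's real code:

    import time

    offenders = {}   # client IP -> time we first sent it a 429

    def handle(ip, request):
        if not is_hammering(ip):
            return serve_normally(request)
        if ip not in offenders:
            # first time over the limit: ask it to back off
            offenders[ip] = time.time()
            return respond(429, headers={"Retry-After": "60"})
        # already warned and still hammering: feed it the infinite page zoo
        return poison(request)
        # (a real version would also expire old entries from offenders)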
replies(1): >>44541393 #
evgpbfhnr No.44541393
When crawlers use a botnet to make only one request per IP over a long period, though, that's not realistic to implement.