
Faking a JPEG

(www.ty-penguin.org.uk)
356 points by todsacerdoti
tomsmeding No.44540147
They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
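
For comparison, keeping well-intentioned crawlers out of that tree would only take one more Disallow line in robots.txt. A sketch of the standard syntax, using the path from [2] — not the site's actual file, whose exact contents aren't shown here:

    User-agent: *
    Disallow: /spigot/
    Disallow: /~auj/cheese/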

replies(2): >>44540567 #>>44540604 #
bstsb No.44540604
Previously, the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.
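
To make that threshold concrete, a sliding-window counter per client IP is enough to tell a sustained > 15 requests/second crawler apart from ordinary visitors. A rough Python sketch — the window length is an assumption, and nothing here is taken from the site's actual setup:

    import time
    from collections import defaultdict, deque

    WINDOW = 10.0       # seconds to look back over (assumed)
    THRESHOLD = 15.0    # sustained requests per second, per the quote above

    recent = defaultdict(deque)  # client IP -> timestamps of recent requests

    def is_hammering(ip, now=None):
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        # forget requests that have fallen out of the window
        while q and q[0] < now - WINDOW:
            q.popleft()
        return len(q) / WINDOW > THRESHOLD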

replies(1): >>44541287 #
yorwba No.44541287
The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make less. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much and only poison crawlers that don't back off afterwards.
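
A rough sketch of that escalation, reusing the is_hammering() check sketched above — serve_normally, respond and poison are hypothetical stand-ins, not Spigot's real code:

    import time

    offenders = {}   # client IP -> time we first sent it a 429

    def handle(ip, request):
        if not is_hammering(ip):
            return serve_normally(request)
        if ip not in offenders:
            # first time over the limit: ask it to back off
            offenders[ip] = time.time()
            return respond(429, headers={"Retry-After": "60"})
        # already warned and still hammering: feed it the infinite page zoo
        return poison(request)
        # (a real version would also expire old entries from offenders)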
replies(1): >>44541393 #
evgpbfhnr No.44541393
When crawlers use a botnet to make only one request per IP over a long period, though, that's not realistic to implement.