
Faking a JPEG

(www.ty-penguin.org.uk)
341 points todsacerdoti | 2 comments
tomsmeding ◴[] No.44540147[source]
They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
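To make the gap concrete, here is a minimal sketch of how a well-behaved crawler would read such rules, using Python's standard urllib.robotparser. The rule text below is an assumption reconstructed from the description above (only the spigot tree disallowed), not a copy of the site's actual robots.txt. The host example.org and the agent name ExampleBot are placeholders, so no real links are created.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative rules only, not the site's actual robots.txt.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /spigot/
"""

# Hypothetical host, used so that no real links are created.
BASE = "https://example.org"

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the assumed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, BASE + path)


print(allowed("/spigot/"))      # False: covered by the Disallow rule
print(allowed("/~auj/cheese"))  # True: no rule matches this prefix
```

Under these assumed rules, a crawler that honours robots.txt still has no signal to stay out of the /~auj tree, which is the situation described above.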

replies(2): >>44540567 #>>44540604 #
josephg ◴[] No.44540567[source]
> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

So? What duty do website operators have to be "nice" to people scraping their websites?

replies(2): >>44540606 #>>44540718 #
1. suspended_state ◴[] No.44540718[source]
The point is that not every web crawler is out there to scrape websites.
replies(1): >>44541486 #
2. andybak ◴[] No.44541486[source]
Unless you define "scrape" to be inherently nefarious - then surely they are? Isn't the definition of a web crawler based on scraping websites?