
Faking a JPEG

(www.ty-penguin.org.uk)
269 points by todsacerdoti | 8 comments
1. tomsmeding ◴[] No.44540147[source]
They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
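For illustration only (the real robots.txt reportedly disallows just the /spigot/ tree; the /~auj/ line below is a hypothetical addition, not what the site actually serves): an entry that also kept well-behaved crawlers out of the /~auj namespace would look like this:

    User-agent: *
    Disallow: /spigot/
    Disallow: /~auj/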

replies(2): >>44540567 #>>44540604 #
2. josephg ◴[] No.44540567[source]
> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

So? What duty do web site operators have to be "nice" to people scraping their websites?

replies(2): >>44540606 #>>44540718 #
3. bstsb ◴[] No.44540604[source]
Previously, the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.
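As an illustration of that "> 15 requests per second" figure (not the author's code; the log format, threshold, and output are all assumptions), a minimal Python sketch that flags clients exceeding that rate in a common-format access log read from stdin:

    import re
    import sys
    from collections import Counter

    # Count hits per (client, second). Assumes common log format lines such as:
    # 1.2.3.4 - - [10/Jul/2025:12:00:01 +0000] "GET /~auj/cheese HTTP/1.1" 200 ...
    line_re = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
    hits = Counter()
    for line in sys.stdin:
        m = line_re.match(line)
        if m:
            client, timestamp = m.groups()   # timestamp has one-second resolution
            hits[(client, timestamp)] += 1

    # Flag any client that exceeds 15 requests within a single second.
    # (A check for a *sustained* rate would look at runs of consecutive seconds.)
    for (client, timestamp), count in sorted(hits.items(), key=lambda kv: -kv[1]):
        if count > 15:
            print(f"{client}: {count} requests during {timestamp}")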

replies(1): >>44541287 #
4. gary_0 ◴[] No.44540606[source]
The Marginalia search engine or archive.org probably don't deserve such treatment--they're performing a public service that benefits everyone, for free. And it's generally not in one's best interests to serve a bunch of garbage to Google or Bing's crawlers, either.
5. suspended_state ◴[] No.44540718[source]
The point is that not every web crawler is out there to scrape websites.
replies(1): >>44541486 #
6. yorwba ◴[] No.44541287[source]
The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make fewer. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much, and only poison crawlers that don't back off afterwards.
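A minimal sketch of that policy (not Spigot's actual behaviour; the window size, limits, and names are all made up): track per-client request rates, answer with 429 first, and only divert clients that keep hammering after being throttled:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS_IN_WINDOW = 150   # roughly 15 requests/second sustained
    STRIKES_BEFORE_POISON = 100    # over-limit requests tolerated before poisoning

    recent = defaultdict(deque)    # client -> timestamps of recent requests
    strikes = defaultdict(int)     # client -> requests made while over the limit

    def classify(client, now=None):
        """Return 'ok', 'throttle' (send 429 + Retry-After), or 'poison'."""
        now = time.time() if now is None else now
        q = recent[client]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) <= MAX_REQUESTS_IN_WINDOW:
            return "ok"
        strikes[client] += 1
        if strikes[client] >= STRIKES_BEFORE_POISON:
            return "poison"        # kept hammering despite 429s; hand it to the spigot
        return "throttle"          # 429 Too Many Requests with a Retry-After header

A crawler that backs off after the first few 429s never accumulates enough strikes to be poisoned; one that ignores them eventually does.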
replies(1): >>44541393 #
7. evgpbfhnr ◴[] No.44541393{3}[source]
When crawlers use a botnet to make only one request per IP over a long period, though, that's not realistic to implement.
8. andybak ◴[] No.44541486{3}[source]
Unless you define "scrape" as inherently nefarious, then surely they are? Isn't a web crawler, by definition, something that scrapes websites?