
646 points | blendergeek | source
quchen ◴[] No.42725651[source]
Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? Furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browsing GitHub and HN for software like this to keep it from polluting their data lakes, I wonder whether this is really an efficient approach.
replies(9): >>42725708 #>>42725957 #>>42725983 #>>42726183 #>>42726352 #>>42726426 #>>42727567 #>>42728923 #>>42730108 #
grajaganDev ◴[] No.42725708[source]
I am not sure. How would crawlers filter this?
replies(2): >>42725835 #>>42726294 #
captainmuon ◴[] No.42725835[source]
Check if the response time, the length of the "main text", or other indicators are in the lowest few percentiles -> send to the heap for manual review.
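
A minimal sketch of what that percentile check could look like, assuming per-page response times and main-text lengths are already logged during the crawl (the field names and the 5% cutoff are illustrative, not from the comment):

    import numpy as np

    def flag_low_percentile(pages, pct=5):
        # Flag pages whose response time or extracted main-text length
        # falls in the lowest few percentiles of the whole crawl.
        times = np.array([p["response_time"] for p in pages])
        lengths = np.array([p["main_text_len"] for p in pages])
        t_cut = np.percentile(times, pct)
        l_cut = np.percentile(lengths, pct)
        return [p for p in pages
                if p["response_time"] <= t_cut or p["main_text_len"] <= l_cut]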

Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.
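
One way to approximate that topic check: compare each page's text against the domain's aggregate text and flag pages that don't resemble it. A sketch using scikit-learn's TF-IDF vectorizer (the 0.1 similarity threshold is an arbitrary example, not a tuned value):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def off_topic_pages(domain_text, page_texts, threshold=0.1):
        # Return indices of pages whose TF-IDF cosine similarity to the
        # domain-level text falls below the threshold -> manual review.
        vec = TfidfVectorizer(stop_words="english")
        matrix = vec.fit_transform([domain_text] + page_texts)
        sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
        return [i for i, s in enumerate(sims) if s < threshold]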

Hire a bunch of student jobbers, have them search GitHub for tarpits, and let them write middleware to detect those.
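
That middleware could be as simple as a per-domain guard in the crawl loop: stop fetching from a domain once too many of its pages look like generated filler (slow, link-heavy, repetitive). All thresholds below are made up for illustration:

    from collections import defaultdict

    class TarpitGuard:
        # Per-domain heuristic: skip a domain once enough of its pages
        # look like tarpit output rather than real content.
        def __init__(self, max_suspicious=20):
            self.suspicious = defaultdict(int)
            self.max_suspicious = max_suspicious

        def looks_generated(self, response_time, text, links):
            words = text.split()
            unique_ratio = len(set(words)) / max(len(words), 1)
            return (response_time > 5.0          # deliberately slow responses
                    or len(links) > len(words)   # more links than words
                    or unique_ratio < 0.2)       # highly repetitive text

        def should_skip(self, domain, response_time, text, links):
            if self.looks_generated(response_time, text, links):
                self.suspicious[domain] += 1
            return self.suspicious[domain] >= self.max_suspicious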

If you are doing broad crawling, you already need to do this kind of thing anyway.

replies(1): >>42727490 #
dylan604 ◴[] No.42727490[source]
> Hire a bunch of student jobbers,

Do people still do this, or do they just offshore the task?