283 points | Brajeshwar
simonw ◴[] No.45231789[source]
Something I'd be interested to understand is how widespread this practice is. Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

There are a whole lot of organizations training competent LLMs these days in addition to the big three (OpenAI, Google, Anthropic).

What about Mistral and Moonshot and Qwen and DeepSeek and Meta and Microsoft (Phi) and Hugging Face and Ai2 and MBZUAI? Do they all have their own (potentially outsourced) teams of human labelers?

I always look out for notes about this in model cards and papers but it's pretty rare to see any transparency about how this is done.

replies(6): >>45231815 #>>45231866 #>>45231939 #>>45232099 #>>45232271 #>>45234507 #
michaelt ◴[] No.45232271[source]
> Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

The business process outsourcing companies labelling things for AI training are often the same outsourcing companies that provide moderation services to Facebook and other social media companies.

I need 100k images labelled by the type of flower shown, for my flower-identifying AI, so I contract a business that does that sort of thing.

Facebook need 100k flagged images labelled by is-it-an-isis-beheading-video to keep on top of human reviews for their moderation queues. They contract with the same business.

The outsourcing company rotates workers between tasks, so nobody has to be on isis beheading videos for a whole shift.

replies(1): >>45232678 #
s1mplicissimus ◴[] No.45232678[source]
> The outsourcing company rotates workers between tasks, so nobody has to be on isis beheading videos for a whole shift.

Is that an assumption on your side, a claim made by the business, a documented process or something entirely different?

replies(2): >>45233069 #>>45233642 #
alasarmas ◴[] No.45233069[source]
It has been documented that human image moderators exist and that some have been deeply traumatized by their work. I have zero doubts that the datasets of content and metadata created by human image moderators are being bought and sold, literally trafficking in human suffering. Can you point to a comprehensive effort by the tech majors to create a freely-licensed dataset of violent content and metadata to prevent duplication of human suffering?
replies(1): >>45233732 #
michaelt ◴[] No.45233732[source]
Nobody's distributing a free dataset of child abuse, animal torture and terror beheading images, for obvious reasons.

There are some open-weight NSFW detectors [1], but even if your detector is 99.9% accurate, you still need an appeals/review mechanism. And someone's got to look at the appeals.

[1] https://github.com/yahoo/open_nsfw
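
Roughly the shape of the pipeline I mean, as a Python sketch. The score_image stub, the queue names and the thresholds are all hypothetical, not open_nsfw's actual interface; the point is just that the confident tails get automated while borderline scores and appeals still end up in front of a person.

    # Sketch only: confident scores are handled automatically, borderline
    # scores and appeals land in a human review queue. score_image is a
    # stand-in for a real classifier (e.g. open_nsfw); thresholds and
    # names are illustrative, not anyone's actual system.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ModerationQueues:
        auto_removed: List[str] = field(default_factory=list)
        auto_approved: List[str] = field(default_factory=list)
        human_review: List[str] = field(default_factory=list)

    def score_image(image_id: str) -> float:
        """Placeholder: return a probability-like score in [0, 1] that
        the image violates policy. Plug an actual model in here."""
        raise NotImplementedError

    def route(image_id: str, score: float, q: ModerationQueues,
              remove_above: float = 0.95, approve_below: float = 0.05) -> None:
        # Only the confident tails are automated; the middle goes to a person.
        if score >= remove_above:
            q.auto_removed.append(image_id)
        elif score <= approve_below:
            q.auto_approved.append(image_id)
        else:
            q.human_review.append(image_id)

    def appeal(image_id: str, q: ModerationQueues) -> None:
        # A user appeal against an automated removal also goes to a human,
        # which is why a 99.9%-accurate detector doesn't remove the humans.
        if image_id in q.auto_removed:
            q.auto_removed.remove(image_id)
            q.human_review.append(image_id)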

replies(2): >>45234018 #>>45239066 #
mallowdram ◴[] No.45234018[source]
All of this is so dystopian (flowers/beheadings) it makes Philip K. Dick look like a golden-age Hollywood musical. Are the engineers really so unaware of the essential primate forces underneath this, forces that cannot be sanitized out of the events? You can unearth our extinction from this value dichotomy.