‘Overworked, underpaid’ humans train Google’s AI

(www.theguardian.com)

283 points Brajeshwar | 1 comments | 13 Sep 25 11:30 UTC | HN request time: 0s | source

simonw ◴[13 Sep 25 13:06 UTC] No.45231789[source]▶

Something I'd be interested to understand is how widespread this practice is. Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

There are a whole lot of organizations training competent LLMs these days in addition to the big three (OpenAI, Google, Anthropic).

What about Mistral and Moonshot and Qwen and DeepSeek and Meta and Microsoft (Phi) and Hugging Face and Ai2 and MBZUAI? Do they all have their own (potentially outsourced) teams of human labelers?

I always look out for notes about this in model cards and papers but it's pretty rare to see any transparency about how this is done.

replies(6): >>45231815 #>>45231866 #>>45231939 #>>45232099 #>>45232271 #>>45234507 #

yvdriess ◴[13 Sep 25 13:09 UTC] No.45231815[source]▶

>>45231789 #

One of the key innovations behind the DNN/CNN models was Mechanical Turk. OpenAI used a similar system extensively to improve the early GPT models. I would not be surprised if the practice continues today; NN models need a lot of quality ground truth training data.

replies(1): >>45231879 #

simonw ◴[13 Sep 25 13:18 UTC] No.45231879[source]▶

>>45231815 #

Right, but where are the details?

Given the number of labs that are competing these days on "open weights" and "transparency", I'd be very interested to read details of how some of them are handling the human side of their model training.

I'm puzzled at how little information I've been able to find.

replies(3): >>45232288 #>>45233086 #>>45233538 #

ics ◴[13 Sep 25 16:58 UTC] No.45233538[source]▶

>>45231879 #

This is not going to be as deep/specific as you want, but a starting point from one of the companies that handles this sort of work is here: https://humandata.mercor.com/mercors-approach/black-box-vs-o...
