
simonw ◴[] No.45231789[source]
Something I'd be interested to understand is how widespread this practice is. Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

There are a whole lot of organizations training competent LLMs these days in addition to the big three (OpenAI, Google, Anthropic).

What about Mistral and Moonshot and Qwen and DeepSeek and Meta and Microsoft (Phi) and Hugging Face and Ai2 and MBZUAI? Do they all have their own (potentially outsourced) teams of human labelers?

I always look out for notes about this in model cards and papers but it's pretty rare to see any transparency about how this is done.

replies(6): >>45231815 #>>45231866 #>>45231939 #>>45232099 #>>45232271 #>>45234507 #
happy_dog1 ◴[] No.45231939[source]
I've shared this once on HN before, but it's very relevant to this question and just a really great article so I'll reshare it here:

https://www.theverge.com/features/23764584/ai-artificial-int...

It explores the world of outsourced labeling work. Unfortunately, hard numbers on how many people are involved are difficult to come by because, as the article notes:

"This tangled supply chain is deliberately hard to map. According to people in the industry, the companies buying the data demand strict confidentiality. (This is the reason Scale cited to explain why Remotasks has a different name.) Annotation reveals too much about the systems being developed, and the huge number of workers required makes leaks difficult to prevent. Annotators are warned repeatedly not to tell anyone about their jobs, not even their friends and co-workers, but corporate aliases, project code names, and, crucially, the extreme division of labor ensure they don’t have enough information about them to talk even if they wanted to. (Most workers requested pseudonyms for fear of being booted from the platforms.) Consequently, there are no granular estimates of the number of people who work in annotation, but it is a lot, and it is growing. A recent Google Research paper gave an order-of-magnitude figure of “millions” with the potential to become “billions.” "

I too would love to know how much human effort is going into labeling and feedback for each of these models.

replies(2): >>45232133 #>>45234569 #
simonw ◴[] No.45232133[source]
That was indeed a great article, but it is a couple of years old now. A lot of the labeling work described there relates to older forms of machine learning: moderation models, spam labelers, image segmentation, etc.

Is it possible in 2025 to train a useful LLM without hiring thousands of labelers? Maybe by drawing on open datasets (themselves built on human labor) that did not exist two years ago?
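
If so, the route would presumably look something like the rough sketch below: pulling an openly licensed, volunteer-labeled corpus instead of commissioning new annotation. This is only an illustration, assuming the Hugging Face datasets library and the OpenAssistant oasst1 data as one example of such a corpus:

    # Minimal sketch: sourcing instruction/preference data from an openly
    # licensed, volunteer-labeled corpus rather than a hired labeling workforce.
    # Assumes the Hugging Face `datasets` library; "OpenAssistant/oasst1" is
    # just one example of a community-contributed dataset.
    from datasets import load_dataset

    # OpenAssistant conversations: volunteer-written prompts, replies, and rankings.
    oasst = load_dataset("OpenAssistant/oasst1", split="train")

    # Keep assistant turns that volunteers ranked best among their siblings,
    # a crude stand-in for the "chosen" side of a preference pair.
    best_replies = oasst.filter(
        lambda row: row["role"] == "assistant" and row["rank"] == 0
    )

    print(len(best_replies), "volunteer-ranked assistant replies")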

replies(1): >>45232321 #
happy_dog1 ◴[] No.45232321[source]
Good question; I don't personally know. The linked article suggests there are plenty of people working on human feedback for chatbots, but it still doesn't give us hard numbers or any sense of how the number of people involved is changing over time. Perhaps the best datapoint I have is that revenue at Surge AI (one of many companies that provide data labeling services to Google, OpenAI, and others) has grown significantly in recent years, partly in the wake of Meta's investment in Scale AI, and is now at $1.2 billion without the company having raised any outside VC funding:

https://finance.yahoo.com/news/surge-ai-quietly-hit-1b-15005...

Their continued revenue growth is at least one datapoint suggesting that the number of people working in this field (or at least the amount of money being spent on it) is not decreasing.

Also see the really helpful comment above from cjbarber: there are quite a lot of companies providing these services to foundation model companies. That's another datapoint suggesting that the number of people providing labeling and feedback is definitely not decreasing and is more likely increasing. Hard numbers and increased transparency would be nice, but I suspect they will be hard to find.