283 points by Brajeshwar | 24 comments
1. simonw ◴[] No.45231789[source]
Something I'd be interested to understand is how widespread this practice is. Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

There are a whole lot of organizations training competent LLMs these days in addition to the big three (OpenAI, Google, Anthropic).

What about Mistral and Moonshot and Qwen and DeepSeek and Meta and Microsoft (Phi) and Hugging Face and Ai2 and MBZUAI? Do they all have their own (potentially outsourced) teams of human labelers?

I always look out for notes about this in model cards and papers but it's pretty rare to see any transparency about how this is done.

replies(6): >>45231815 #>>45231866 #>>45231939 #>>45232099 #>>45232271 #>>45234507 #
2. yvdriess ◴[] No.45231815[source]
One of the key enablers behind the DNN/CNN era was Mechanical Turk labeling. OpenAI used a similar system extensively to improve the early GPT models. I would not be surprised if the practice continues today; NN models need a lot of quality ground-truth training data.
replies(1): >>45231879 #
3. whilenot-dev ◴[] No.45231866[source]
So why do you think asking this question here would yield a satisfying answer, especially given how the HN community likes to dispute any vague conclusion about anything as hyped as AI training?

To counter your question, what makes you think that's not the case? Do you think Mistral/Moonshot/Qwen/etc. are all employing their own data labelers? Why would you expect this kind of transparency from for-profit bodies that are valued in the billions?

replies(1): >>45232081 #
4. simonw ◴[] No.45231879[source]
Right, but where are the details?

Given the number of labs that are competing these days on "open weights" and "transparency" I'd be very interested to read details of how some of them are handling the human side of their model training.

I'm puzzled at how little information I've been able to find.

replies(3): >>45232288 #>>45233086 #>>45233538 #
5. happy_dog1 ◴[] No.45231939[source]
I've shared this once on HN before, but it's very relevant to this question and just a really great article so I'll reshare it here:

https://www.theverge.com/features/23764584/ai-artificial-int...

It explores the world of outsourced labeling work. Unfortunately, hard numbers on how many people are involved are difficult to come by because, as the article notes:

"This tangled supply chain is deliberately hard to map. According to people in the industry, the companies buying the data demand strict confidentiality. (This is the reason Scale cited to explain why Remotasks has a different name.) Annotation reveals too much about the systems being developed, and the huge number of workers required makes leaks difficult to prevent. Annotators are warned repeatedly not to tell anyone about their jobs, not even their friends and co-workers, but corporate aliases, project code names, and, crucially, the extreme division of labor ensure they don’t have enough information about them to talk even if they wanted to. (Most workers requested pseudonyms for fear of being booted from the platforms.) Consequently, there are no granular estimates of the number of people who work in annotation, but it is a lot, and it is growing. A recent Google Research paper gave an order-of-magnitude figure of “millions” with the potential to become “billions.” "

I too would love to know how much human effort goes into labeling and feedback for each of these models.

replies(2): >>45232133 #>>45234569 #
6. simonw ◴[] No.45232081[source]
If you don't ask the question you'll definitely not get an answer. Given how many AI labs follow Hacker News it's not a bad place to pose this.

"what makes you think that's not the case?"

I genuinely do not have enough information to form an opinion one way or the other.

replies(1): >>45232150 #
7. ics ◴[] No.45232099[source]
I have been a generalist annotator for some of the others you mentioned; due to NDAs I won't specify which. I would venture to guess that basically all major models use some degree of human feedback if there is money coming in from somewhere.
8. simonw ◴[] No.45232133[source]
That was indeed a great article, but it is a couple of years old now. A lot of the labeling work described there relates to older forms of machine learning - moderation models, spam labelers, image segmentation, etc.

Is it possible in 2025 to train a useful LLM without hiring thousands of labelers? Maybe through application of open datasets (themselves based on human labor) that did not exist two years ago?

replies(1): >>45232321 #
9. whilenot-dev ◴[] No.45232150{3}[source]
> If you don't ask the question you'll definitely not get an answer.

Sure, but the way you're formulating the question already casts an opinion. Besides, no one could even attempt to answer your questions without impossible due diligence... one question asks about all (with emphasis!) LLMs at once:

> Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

Who in the world would even be in such a position?

replies(1): >>45232291 #
10. michaelt ◴[] No.45232271[source]
> Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

The business process outsourcing companies labelling things for AI training are often the same outsourcing companies providing moderation services to Facebook and other social media companies.

I need 100k images labelled by the type of flower shown, for my flower-identifying AI, so I contract a business that does that sort of thing.

Facebook needs 100k flagged images labelled by is-it-an-ISIS-beheading-video to keep on top of human reviews for their moderation queues. They contract with the same business.

The outsourcing company rotates workers between tasks, so nobody has to be on isis beheading videos for a whole shift.

replies(1): >>45232678 #
11. esperent ◴[] No.45232288{3}[source]
I read this a few years ago.

Time Exclusive: OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic

https://time.com/6247678/openai-chatgpt-kenya-workers/

Beyond that, I think the reason you haven't heard more about it is that it happens in developing countries, so western media doesn't care much, and also because big AI companies work hard to distance themselves from it. They'll never be the ones directly employing these AI sweatshop workers; it's all contracted out.

12. simonw ◴[] No.45232291{4}[source]
That question could be settled by a counterexample: if someone has trained a single competent LLM without any human labor being exposed to extreme content, then not all LLMs were trained that way.
13. happy_dog1 ◴[] No.45232321{3}[source]
Good question, I don't personally know. The linked article would suggest there are plenty of people working on human feedback for chatbots, but that still doesn't give us any hard numbers or any sense of how the number of people involved is changing over time. Perhaps the best datapoint I have is that revenue for SurgeAI (one of many companies that provides data labeling services to Google and OpenAI among others) has grown significantly in recent years, partly due to ScaleAI's acquisition by Meta, and is now at $1.2 billion without having raised any outside VC funding:

https://finance.yahoo.com/news/surge-ai-quietly-hit-1b-15005...

Their continued revenue growth is at least one datapoint to suggest that the number of people working in this field (or at least the amount of money spent on this field) is not decreasing.

Also see the really helpful comment above from cjbarber; there are quite a few companies providing these services to foundation model companies. That's another datapoint to suggest the number of people providing labeling / feedback is definitely not decreasing and is more likely increasing. Hard numbers / increased transparency would be nice but I suspect will be hard to find.

14. s1mplicissimus ◴[] No.45232678[source]
> The outsourcing company rotates workers between tasks, so nobody has to be on isis beheading videos for a whole shift.

Is that an assumption on your side, a claim made by the business, a documented process or something entirely different?

replies(2): >>45233069 #>>45233642 #
15. alasarmas ◴[] No.45233069{3}[source]
It has been documented that human image moderators exist and that some have been deeply traumatized by their work. I have zero doubts that the datasets of content and metadata created by human image moderators are being bought and sold, literally trafficking in human suffering. Can you point to a comprehensive effort by the tech majors to create a freely-licensed dataset of violent content and metadata to prevent duplication of human suffering?
replies(1): >>45233732 #
16. conradkay ◴[] No.45233086{3}[source]
Good article from 2023, not much data though if that's what you're looking for:

https://nymag.com/intelligencer/article/ai-artificial-intell...

unwalled: https://archive.ph/Z6t35

Generally seems similar today, just on a bigger Scale, with much more focus on coding.

Here in the US, DataAnnotation seems to be the most heavily marketed company offering these jobs.

17. ics ◴[] No.45233538{3}[source]
This is not going to be as deep/specific as you want but a starting point from one of the companies that handles this sort of work is here: https://humandata.mercor.com/mercors-approach/black-box-vs-o...
18. michaelt ◴[] No.45233642{3}[source]
I know for certain it's whatever you care to contract for, but rotation between tasks is common.

A lot of these suppliers provide on-demand workers - if you need 40 man-hours of work on a one-off task, they can put 8 people on it and get you results within 5 hours.

On the other hand, if you want the same workers every time, it can be arranged. If you want a fixed number of workers on an agreed-upon shift pattern, they can do that too.

Even when there is a rotation, the most undesirable tasks often pay a few bucks extra per hour, so I wouldn't be surprised if there were some people who opted to stay on the worst jobs for a full shift.

replies(1): >>45236911 #
19. michaelt ◴[] No.45233732{4}[source]
Nobody's distributing a free dataset of child abuse, animal torture and terror beheading images, for obvious reasons.

There are some open-weights NSFW detectors [1] but even if your detector is 99.9% accurate, you still need an appeals/review mechanism. And someone's got to look at the appeals.

[1] https://github.com/yahoo/open_nsfw
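To make the appeals point concrete, here's a minimal sketch (not any real moderation stack; the thresholds and the source of the score are hypothetical stand-ins for whatever detector you run):

    # Hypothetical routing layer around an automated NSFW/abuse classifier.
    # Even a 99.9%-accurate model produces false positives at scale, so
    # borderline scores and user appeals still land in a human review queue.

    from dataclasses import dataclass, field
    from typing import List

    BLOCK_THRESHOLD = 0.98   # auto-remove above this score
    ALLOW_THRESHOLD = 0.05   # auto-allow below this score

    @dataclass
    class ReviewQueue:
        items: List[str] = field(default_factory=list)

        def enqueue(self, image_id: str) -> None:
            # A human moderator eventually looks at everything queued here.
            self.items.append(image_id)

    def route(image_id: str, score: float, queue: ReviewQueue) -> str:
        """Decide what happens to one image given its classifier score."""
        if score >= BLOCK_THRESHOLD:
            return "blocked"          # user can still appeal -> human review
        if score <= ALLOW_THRESHOLD:
            return "allowed"
        queue.enqueue(image_id)       # the uncertain band goes straight to humans
        return "needs_review"

    if __name__ == "__main__":
        queue = ReviewQueue()
        for img, s in [("a.jpg", 0.99), ("b.jpg", 0.50), ("c.jpg", 0.01)]:
            print(img, route(img, s, queue))
        print("queued for human review:", queue.items)

Even with very high accuracy, at hundreds of millions of uploads the "needs_review" band plus the appeals on the blocked ones still add up to a lot of human eyeballs.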

replies(2): >>45234018 #>>45239066 #
20. mallowdram ◴[] No.45234018{5}[source]
All of this is so dystopian (flowers/beheadings) it makes Philip K. Dick look like a golden-age Hollywood musical. Are the engineers so unaware of the essential primate forces underneath this, forces that cannot be sanitized out of these events? You can unearth our extinction from this value dichotomy.
21. kilroy123 ◴[] No.45234507[source]
Stupid question... How can we build on these models without the humans doing all this work?

Even theoretically.

22. johnnyanmac ◴[] No.45234569[source]
Why is it so secretive? This gives me Severance vibes.

Is it just to dodge labor laws?

23. throwaway219450 ◴[] No.45236911{4}[source]
Having tried both strategies, unless your task is brain-dead simple and/or you have a way to cheaply and deterministically validate the labels, always pay to retain the team.

Even if you can afford only a couple of people a month and it takes 5x as long, do it. It's much easier to deal with high-quality data than to firefight large quantities of slop. Your annotators will get faster and more accurate over time. And don't underestimate the time it takes to review thousands of labels. Even if you get results in 5 hours, someone has to check whether they're any good. You might find that your bottleneck is the review process. Most shops can implement a QA layer for you, but not requesting it upfront is a trap for young players.

24. alasarmas ◴[] No.45239066{5}[source]
I mean, yes, my assumption is that there exists an image/video normalization algorithm that can be followed by hashing the normalized value. There's a CSAM scanning tool that I believe uses a similar approach.
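For the shape of it only, here's a toy version of that normalize-then-hash idea (an average hash built with Pillow; real scanning tools use far more robust perceptual hashes, and the filenames here are hypothetical):

    # Toy perceptual hash: normalize an image (grayscale, fixed size),
    # then hash the normalized pixels. Two near-duplicate images should
    # produce hashes with a small Hamming distance, so a known-bad image
    # can be matched without a human re-reviewing every copy of it.

    from PIL import Image

    def average_hash(path: str, size: int = 8) -> int:
        img = Image.open(path).convert("L").resize((size, size))  # normalize
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > avg else 0)  # 64-bit fingerprint
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    if __name__ == "__main__":
        h1 = average_hash("known_bad.jpg")   # hypothetical reference image
        h2 = average_hash("candidate.jpg")   # hypothetical new upload
        print("distance:", hamming(h1, h2))  # small distance -> likely the same image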