
283 points | Brajeshwar | 2 comments
simonw No.45231789
Something I'd be interested to understand is how widespread this practice is. Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

There are a whole lot of organizations training competent LLMs these days in addition to the big three (OpenAI, Google, Anthropic).

What about Mistral and Moonshot and Qwen and DeepSeek and Meta and Microsoft (Phi) and Hugging Face and Ai2 and MBZUAI? Do they all have their own (potentially outsourced) teams of human labelers?

I always look out for notes about this in model cards and papers but it's pretty rare to see any transparency about how this is done.

replies(6): >>45231815 #>>45231866 #>>45231939 #>>45232099 #>>45232271 #>>45234507 #
whilenot-dev No.45231866
So why do you think asking this question here would yield a satisfying answer, especially given how the HN community likes to dispute any vague conclusions about anything as hyped as AI training?

To counter your question: what makes you think that's not the case? Do you think Mistral/Moonshot/Qwen/etc. are all employing their own data labelers? Why would you expect this kind of transparency from for-profit bodies that are valued in the billions?

replies(1): >>45232081 #
simonw No.45232081
If you don't ask the question, you'll definitely not get an answer. Given how many AI labs follow Hacker News, it's not a bad place to pose this.

"what makes you think that's not the case?"

I genuinely do not have enough information to form an opinion one way or the other.

replies(1): >>45232150 #
1. whilenot-dev No.45232150
> If you don't ask the question you'll definitely not get an answer.

Sure, but the way you're formulating the question is already casting an opinion. Besides, no one could even attempt to answer your questions without falling into the trap of true diligence... one question just asks whether all (with emphasis!) LLMs are trained this way:

> Are all of the LLMs trained using human labor that is sometimes exposed to extreme content?

Who in the world would even be in a position to answer that?

replies(1): >>45232291 #
2. simonw No.45232291
That question could be answered by a single counterexample: if someone has trained even one competent LLM without any human labor being exposed to extreme content, then not all LLMs were trained that way.