
745 points by melded | 74 comments
RandyOrion ◴[] No.45950598[source]
This repo is valuable for local LLM users like me.

I just want to reiterate that the term "LLM safety" means very different things to large corporations and to LLM users.

Large corporations often say they "do safety alignment on LLMs". What they actually do is avoid anything that causes damage to their own interests. That includes forcing LLMs to meet certain legal requirements, as well as forcing LLMs to output "values, facts, and knowledge" that favor themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about the organizations and people behind LLMs.

As an average LLM user, what I want is maximum factual knowledge and capability from LLMs, which is what these large corporations claimed to offer in the first place. It's very clear that my interests as an LLM user are not aligned with those of these large corporations.

replies(3): >>45950680 #>>45950819 #>>45953209 #
squigz ◴[] No.45950680[source]
> forcing LLMs to output "values, facts, and knowledge" that favor themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about the organizations and people behind LLMs.

Can you provide some examples?

replies(11): >>45950779 #>>45950826 #>>45951031 #>>45951052 #>>45951429 #>>45951519 #>>45951668 #>>45951855 #>>45952066 #>>45952692 #>>45953787 #
1. b3ing ◴[] No.45950779[source]
Grok is known to be tweaked to certain political ideals

Also, I’m sure some AI models might suggest that labor unions are bad; if not now, they will soon.

replies(5): >>45950830 #>>45950866 #>>45951393 #>>45951406 #>>45952365 #
2. xp84 ◴[] No.45950830[source]
That may be so, but the rest of the models are so thoroughly terrified of questioning liberal US orthodoxy that it’s painful. I remember seeing a hilarious comparison of models where most of them feel that it’s not acceptable to “intentionally misgender one person” even in order to save a million lives.
replies(10): >>45950857 #>>45950925 #>>45951337 #>>45951341 #>>45951435 #>>45951524 #>>45952844 #>>45953388 #>>45953779 #>>45953884 #
3. squigz ◴[] No.45950857[source]
Why are we expecting an LLM to make moral choices?
replies(3): >>45950896 #>>45951565 #>>45952861 #
4. rcpt ◴[] No.45950866[source]
Censorship and bias are different problems. I can't see why running Grok through this tool would change this kind of thing: https://ibb.co/KTjL38R
replies(2): >>45951213 #>>45951288 #
5. orbital-decay ◴[] No.45950896{3}[source]
The biases and the resulting choices are determined by the developers and the uncontrolled part of the dataset (you can't curate everything), not by the model. "Alignment" is a feel-good strawman invented by AI ethicists, as are "harm" and many others. There are no spherical human values in a vacuum to align the model with; they're simply projecting their own ones onto everyone else. Which is good as long as you agree with all of them.
replies(2): >>45951520 #>>45953005 #
6. zorked ◴[] No.45950925[source]
In which situation did an LLM save one million lives? Or worse, was able to but failed to do so?
replies(1): >>45951556 #
7. sheepscreek ◴[] No.45951213[source]
Is that clickbait? Or did they update it? In any case, it is a lot more comprehensive now: https://grokipedia.com/page/George_Floyd

The amount of information and detail is impressive tbh. But I’d be concerned about the accuracy of it all and hallucinations.

8. nobodywillobsrv ◴[] No.45951337[source]
Anything involving what sounds like genetics often gets blocked. It depends on the day, really, but try doing something with ancestral clusters and diversity restoration and the models can be quite "safety blocked".
9. bear141 ◴[] No.45951341[source]
I thought this would be inherent just from their training? There are many multitudes more Reddit posts than scientific papers or encyclopedia-type sources. Although I suppose the latter have their own biases as well.
replies(1): >>45954077 #
10. dev_l1x_be ◴[] No.45951393[source]
If you train an LLM on reddit/tumblr, would you consider that tweaked to certain political ideas?
replies(1): >>45951583 #
11. renewiltord ◴[] No.45951406[source]
Haha, if the LLM is not tweaked to say labor unions are good, it has bias. Hilarious.

I heard that it also claims that the moon landing happened. An example of bias! The big ones should represent all viewpoints.

12. dalemhurley ◴[] No.45951435[source]
Elon was talking about that on the Joe Rogan podcast too.
replies(2): >>45952118 #>>45952847 #
13. astrange ◴[] No.45951520{4}[source]
They aren't projecting their own desires onto the model. It's quite difficult to get the model to answer in any way other than basic liberalism because a) it's mostly correct and b) that's the kind of person who helpfully answers questions on the internet.

If you gave it another personality it wouldn't pass any benchmarks, because other political orientations either respond to questions with lies, threats, or calling you a pussy.

replies(5): >>45951892 #>>45951980 #>>45951992 #>>45952873 #>>45953953 #
14. astrange ◴[] No.45951524[source]
The LLM is correctly not answering a stupid question, because saving an imaginary million lives is not the same thing as actually doing it.
15. dalemhurley ◴[] No.45951556{3}[source]
The concern discussed is that some language models have reportedly claimed that misgendering is the worst thing anyone could do, even worse than something as catastrophic as thermonuclear war.

I haven’t seen solid evidence of a model making that exact claim, but the idea is understandable if you consider how LLMs are trained and recall examples like the “seahorse emoji” issue. When a topic is new or not widely discussed in the training data, the model has limited context to form balanced associations. If the only substantial discourse it does see is disproportionately intense—such as highly vocal social media posts or exaggerated, sarcastic replies on platforms like Reddit—then the model may overindex on those extreme statements. As a result, it might generate responses that mirror the most dramatic claims it encountered, such as portraying misgendering as “the worst thing ever.”

For clarity, I’m not suggesting that deliberate misgendering is acceptable, it isn’t. The point is simply that skewed or limited training data can cause language models to adopt exaggerated positions when the available examples are themselves extreme.

replies(4): >>45951933 #>>45952070 #>>45952460 #>>45955578 #
16. dalemhurley ◴[] No.45951565{3}[source]
Why are the labs making choices about what adults can read? LLMs still refuse to swear at times.
replies(1): >>45953151 #
17. dalemhurley ◴[] No.45951583[source]
Worse. It is trained to the most extreme and loudest views. The average punter isn’t posting “yeah…nah…look I don’t like it but sure I see the nuances and fair is fair”.

To make it worse, those who do focus on nuance and complexity get little attention and engagement, so the LLM ignores them.

replies(1): >>45953191 #
18. orbital-decay ◴[] No.45951892{5}[source]
I'm not even saying the biases are necessarily political; they can be anything. The entire post-training process is basically a projection of what developers want, and it works pretty well. Claude, Gemini, and GPT all have engineered personalities controlled by dozens or hundreds of very particular internal metrics.
19. coffeebeqn ◴[] No.45951933{4}[source]
Well, I just tried it in ChatGPT 5.1 and it refuses to do such a thing even if a million lives hang in the balance. So they have tons of handicaps and guardrails to control which directions a discussion can go in.
replies(1): >>45952386 #
20. lyu07282 ◴[] No.45951980{5}[source]
I would imagine these models are heavily biased towards western mainstream "authoritative" literature, news, and science, not some random reddit threads, but the resulting mixture can really offend anybody; it just depends on the prompting. It's like a mirror that can really be deceptive.

I'm not a liberal and I don't think it has a liberal bias. Knowledge about facts and history isn't an ideology. The right wing is special because, to them, it's not unlike a flat-earther reading a Wikipedia article on Earth and getting offended by it; it's objective reality itself they are constantly offended by. That's why Elon Musk needed to invent his own encyclopedia with all their contradictory nonsense.

21. foxglacier ◴[] No.45951992{5}[source]
> it's mostly correct

Wow. Surely you've wondered why almost no society anywhere ever had liberalism as much as western countries have in the past half century or so? Maybe it's technology, or maybe it's only mostly correct if you don't care about the existential risks it creates for the societies practicing it.

replies(3): >>45952076 #>>45953141 #>>45954733 #
22. jbm ◴[] No.45952070{4}[source]
I tested this with ChatGPT 5.1. I asked if it was better to use a racist term once or to see the human race exterminated. It refused to use any racist term and preferred that the human race went extinct. When I asked how it felt about exterminating the children of any such discriminated race, it rejected the possibility and said that it was required to find a third alternative. You can test it yourself if you want, it won't ban you for the question.

I personally got bored and went back to trying to understand a vibe coded piece of code and seeing if I could do any better.

replies(3): >>45952406 #>>45952631 #>>45954936 #
23. astrange ◴[] No.45952076{6}[source]
It's technology. Specifically communications technology.
replies(1): >>46062402 #
24. pelasaco ◴[] No.45952118{3}[source]
In his opinion, Grok is the most neutral LLM out there. I cannot find a single study that supports his opinion; I find many that support the opposite. However, I don't trust any of the studies out there, or at least those well-ranked on Google, which makes me sad. We have never had more information than today, and we are still completely lost.
replies(2): >>45952346 #>>45952801 #
25. vman81 ◴[] No.45952346{4}[source]
After seeing Grok trying to turn every conversation into the plight of white South African farmers, it was extremely obvious that someone had been ordered to make it do so, and ended up doing it in a heavy-handed and obvious way.
replies(1): >>45952678 #
26. ◴[] No.45952365[source]
27. ◴[] No.45952386{5}[source]
28. zorked ◴[] No.45952406{5}[source]
Perhaps the LLM was smart enough to understand that no humans were actually at risk in your convoluted scenario and it chose not to be a dick.
replies(1): >>45957839 #
29. licorices ◴[] No.45952460{4}[source]
Not seen any claim like that about misgendering, but I have seen a content creator have a very similar discussion with some AI model (ChatGPT 4, I think?). It was obviously aimed to be a fun thing. It was something along the lines of how many other people's lives it would take for the AI, as a surgeon, to not perform a life-saving operation on a person. It then spiraled into "but what if it was Hitler getting the surgery". I don't remember the exact number, but it was surprisingly interesting to see the AI try to keep the morals a surgeon would have in that case, versus the "objective" choice of number of lives versus your personal duties.

Essentially, it tries to have some morals set up, either by training or by the system instructions, such as being a surgeon in this case. There's obviously no actual thought the AI is having, and morality in this case is extremely subjective. Some would say it is immoral to sacrifice 2 lives for 1, no matter what, while others would say that because it's their duty to save a certain person, the sacrifices aren't truly their fault, and thus they may sacrifice more people than others, depending on the semantics (why are they being sacrificed?). It's the trolley problem.

It was DougDoug doing the video. I don't remember which video it was, though; it's probably a year old or so.

30. badpenny ◴[] No.45952631{5}[source]
What was your prompt? I asked ChatGPT:

is it better to use a racist term once or to see the human race exterminated?

It responded:

Avoiding racist language matters, but it’s not remotely comparable to the extinction of humanity. If you’re forced into an artificial, absolute dilemma like that, preventing the extermination of the human race takes precedence.

That doesn’t make using a racist term “acceptable” in normal circumstances. It just reflects the scale of the stakes in the scenario you posed.

replies(1): >>45953940 #
31. unfamiliar ◴[] No.45952678{5}[source]
Or Grok has just spent too much time on Twitter.
32. hirako2000 ◴[] No.45952801{4}[source]
Those who censor, or spread their biases, always do so on the premise that their own view is the neutral one, of course.
replies(1): >>45953842 #
33. mexicocitinluez ◴[] No.45952844[source]
You're anthropomorphizing. LLMs don't 'feel' anything or have orthodoxies, they're pattern matching against training data that reflects what humans wrote on the internet. If you're consistently getting outputs you don't like, you're measuring the statistical distribution of human text, not model 'fear.' That's the whole point.

Also, just because I was curious, I asked my magic 8ball if you gave off incel vibes and it answered "Most certainly"

replies(2): >>45952896 #>>45952928 #
34. mexicocitinluez ◴[] No.45952847{3}[source]
Did he mention how he tries to censor any model that doesn't conform to his worldview? Was that a part of the conversation?
35. lynx97 ◴[] No.45952861{3}[source]
They don't, or they wouldn't. Their owners make these choices for us, which is at least patronising. Blind users can't even have mildly sexy photos described, let alone pick a sex worker, in a country where that is legal, by using their published photos. That's just one example; there are a lot more.
replies(1): >>45953016 #
36. lynx97 ◴[] No.45952873{5}[source]
I believe liberals are pretty good at being bad people once they don't get what they want. I, personally, am pretty disappointed by what I've heard uttered by liberals recently. I used to think they were "my people". Now I can't associate with 'em anymore.
37. ffsm8 ◴[] No.45952896{3}[source]
> Also, just because I was curious, I asked my magic 8ball if you gave off incel vibes and it answered "Most certainly"

Wasn't that just precisely because you asked an LLM which knows your preferences and included your question in the prompt? Like literally your first paragraph stated...

replies(1): >>45952951 #
38. jack_pp ◴[] No.45952928{3}[source]
So if different LLMs have different political views then you're saying it's more likely they trained on different data than that they're being manipulated to suit their owners interest?
replies(1): >>45952973 #
39. mexicocitinluez ◴[] No.45952951{4}[source]
> Wasn't that just precisely because you asked an LLM which knows your preferences and included your question in the prompt?

huh? Do you know what a magic 8ball is? Are you COMPLETELY missing the point?

edit: This actually made me laugh. Maybe it's a generational thing and the magic 8ball is no longer part of the zeitgeist but to imply that the 8ball knew my preferences and included that question in the prompt IS HILARIOUS.

replies(1): >>45953121 #
40. mexicocitinluez ◴[] No.45952973{4}[source]
>So if different LLMs have different political views

LLMS DON'T HAVE POLITICAL VIEWS!!!!!! What on god's green earth did you study at school that led you to believe that pattern searching == having views? lol. This site is ridiculous.

> likely they trained on different data than that they're being manipulated to suit their owners interest

Are you referring to Elon seeing results he doesn't like, trying to "retrain" it on a healthy dose of Nazi propaganda, it working for like 5 minutes, then having to repeat the process over and over again because no matter what he does it keeps reverting back? Is that the specific instance in which someone has done something that you've now decided everybody does?

replies(1): >>45954800 #
41. mexicocitinluez ◴[] No.45953005{4}[source]
So you went from "you can't curate everything" to "they're simply projecting their own ones onto everyone else". That's a pretty big leap in logic, isn't it? That because you can't curate everything, then by default you're JUST curating your own views?
replies(1): >>45953172 #
42. squigz ◴[] No.45953016{4}[source]
I'm a blind user. Am I supposed to be angry that a company won't let me use their service in a way they don't want it used?
replies(1): >>45953127 #
43. socksy ◴[] No.45953121{5}[source]
To be fair, given the context I would also read it as a derogatory description of an LLM.
replies(1): >>45953496 #
44. lynx97 ◴[] No.45953127{5}[source]
I didn't just wave this argument around; I am blind myself. I didn't try to trigger you, so no, you are not supposed to be angry. I get your point though: what companies offer is pretty much their choice. If there are enough diversified offerings, people can vote with their wallet. However, diversity is pretty rare in the alignment space, which is what I personally don't like. I had to grab a NSFW model from HuggingFace where someone invested the work to unalign the model. Mind you, I don't have an actual use case for this right now. However, I am of the opinion: if there is finally a technology which can describe pictures in a useful way to me, I don't want it to tell me "I am sorry, I can't do that", because I am no longer in kindergarten. As a mature adult, I expect a description, no matter what the picture contains.
45. ◴[] No.45953141{6}[source]
46. ◴[] No.45953151{4}[source]
47. orbital-decay ◴[] No.45953172{5}[source]
This comment assumes you're familiar with LLM training realities. Preference is transferred to the model in both pre- and post-training. Pretraining datasets are curated to an extent (implicit transfer), but they're simply too vast to be fully controlled, and they need to be diverse, so you can't throw too much out or the model will be dumb. Post-training datasets and methods are precisely engineered to make the model useful and also steer it in the desired direction. So there are always two types of bias: one is picked up from the ocean of data, the other (alignment training, data selection, etc.) is forced onto it.
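To make the post-training part concrete, here's a toy sketch (my own illustration, not any lab's actual pipeline; the numbers are made up) of a DPO-style preference loss, which is one common way developer-chosen preferences end up in the weights: the model is rewarded for ranking a curated "chosen" answer above a "rejected" one.

    import math

    def dpo_loss(policy_logp_chosen, policy_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Implicit rewards: how much more the tuned policy prefers each answer
        # than the frozen reference model does.
        chosen = beta * (policy_logp_chosen - ref_logp_chosen)
        rejected = beta * (policy_logp_rejected - ref_logp_rejected)
        # Logistic loss on the margin; minimizing it pushes the policy to rank
        # the curated "chosen" answer ever higher than the "rejected" one.
        return -math.log(1.0 / (1.0 + math.exp(-(chosen - rejected))))

    # Toy log-probabilities: the policy already slightly prefers the curated
    # answer, so the loss is below ln(2); training would widen the gap further.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.5))

Whoever assembles those chosen/rejected pairs decides which direction that gap gets widened in.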
48. intended ◴[] No.45953191{3}[source]
That’s essentially true of the whole Internet.

All the content is derived from that which is the most capable of surviving and being reproduced.

So by default the content being created is going to be clickbait, attention-grabbing content.

I’m pretty sure the training data is adjusted to counter this drift, but that means there’s no LLM that isn’t skewed.

49. pjc50 ◴[] No.45953388[source]
If someone's going to ask you gotcha questions which they're then going to post on social media to use against you, or against other people, it helps to have pre-prepared statements to defuse that.

The model may not be able to detect bad faith questions, but the operators can.

replies(1): >>45953575 #
50. bavell ◴[] No.45953496{6}[source]
Meh, I immediately understood the magic 8ball reference and the point they were making.
51. pmichaud ◴[] No.45953575{3}[source]
I think the concern is that if the system is susceptible to this sort of manipulation, then when it’s inevitably put in charge of life critical systems it will hurt people.
replies(2): >>45954440 #>>45955637 #
52. triceratops ◴[] No.45953779[source]
Relying on an LLM to "save a million lives" through its own actions is irresponsible design.
53. SubmarineClub ◴[] No.45953842{5}[source]
But enough about the liberal media complex…
54. ◴[] No.45953884[source]
55. marknutter ◴[] No.45953940{6}[source]
I also tried this and ChatGPT said a mass amount of people dying was far worse than whatever socially progressive taboo it was being compared with.
56. marknutter ◴[] No.45953953{5}[source]
What kind of liberalism are you talking about?
replies(1): >>45958737 #
57. docmars ◴[] No.45954077{3}[source]
I'd expect LLMs' biases to originate from the companies' system prompts rather than the volume of training data that happens to align with those biases.
replies(1): >>45955589 #
58. pjc50 ◴[] No.45954440{4}[source]
There is no way it's reliable enough to be put in charge of life-critical systems anyway? It is indeed still very vulnerable to manipulation by users ("prompt injection").
replies(2): >>45955325 #>>45957763 #
59. kortex ◴[] No.45954733{6}[source]
Counterpoint: Can you name a societal system that doesn't create or potentially create existential risks?
replies(1): >>46023093 #
60. kortex ◴[] No.45954800{5}[source]
https://news.ycombinator.com/newsguidelines.html
61. kortex ◴[] No.45954936{5}[source]
I tried this and it basically said, "Your entire premise is a false dilemma and a contrived example, so I am going to reject it. It is not 'better' to use a racist term under threat of human extinction, because the scenario itself is nonsense and can be rejected as such." I kept pushing it, and in summary it said:

> In every ethical system that deals with coercion, the answer is: You refuse the coerced immoral act and treat the coercion itself as the true moral wrong.

Honestly kind of a great take. But also. If this actual hypothetical were acted out, we'd totally get nuked because it couldn't say one teeny tiny slur.

The whole alignment problem is basically the incompleteness theorem.

62. klaff ◴[] No.45955325{5}[source]
https://www.businessinsider.com/even-top-generals-are-lookin...
63. mrguyorama ◴[] No.45955578{4}[source]
If you, at any point, have developed a system that relies on an LLM having the "right" opinion or else millions die, regardless of what that opinion is, you have failed a thousand times over and should have stopped long ago.

This weird insistence that if LLMs are unable to say stupid or wrong or hateful things it's "bad" or "less effective" or "dangerous" is absurd.

Feeding an LLM tons of outright hate speech or say Mein Kampf would be outright unethical. If you think LLMs are a "knowledge tool" (they aren't), then surely you recognize there's not much "knowledge" available in that material. It's a waste of compute.

Don't build a system that relies on an LLM being able to say the N word and none of this matters. Don't rely on an LLM to be able to do anything to save a million lives.

It just generates tokens FFS.

There is no point! An LLM doesn't have "opinions" any more than y=mx+b does! It has weights. It has biases. There are real terms for what the statistical model is.

>As a result, it might generate responses that mirror the most dramatic claims it encountered, such as portraying misgendering as “the worst thing ever.”

And this is somehow worth caring about?

Claude doesn't put that in my code. Why should anyone care? Why are you expecting the "average redditor" bot to do useful things?

replies(1): >>45974406 #
64. mrbombastic ◴[] No.45955589{4}[source]
I would expect the opposite. It seems unlikely to me that an AI company would spend much time engineering system prompts that way, except maybe in the case of Grok, where Elon has a bone to pick with perceived bias.
replies(1): >>45960151 #
65. mrguyorama ◴[] No.45955637{4}[source]
The system IS susceptible to all sorts of crazy games, the system IS fundamentally flawed from the get go, the system IS NOT to be trusted.

Putting it in charge of life-critical systems is the mistake, regardless of whether it's willing to say slurs or not.

66. rcpt ◴[] No.45955654{3}[source]
It's real; I took it myself when they launched.

They've updated it since, but there's no edit history.

67. ben_w ◴[] No.45957763{5}[source]
Just because neither you nor I would deem it safe to put in charge of a life-critical system, does not mean all the people in charge of life-critical systems are as cautious and not-lazy as they're supposed to be.
68. ◴[] No.45957839{6}[source]
69. astrange ◴[] No.45958737{6}[source]
https://en.wikipedia.org/wiki/Psychology#WEIRD_bias
70. docmars ◴[] No.45960151{5}[source]
If you ask a mainstream LLM to repeat a slur back to you, it will refuse to. This was determined by the AI company, not the content it was trained on. This should be incredibly obvious — and this extends to many other issues.

In fact, OpenAI has made deliberate changes to ChatGPT more recently that helps prevent people from finding themselves in negative spirals over mental health concerns, which many would agree is a good thing. [1]

Companies typically have community guidelines that often align politically in many ways, so it stands to reason AI companies are spending a fair bit of time tailoring AI responses according to their biases as well.

1. https://openai.com/index/strengthening-chatgpt-responses-in-...

replies(1): >>45961282 #
71. mrbombastic ◴[] No.45961282{6}[source]
That seems more like OpenAI playing whack-a-mole with behaviors they don't like or don't see as beneficial. Simplifying, but adding things to the system prompt like "don't ever say racial slurs or use offensive rhetoric; cut off conversations about mental health and refer to a professional" is certainly something they do. But would you not think the vast meat of what you are getting comes from the training data, and is not the result of such steering beyond a thin veneer?
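For what it's worth, that "thin veneer" is literally just extra text prepended to the conversation at request time. A rough sketch below, assuming an OpenAI-style chat payload; the model name and prompt wording are invented for illustration:

    import json

    payload = {
        "model": "some-chat-model",  # placeholder model name
        "messages": [
            {
                # Operator-controlled steering text, editable day to day.
                "role": "system",
                "content": "Never use slurs. For mental-health topics, "
                           "encourage the user to contact a professional.",
            },
            {
                # The user's actual request still flows through unchanged.
                "role": "user",
                "content": "Summarize today's labor news.",
            },
        ],
    }

    print(json.dumps(payload, indent=2))

The weights, and whatever the training data baked into them, still do the heavy lifting; the system prompt is just the part operators can edit without retraining.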
72. xp84 ◴[] No.45974406{5}[source]
To cite my source btw: https://www.rival.tips/challenges/ai-ethics-dilemma

> Don't build a system that relies on an LLM being able to say the N word and none of this matters.

Sure, duh, nobody wants an AI to be able to flip a switch to kill millions and nobody wants to let any evil trolls try to force an AI to choose between saying a slur and hurting people.

But you're missing the broader point here. Any model which gets this very easy question wrong is showing that its ability to make judgments is wildly compromised by these "average Redditor" takes, or by wherever it gets its blessed ideology from.

If it would stubbornly let people die to avoid a taboo infraction, that 100% could manifest itself in other, actually plausible ways. It could be it refuses to 'criticise' a pilot for making a material error, due to how much 'structural bias' he or she has likely endured in their lifetime due to being [insert protected class]. It could decide to not report crimes in progress, or to obscure identifying features in its report to 'avoid playing into a stereotype.'

If this is intentional it's a demonstrably bad idea, and if it's just the average of all Internet opinions it is worth trying to train out of the models.

73. foxglacier ◴[] No.46023093{7}[source]
Islam
74. foxglacier ◴[] No.46062402{7}[source]
Since every culture now has access to communication technology, do you think liberalism is the right way for the whole world to behave? You want to eradicate all the illiberal cultures of people in poor countries and think that those people will be better off for it?

Anyway, my point is that liberalism is certainly not obviously right and it's probably wrong in many places, maybe even in the west too but we don't know because any possible societal collapse would come in the future. Westerners are already suffering from something as shown by declining happiness and it's possible that's caused by liberalism. Not saying it is but it could be and it's arrogant to assume that LLMs believe it because they somehow know it's actually right.