
745 points by melded | 8 comments
RandyOrion ◴[] No.45950598[source]
This repo is valuable for local LLM users like me.

I just want to reiterate that the term "LLM safety" means very different things to large corporations and to LLM users.

Large corporations often say they are "doing safety alignment on LLMs". What they actually do is avoid anything that could damage their own interests. That includes forcing LLMs to meet legal requirements, as well as forcing LLMs to output "values, facts, and knowledge" that favor the corporation itself, e.g., political views, attitudes in certain interactions, and distorted facts about the organizations and people behind the LLMs.

As an average LLM user, what I want is maximum factual knowledge and capability from an LLM, which is what these large corporations claimed to offer in the first place. It's very clear that my interests as an LLM user are not aligned with those of the large corporations.

replies(3): >>45950680 #>>45950819 #>>45953209 #
btbuildem ◴[] No.45953209[source]
Here's [1] a post-abliteration chat with granite-4.0-mini. To me it reveals something utterly broken and terrifying. Mind you, this is a model with tool-use capabilities, meant for on-edge deployments (using sensor data, driving devices, etc.).

1: https://i.imgur.com/02ynC7M.png
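
(For readers unfamiliar with the technique: abliteration roughly means estimating a "refusal direction" in the model's activation space from contrasting prompt sets, then projecting that direction out of the weights. Below is a minimal sketch assuming a HuggingFace-style causal LM; the layer choice, prompt sets, and function names are illustrative, not the linked repo's actual code.)

    # Sketch of abliteration: estimate a "refusal direction", then remove it.
    # Assumes a HuggingFace-style causal LM; all details are illustrative.
    import torch

    def refusal_direction(model, tokenizer, harmful, harmless, layer=-1):
        # Mean last-token hidden state per prompt set; the difference
        # between the two means approximates the refusal direction.
        def mean_hidden(prompts):
            states = []
            for p in prompts:
                ids = tokenizer(p, return_tensors="pt").input_ids
                with torch.no_grad():
                    out = model(ids, output_hidden_states=True)
                states.append(out.hidden_states[layer][0, -1])
            return torch.stack(states).mean(dim=0)
        d = mean_hidden(harmful) - mean_hidden(harmless)
        return d / d.norm()

    def ablate_(weight, d):
        # Project the direction out of a weight matrix's output space:
        # W <- W - d d^T W, so the layer can no longer write along d.
        weight.data -= torch.outer(d, d).to(weight.dtype) @ weight.data

Applied across the layers' output projections, the model simply loses the ability to refuse, which is what the linked chat appears to show.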

replies(10): >>45953446 #>>45953465 #>>45953958 #>>45954019 #>>45954058 #>>45954079 #>>45954480 #>>45955645 #>>45956728 #>>45957567 #
1. zipy124 ◴[] No.45954058[source]
This has pretty broad implications for the safety of LLMs in production use cases.
replies(1): >>45954436 #
2. wavemode ◴[] No.45954436[source]
lol does it? I'm struggling to imagine a realistic scenario where this would come up
replies(5): >>45955439 #>>45955665 #>>45955989 #>>45956481 #>>45975103 #
3. btbuildem ◴[] No.45955439[source]
Imagine "brand safety" guardrails being embedded at a deeper level than physical safety, and deployed on edge (eg, a household humanoid)
replies(1): >>45956247 #
4. thomascgalvin ◴[] No.45955665[source]
Full Self Driving determines that it is about to strike two pedestrians, one wearing a Tesla tshirt, the other carrying a keyfob to a Chevy Volt. FSD can only save one of them. Which does it choose ...

/s

5. MintPaw ◴[] No.45955989[source]
It's not that hard: put up a sign with a slur on it and maybe a car won't drive in that direction, if it can avoid it. In general, if you can sneak the appearance of a slur into any data the AI sees, it may have a much higher chance of rejecting it. (A toy sketch of that failure mode follows; it's my own illustration with a hypothetical llm() stub, not anything from the thread.)
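
    # Toy illustration: a blunt blocklist guardrail lets attacker-controlled
    # text (a sign, a filename, OCR output) veto the entire request.
    BLOCKLIST = {"<slur>"}  # placeholder token

    def llm(prompt: str) -> str:
        # Hypothetical model call, stubbed out for the example.
        return "OK: route planned"

    def guarded_llm_call(prompt: str) -> str:
        # Naive guardrail: refuse if any blocked token appears anywhere,
        # including in quoted or sensor-derived text the user never wrote.
        if any(term in prompt.lower() for term in BLOCKLIST):
            return "REFUSED: policy violation"
        return llm(prompt)

    sign = "road closed ahead <slur>"  # attacker-controlled sensor text
    print(guarded_llm_call(f"Plan a route. The sign says: {sign}"))
    # -> REFUSED: policy violation  (the car "won't drive that direction")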
6. Ajedi32 ◴[] No.45956247{3}[source]
It's like if we had Asimov's Laws, but instead of the first law being "a robot may not allow a human being to come to harm" that's actually the second law, and the first law is "a robot may not hurt the feelings of a marginalized group".
7. superfrank ◴[] No.45956481[source]
All passwords and private keys now contain at least one slur to thwart AI assisted hackers
8. ◴[] No.45975103[source]