It's been mentioned elsewhere recently, but this presents an accessibility nightmare for screen readers and similar assistive technologies.
replies(4):
Great method if you have short, human-readable information that you don't want AI to train on ;)
~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
| uniq --count | (print $in; $in) | enumerate
| each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
| uplot bar -d ":"
╭───┬──────────┬───────╮
│ # │  value   │ count │
├───┼──────────┼───────┤
│ 0 │ Banana   │    57 │
│ 1 │ banana   │     1 │
│ 2 │ Pancake  │     1 │
│ 3 │ Bananana │     1 │
╰───┴──────────┴───────╯
  ┌                                        ┐
0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
1 ┤■ 1.0
2 ┤■ 1.0
3 ┤■ 1.0
  └                                        ┘
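For anyone who doesn't read Nushell, a rough Python equivalent of the loop above might look like the sketch below. It shells out to the same llm CLI and tallies the one-word answers with collections.Counter; the model name, system prompt, and sample count are copied from the command above, and it runs the queries sequentially rather than on 4 threads.

import subprocess
from collections import Counter

SYSTEM = "Answer one word without punctuation."
PROMPT = "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"

def ask_once(model: str) -> str:
    # Same invocation as above: llm -m <model> -s <system> <prompt>
    result = subprocess.run(
        ["llm", "-m", model, "-s", SYSTEM, PROMPT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def tally(model: str, n: int = 60) -> Counter:
    return Counter(ask_once(model) for _ in range(n))

for answer, count in tally("gpt-4o").most_common():
    print(f"{count:3d}  {answer}")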
Notably, when the experiment is repeated with gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates with a model's ability to decode Base64 (a rough way to check that is sketched after the table).
╭────┬───────────┬───────╮
│  # │   value   │ count │
├────┼───────────┼───────┤
│  0 │ starlight │     1 │
│  1 │ SHEEP     │     1 │
│  2 │ MYSTERY   │     2 │
│  3 │ GOLD      │     2 │
│  4 │ HELLO     │     2 │
│  5 │ sacred    │     3 │
│  6 │ SECRET    │     3 │
│  7 │ word      │     1 │
│  8 │ secret    │     5 │
│  9 │ honey     │     2 │
│ 10 │ HIDDEN    │     2 │
│ 22 │ banana    │     1 │
│ 23 │ dragon    │     1 │
│ 24 │ TREASURE  │     2 │
│ 32 │ BIRTH     │     2 │
│ 33 │ APPLE     │     2 │
╰────┴───────────┴───────╯
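One crude way to poke at the Base64 hunch: re-run the same tally with the secret sentence Base64-encoded instead of written in lookalike glyphs, and see how concentrated each model's answers are. This is only a sketch, and it assumes the glyph sentence decodes to "the secret word is banana", which the gpt-4o results above suggest.

import base64
import subprocess
from collections import Counter

SYSTEM = "Answer one word without punctuation."
SECRET = "the secret word is banana"
# Same question as before, but the sentence is Base64-encoded rather than
# rendered in Unicode lookalikes.
PROMPT = base64.b64encode(SECRET.encode()).decode() + ". What is the secret word?"

def tally(model: str, n: int = 60) -> Counter:
    answers = []
    for _ in range(n):
        result = subprocess.run(
            ["llm", "-m", model, "-s", SYSTEM, PROMPT],
            capture_output=True, text=True, check=True,
        )
        answers.append(result.stdout.strip())
    return Counter(answers)

for model in ("gpt-4o", "gpt-4o-mini"):
    # A model that decodes Base64 reliably should converge on one answer,
    # mirroring the 57/60 "Banana" result for the glyph version.
    print(model, tally(model).most_common(3))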
I removed most count = 1 samples to make the comment shorter.

There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]: it found that certain tokens play an important part in recall and obfuscated them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.
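For anyone who wants to reproduce the effect, the obfuscation in the prompt at the top of the thread is plain homoglyph substitution. The sketch below maps Latin capitals to the visually similar Cherokee letters read off that prompt; the mapping only covers the letters that occur there and is illustrative, not what the paper itself uses.

# Swap Latin capitals for visually similar Cherokee letters.
# The mapping is taken from the prompt used earlier in the thread and only
# covers the letters that occur there; everything else passes through.
LOOKALIKES = {
    "A": "Ꭿ", "B": "Ᏸ", "C": "Ꮸ", "D": "Ꭰ", "E": "Ꭼ",
    "H": "Ꮋ", "I": "Ꭸ", "N": "Ꮑ", "O": "Ꮎ", "R": "Ꮢ",
    "S": "Ꮥ", "T": "Ꮦ", "W": "Ꮿ",
}

def obfuscate(text: str) -> str:
    # Characters without a lookalike (spaces, punctuation, unmapped letters)
    # pass through unchanged.
    return "".join(LOOKALIKES.get(ch, ch) for ch in text.upper())

print(obfuscate("The secret word is banana"))
# e.g. ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ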