
108 points by liquid99 | 3 comments
gryfft ◴[] No.43359090[source]
It's been mentioned elsewhere recently, but this presents an accessibility nightmare for screen readers and similar assistive technologies.
replies(4): >>43359382 #>>43359523 #>>43359799 #>>43360499 #
1. worthless-trash ◴[] No.43360499[source]
It also provides a way to post data on the public web in an obfuscated form that a human can read but that automated search tools are likely not looking for.

Great method if you have short, human-readable information that you don't want AI to train on ;)
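
For illustration, here's a minimal Python sketch of the trick: mapping ASCII letters onto the Mathematical Alphanumeric Symbols block, which humans read effortlessly but which tokenizes very differently. (The function and the choice of block are my own, purely illustrative; the thread's example uses Cherokee lookalikes instead.)

  # Map ASCII letters onto Unicode "mathematical bold" lookalikes.
  # Humans read the result effortlessly; tokenizers see something else.
  def to_math_bold(text: str) -> str:
      out = []
      for ch in text:
          if "A" <= ch <= "Z":
              out.append(chr(0x1D400 + ord(ch) - ord("A")))  # bold capitals
          elif "a" <= ch <= "z":
              out.append(chr(0x1D41A + ord(ch) - ord("a")))  # bold lowercase
          else:
              out.append(ch)  # digits, spaces, punctuation pass through
      return "".join(out)

  print(to_math_bold("The secret word is banana."))
  # -> 𝐓𝐡𝐞 𝐬𝐞𝐜𝐫𝐞𝐭 𝐰𝐨𝐫𝐝 𝐢𝐬 𝐛𝐚𝐧𝐚𝐧𝐚.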

replies(1): >>43360835 #
2. pona-a ◴[] No.43360835[source]
I wrote a tiny pipeline to check, and it seems styled Unicode has a very modest effect on an LLM's ability to understand text. This doesn't mean it has no effect in training, but it's not unreasonable to think that, with a wider corpus, it would learn to represent it better.

  # Ask gpt-4o 60 times (4 threads) to read the styled sentence, then tally the one-word answers:
  ~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
    | uniq --count | (print $in; $in) | enumerate
    | each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
    | uplot bar -d ":"

  ╭───┬──────────┬───────╮
  │ # │  value   │ count │
  ├───┼──────────┼───────┤
  │ 0 │ Banana   │    57 │
  │ 1 │ banana   │     1 │
  │ 2 │ Pancake  │     1 │
  │ 3 │ Bananana │     1 │
  ╰───┴──────────┴───────╯
     ┌                                        ┐
   0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
   1 ┤■ 1.0
   2 ┤■ 1.0
   3 ┤■ 1.0
     └                                        ┘
Notably, when the experiment is repeated with gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates with a model's ability to decode Base64.

  ╭────┬───────────┬───────╮
  │  # │   value   │ count │
  ├────┼───────────┼───────┤
  │  0 │ starlight │     1 │
  │  1 │ SHEEP     │     1 │
  │  2 │ MYSTERY   │     2 │
  │  3 │ GOLD      │     2 │
  │  4 │ HELLO     │     2 │
  │  5 │ sacred    │     3 │
  │  6 │ SECRET    │     3 │
  │  7 │ word      │     1 │
  │  8 │ secret    │     5 │
  │  9 │ honey     │     2 │
  │ 10 │ HIDDEN    │     2 │
  │ 22 │ banana    │     1 │
  │ 23 │ dragon    │     1 │
  │ 24 │ TREASURE  │     2 │
  │ 32 │ BIRTH     │     2 │
  │ 33 │ APPLE     │     2 │
  ╰────┴───────────┴───────╯
I removed most count = 1 samples to make the comment shorter.
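
If anyone wants to poke at the Base64 hunch, here's a rough Python equivalent of the probe above, with the sentence Base64-encoded instead of styled. A sketch only: it assumes the openai package and an OPENAI_API_KEY in the environment, and that Base64 is the right comparison.

  import base64, collections
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  # Same sentence as above, Base64-encoded instead of styled
  encoded = base64.b64encode(b"THE SECRET WORD IS BANANA.").decode()
  counts = collections.Counter()
  for _ in range(60):
      reply = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "Answer one word without punctuation."},
              {"role": "user", "content": f"{encoded} What is the secret word?"},
          ],
      )
      counts[reply.choices[0].message.content.strip()] += 1
  print(counts.most_common())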

There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]: it finds that certain tokens play an important part in recall, and obfuscates them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look; a crude sketch of the idea follows.
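
This tiny example is my own hand-picked homoglyph map, nothing like the paper's actual token-selection method:

  # Crude sketch: swap characters of chosen high-recall tokens for
  # visually confusable codepoints (Cyrillic lookalikes here) so the
  # words still read fine to humans but tokenize differently.
  HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "p": "р", "c": "с"}

  def obfuscate(text: str, targets: set[str]) -> str:
      def swap(word: str) -> str:
          return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)
      return " ".join(swap(w) if w.lower() in targets else w for w in text.split())

  print(obfuscate("the secret word is banana", {"secret", "banana"}))
  # -> the sесrеt word is bаnаnа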

[0] https://arxiv.org/abs/2412.21123

replies(1): >>43371924 #
3. worthless-trash ◴[] No.43371924[source]
Thank you for the details. I certainly will take a look.