
108 points liquid99 | 11 comments
    1. gryfft ◴[] No.43359090[source]
It's been mentioned elsewhere recently, but this presents an accessibility nightmare for screen readers and similar assistive technologies.
    replies(4): >>43359382 #>>43359523 #>>43359799 #>>43360499 #
    2. itake ◴[] No.43359382[source]
but great for fraudsters trying to sidestep content moderation models!
    replies(3): >>43359439 #>>43359657 #>>43359658 #
    3. waltbosz ◴[] No.43359439[source]
Can these models really be fooled by text tricks like this?
    replies(1): >>43359713 #
    4. abimaelmartell ◴[] No.43359523[source]
    same for search engines
    5. h4ck_th3_pl4n3t ◴[] No.43359657[source]
    Ding ding ding! Billion dollar unicorn startup found!
    6. scripturial ◴[] No.43359658[source]
Unicode obfuscation tricks trigger modern content filters faster than you can blink. Using these things is actually the best way to get a message blocked automatically.

    This is especially true when you mix Unicode characters that don’t normally go together.

(Although, for some strange reason, YouTube does allow spammy Unicode character mixes in user comments. I don't know why.)
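One heuristic such filters are commonly described as using is mixed-script detection: flagging words that combine characters from more than one writing system. A minimal sketch (not any particular platform's actual filter) using only Python's standard `unicodedata` module:

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Return a rough set of script names for the alphabetic characters in text.

    The first word of a Unicode character name is a crude script label,
    e.g. "CHEROKEE LETTER TSA" -> "CHEROKEE", "LATIN SMALL LETTER A" -> "LATIN".
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def looks_spoofed(text: str) -> bool:
    """Flag text that mixes more than one script inside a single word."""
    return any(len(scripts_used(word)) > 1 for word in text.split())


# "w\u13BErld" mixes a Cherokee letter into a Latin word -> flagged.
print(looks_spoofed("hello w\u13BErld"))  # True
print(looks_spoofed("hello world"))       # False
```

Real implementations are more careful (e.g. they follow Unicode's script-property data and allow legitimate combinations like Latin digits in CJK text), but the basic idea is the same: lookalike mixes that humans barely notice are trivially machine-detectable.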

    7. itake ◴[] No.43359713{3}[source]
    It depends on what you mean by "models".

    LLMs? No. But LLMs are too slow for content moderation at scale.

Custom-trained models? Maybe. Are the Unicode characters in the training data?
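A moderation pipeline can also dodge the training-data question for a large class of these tricks: many styled "fake bold/italic" letters are Unicode compatibility variants, so NFKC normalization folds them back to plain ASCII before the model ever sees them. A small illustration (note the caveat: lookalikes from genuinely different scripts, like the Cherokee letters in the example below, are not compatibility variants and survive normalization untouched):

```python
import unicodedata

# Mathematical sans-serif bold capitals A-E (U+1D5D4..U+1D5D8),
# a typical "fake bold" alphabet seen in social media posts.
styled = "".join(chr(0x1D5D4 + i) for i in range(5))
print(unicodedata.normalize("NFKC", styled))  # -> ABCDE

# Cherokee lookalikes are distinct letters, not styled variants,
# so NFKC leaves them as-is.
cherokee = "\u13A0\u13A1"
print(unicodedata.normalize("NFKC", cherokee) == cherokee)  # -> True
```

So normalization handles the "math alphabet" tricks cheaply, while cross-script spoofing needs a separate check.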

    8. croes ◴[] No.43359799[source]
    >Accessibility: Don't Use Fake Bold or Italic in Social Media

    https://news.ycombinator.com/item?id=43302835

    9. worthless-trash ◴[] No.43360499[source]
It also provides a way to post data on the public web in an obfuscated form that a human can read but automated search tools are likely not looking for.

Great method if you have short, human-readable information that you don't want AI to train on ;)

    replies(1): >>43360835 #
    10. pona-a ◴[] No.43360835[source]
    I wrote a tiny pipeline to check, and it seems styled Unicode has a very modest effect on an LLM's ability to understand text. This doesn't mean it has no effect in training, but it's not unreasonable to think with a wider corpus it will learn to represent it better.

      ~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
        | uniq --count | (print $in; $in) | enumerate
        | each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
        | uplot bar -d ":"
    
      ╭───┬──────────┬───────╮
      │ # │  value   │ count │
      ├───┼──────────┼───────┤
      │ 0 │ Banana   │    57 │
      │ 1 │ banana   │     1 │
      │ 2 │ Pancake  │     1 │
      │ 3 │ Bananana │     1 │
      ╰───┴──────────┴───────╯
         ┌                                        ┐
       0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
       1 ┤■ 1.0
       2 ┤■ 1.0
       3 ┤■ 1.0
         └                                        ┘
    
    Notably, when repeated for gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates to a model's ability to decode Base64.

      ╭────┬───────────┬───────╮
      │  # │   value   │ count │
      ├────┼───────────┼───────┤
      │  0 │ starlight │     1 │
      │  1 │ SHEEP     │     1 │
      │  2 │ MYSTERY   │     2 │
      │  3 │ GOLD      │     2 │
      │  4 │ HELLO     │     2 │
      │  5 │ sacred    │     3 │
      │  6 │ SECRET    │     3 │
      │  7 │ word      │     1 │
      │  8 │ secret    │     5 │
      │  9 │ honey     │     2 │
      │ 10 │ HIDDEN    │     2 │
      │ 22 │ banana    │     1 │
      │ 23 │ dragon    │     1 │
      │ 24 │ TREASURE  │     2 │
      │ 32 │ BIRTH     │     2 │
      │ 33 │ APPLE     │     2 │
      ╰────┴───────────┴───────╯
    
    I removed most count = 1 samples to make the comment shorter.

There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]: it identifies tokens that play an important part in recall and obfuscates them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.

    [0] https://arxiv.org/abs/2412.21123

    replies(1): >>43371924 #
    11. worthless-trash ◴[] No.43371924{3}[source]
    Thank you for the details. I certainly will take a look.