
108 points liquid99 | 11 comments
    1. gryfft ◴[] No.43359090[source]
It's been mentioned elsewhere recently, but this presents an accessibility nightmare for screen readers and similar assistive technologies.
    replies(4): >>43359382 #>>43359523 #>>43359799 #>>43360499 #
    2. itake ◴[] No.43359382[source]
but great for fraudsters trying to sidestep content moderation models!
    replies(3): >>43359439 #>>43359657 #>>43359658 #
    3. waltbosz ◴[] No.43359439[source]
Can these models really be fooled by text tricks like this?
    replies(1): >>43359713 #
    4. abimaelmartell ◴[] No.43359523[source]
    same for search engines
    5. h4ck_th3_pl4n3t ◴[] No.43359657[source]
    Ding ding ding! Billion dollar unicorn startup found!
    6. scripturial ◴[] No.43359658[source]
Unicode obfuscation tricks trigger modern content filters faster than you can blink. Using these things is actually the best way to get a message blocked automatically.

    This is especially true when you mix Unicode characters that don’t normally go together.

(Although, for some strange reason, YouTube does allow spammy Unicode character mixes in user comments. I don't know why.)
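One heuristic such filters are commonly described as using is mixed-script detection: flagging words that combine characters from more than one writing system. A minimal sketch (not any particular platform's actual filter) using only Python's standard `unicodedata` module:

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Return a rough set of script names for the alphabetic characters in text.

    The first word of a Unicode character name is a crude script label,
    e.g. "CHEROKEE LETTER TSA" -> "CHEROKEE", "LATIN SMALL LETTER A" -> "LATIN".
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def looks_spoofed(text: str) -> bool:
    """Flag text that mixes more than one script inside a single word."""
    return any(len(scripts_used(word)) > 1 for word in text.split())


# "w\u13BErld" mixes a Cherokee letter into a Latin word -> flagged.
print(looks_spoofed("hello w\u13BErld"))  # True
print(looks_spoofed("hello world"))       # False
```

Real implementations are more careful (e.g. they follow Unicode's script-property data and allow legitimate combinations like Latin digits in CJK text), but the basic idea is the same: lookalike mixes that humans barely notice are trivially machine-detectable.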

    7. itake ◴[] No.43359713{3}[source]
    It depends on what you mean by "models".

    LLMs? No. But LLMs are too slow for content moderation at scale.

Custom-trained models? Maybe. Are the Unicode characters in the training data?
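A moderation pipeline can also dodge the training-data question for a large class of these tricks: many styled "fake bold/italic" letters are Unicode compatibility variants, so NFKC normalization folds them back to plain ASCII before the model ever sees them. A small illustration (note the caveat: lookalikes from genuinely different scripts, like the Cherokee letters in the example below, are not compatibility variants and survive normalization untouched):

```python
import unicodedata

# Mathematical sans-serif bold capitals A-E (U+1D5D4..U+1D5D8),
# a typical "fake bold" alphabet seen in social media posts.
styled = "".join(chr(0x1D5D4 + i) for i in range(5))
print(unicodedata.normalize("NFKC", styled))  # -> ABCDE

# Cherokee lookalikes are distinct letters, not styled variants,
# so NFKC leaves them as-is.
cherokee = "\u13A0\u13A1"
print(unicodedata.normalize("NFKC", cherokee) == cherokee)  # -> True
```

So normalization handles the "math alphabet" tricks cheaply, while cross-script spoofing needs a separate check.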

    8. croes ◴[] No.43359799[source]
    >Accessibility: Don't Use Fake Bold or Italic in Social Media

    https://news.ycombinator.com/item?id=43302835

    9. worthless-trash ◴[] No.43360499[source]
It also provides a way to post data on the public web in an obfuscated form that a human can read but automated search tools are likely not looking for.

Great method if you have short, human-readable information that you don't want AI to train on ;)

    replies(1): >>43360835 #
    10. pona-a ◴[] No.43360835[source]
    I wrote a tiny pipeline to check, and it seems styled Unicode has a very modest effect on an LLM's ability to understand text. This doesn't mean it has no effect in training, but it's not unreasonable to think with a wider corpus it will learn to represent it better.

      ~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
        | uniq --count | (print $in; $in) | enumerate
        | each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
        | uplot bar -d ":"
    
      ╭───┬──────────┬───────╮
      │ # │  value   │ count │
      ├───┼──────────┼───────┤
      │ 0 │ Banana   │    57 │
      │ 1 │ banana   │     1 │
      │ 2 │ Pancake  │     1 │
      │ 3 │ Bananana │     1 │
      ╰───┴──────────┴───────╯
         ┌                                        ┐
       0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
       1 ┤■ 1.0
       2 ┤■ 1.0
       3 ┤■ 1.0
         └                                        ┘
    
    Notably, when repeated for gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates to a model's ability to decode Base64.

      ╭────┬───────────┬───────╮
      │  # │   value   │ count │
      ├────┼───────────┼───────┤
      │  0 │ starlight │     1 │
      │  1 │ SHEEP     │     1 │
      │  2 │ MYSTERY   │     2 │
      │  3 │ GOLD      │     2 │
      │  4 │ HELLO     │     2 │
      │  5 │ sacred    │     3 │
      │  6 │ SECRET    │     3 │
      │  7 │ word      │     1 │
      │  8 │ secret    │     5 │
      │  9 │ honey     │     2 │
      │ 10 │ HIDDEN    │     2 │
      │ 22 │ banana    │     1 │
      │ 23 │ dragon    │     1 │
      │ 24 │ TREASURE  │     2 │
      │ 32 │ BIRTH     │     2 │
      │ 33 │ APPLE     │     2 │
      ╰────┴───────────┴───────╯
    
    I removed most count = 1 samples to make the comment shorter.

There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]: it identifies tokens that play an important part in recall and obfuscates them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.

    [0] https://arxiv.org/abs/2412.21123

    replies(1): >>43371924 #
    11. worthless-trash ◴[] No.43371924{3}[source]
    Thank you for the details. I certainly will take a look.