Show HN: A website that makes your text look cool anywhere online using Unicode

Unicode obfuscation tricks trigger modern content filters faster than you can blink. Using them is actually the best way to get a message blocked automatically.

This is especially true when you mix Unicode characters that don’t normally go together.

(Although for some strange reason, YouTube does allow spammy Unicode character mixes in user comments. I don’t know why)
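A rough sketch of the kind of mixed-script check a filter could run, in Python. This is an illustration only, not how any real platform does it: the standard library has no Unicode Script property, so the hypothetical `script_hints` helper approximates it with the first word of each character's Unicode name.

```python
import unicodedata

def script_hints(text: str) -> set[str]:
    """Collect the first word of each letter's Unicode name (e.g. LATIN, CYRILLIC).

    Assumption: the first word of the name is a good-enough stand-in for the
    script. A real filter would use the Unicode Script property instead.
    """
    hints = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split()[0])
    return hints

def looks_mixed(text: str) -> bool:
    """Flag text whose letters appear to come from more than one script."""
    return len(script_hints(text)) > 1

# "paypal" with a Cyrillic "а" (U+0430) in second position
print(script_hints("p\u0430ypal"))
print(looks_mixed("p\u0430ypal"), looks_mixed("paypal"))
```

Legitimately multilingual text would trip this check too, so it could only ever be one signal among several.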

13. itake ◴[14 Mar 25 04:58 UTC] No.43359713{4}[source]▶

>>43359439 #

It depends on what you mean by "models".

LLMs? No. But LLMs are too slow for content moderation at scale.

Custom trained models? Maybe. Are the Unicode characters in the training data?

14. jp1016 ◴[14 Mar 25 04:58 UTC] No.43359718[source]▶

>>43358999 (OP) #

Reminds me of old Orkut profiles, which had a lot of these funky fonts.

replies(1): >>43360410 #

15. computator ◴[14 Mar 25 05:05 UTC] No.43359743[source]▶

>>43358999 (OP) #

Just wanted to point out something that not everyone might realize:

Unicode is not supposed to have fonts at all. Unicode defines characters that you can then represent in various fonts. It just so happens that Unicode has many characters that happen to look like the letter "C" (as an example): © for copyright, ℂ for complex numbers (formally called Double-Struck Capital C), etc. The author uses these many variations as a fun way to make "fonts".

replies(2): >>43360060 #>>43360250 #

16. croes ◴[14 Mar 25 05:19 UTC] No.43359799[source]▶

>>43359090 #

>Accessibility: Don't Use Fake Bold or Italic in Social Media

https://news.ycombinator.com/item?id=43302835

17. croes ◴[14 Mar 25 05:21 UTC] No.43359805[source]▶

>>43359397 #

Unless you need a screenreader

>Accessibility: Don't Use Fake Bold or Italic in Social Media

https://news.ycombinator.com/item?id=43302835

18. cvladan ◴[14 Mar 25 05:22 UTC] No.43359809[source]▶

>>43358999 (OP) #

Isn't there a gazillion of the same tools for "Discord fonts"? What am I missing?

19. necovek ◴[14 Mar 25 05:25 UTC] No.43359826[source]▶

>>43358999 (OP) #

This is limited to Latin script lookalikes. Try another script (eg Cyrillic), and it's got nothing.

It'd be great if they used the "look-alike" mapping both ways.

20. d1sxeyes ◴[14 Mar 25 05:58 UTC] No.43359989[source]▶

>>43358999 (OP) #

Last chance to use this before MSN’s spiritual successor gets shuttered in a few weeks.

21. gblargg ◴[14 Mar 25 06:09 UTC] No.43360031[source]▶

>>43358999 (OP) #

And unsearchable, perhaps a bonus.

22. usr1106 ◴[14 Mar 25 06:16 UTC] No.43360053[source]▶

>>43358999 (OP) #

On my phone (niche software) several fonts don't get rendered.

replies(1): >>43360248 #

23. usr1106 ◴[14 Mar 25 06:18 UTC] No.43360060[source]▶

>>43359743 #

But Unicode is such a historically grown monster that it violates its own rules in many places.

replies(1): >>43360376 #

24. pfoof ◴[14 Mar 25 06:58 UTC] No.43360233[source]▶

>>43358999 (OP) #

This is the easiest way to filter spam, bots, and people that never bring anything valuable to the discussion. It also applies to bio.

replies(2): >>43360243 #>>43363216 #

25. ◴[14 Mar 25 07:00 UTC] No.43360243[source]▶

>>43360233 #

26. 4ggr0 ◴[14 Mar 25 07:01 UTC] No.43360248[source]▶

>>43360053 #

Are you using a niche OS or is it just an app which doesn't like them?

replies(1): >>43375185 #

27. japanuspus ◴[14 Mar 25 07:02 UTC] No.43360250[source]▶

>>43359743 #

If you want to dive into the details, you can copy the "fonted" output to a unicode analyzer. [0] is an online unicode analyzer that seems to work well.

[0]: https://devina.io/unicode-analyser

replies(1): >>43360381 #

28. pwdisswordfishz ◴[14 Mar 25 07:09 UTC] No.43360273[source]▶

>>43358999 (OP) #

Show HN: a tool to misuse Unicode and break compatibility with resource-constrained devices for the sake of useless fanaberie

replies(1): >>43360328 #

29. theden ◴[14 Mar 25 07:11 UTC] No.43360281[source]▶

>>43358999 (OP) #

going to use this for my bank account password

30. SapporoChris ◴[14 Mar 25 07:13 UTC] No.43360292[source]▶

>>43358999 (OP) #

Presentation has its place, but writing what deserves to be read is far more important.

31. ◴[14 Mar 25 07:16 UTC] No.43360305[source]▶

>>43358999 (OP) #

32. notpushkin ◴[14 Mar 25 07:19 UTC] No.43360319[source]▶

>>43358999 (OP) #

Like others have already said, it’s an accessibility nightmare. On the other hand, it’s not like this is going away anytime soon – maybe screenreaders could learn to understand and read some such “fonts” (e.g. bold/italic at least)?

replies(3): >>43360462 #>>43360722 #>>43360966 #

33. nomilk ◴[14 Mar 25 07:22 UTC] No.43360328[source]▶

>>43360273 #

Feel like I should be able to explain this, but I can't. What's the downside of using Unicode? I note some webpages have UTF-8 in the head. Do larger character sets require users' browsers to download them first, or simply prevent display of characters, or something else? If bandwidth is the problem, how large are the files (i.e., how delayed will the site load be)? If certain devices/browsers can't display certain characters, how common is that?

replies(1): >>43368727 #

34. lerp-io ◴[14 Mar 25 07:23 UTC] No.43360331[source]▶

>>43358999 (OP) #

TᕼIᔕ Iᔕ ᒪIKE ᖴᖇOᗰ 2010 ᒪOᒪ

35. lifthrasiir ◴[14 Mar 25 07:35 UTC] No.43360376{3}[source]▶

>>43360060 #

Is it? Even emoji---one of the most controversial additions ever---was fully justified for its possible accessibility issue when it was introduced in Unicode.

36. antonhag ◴[14 Mar 25 07:35 UTC] No.43360381{3}[source]▶

>>43360250 #

I often reach for jq to understand what unicode is in a string, e.g.:

  [wl-paste|xclip -o|pbpaste] | jq -R --ascii-output

It doesn't provide any per-character explanation, but it is local and I already have jq installed.
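If a per-character explanation is wanted locally, a few lines of Python with the standard `unicodedata` module can stand in for an online analyzer. This is a minimal sketch; `describe` is a made-up helper name, not part of any tool mentioned here.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """Return one 'U+XXXX CATEGORY NAME' line per code point in text."""
    lines = []
    for ch in text:
        # Some code points (e.g. control characters) have no name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
    return lines

# Three characters that all look like a "C": plain C, ©, and ℂ
print("\n".join(describe("C\u00a9\u2102")))
```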

37. navigate8310 ◴[14 Mar 25 07:44 UTC] No.43360410[source]▶

>>43359718 #

There was hi5 as well

38. MatthewWilkes ◴[14 Mar 25 07:54 UTC] No.43360462[source]▶

>>43360319 #

Absolutely. The argument that screen readers shouldn't gain a heuristic for identifying this kind of text and normalising it down to pronounceable words is just prescriptivism, in my view.

ALL CAPS, SpOnGeBoB cASe, clap emphasis, and others carry specific meanings in colloquial written language; the use of other letterlike symbols can too. These should be presented in an accessible form to the user, rather than demanding that people refrain from using them.
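As a sketch of what the simplest such heuristic might look like (an assumption on my part, not what any screen reader actually does): Unicode's NFKC normalization already folds the mathematical bold/italic/script alphabets back to plain letters. It does nothing for lookalikes borrowed from other scripts, such as the Cherokee letters used elsewhere in this thread, which would need a separate confusables table.

```python
import unicodedata

def normalize_styled(text: str) -> str:
    """Fold compatibility variants (e.g. mathematical bold) to plain letters."""
    return unicodedata.normalize("NFKC", text)

# "Hello" written with MATHEMATICAL BOLD letters (U+1D407, U+1D41E, ...)
styled = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428"
print(normalize_styled(styled))

# A Cherokee lookalike (U+13D6) has no compatibility mapping, so it is left alone.
print(normalize_styled("\u13D6") == "\u13D6")
```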

replies(1): >>43369255 #

39. worthless-trash ◴[14 Mar 25 08:01 UTC] No.43360499[source]▶

>>43359090 #

It also provides a way to post data on the public web in an obfuscated way that a human can read but automated search tools are likely not looking for.

Great method if you had short human-readable information that you didn't want AI to train on ;)

replies(1): >>43360835 #

40. peebeebee ◴[14 Mar 25 08:55 UTC] No.43360722[source]▶

>>43360319 #

For HTML, you can probably do the following:

  <span aria-label="my text">𖢑ꚲ 𖢧𖤟𖤗𖢧</span>

replies(2): >>43360772 #>>43361083 #

41. notpushkin ◴[14 Mar 25 09:03 UTC] No.43360772{3}[source]▶

>>43360722 #

If you want to use such an effect on your own website that’s probably the way to go (although I’d probably try to use real text in HTML and replace it with some CSS magic... or just use a web font).

replies(1): >>43360879 #

42. pona-a ◴[14 Mar 25 09:12 UTC] No.43360835{3}[source]▶

>>43360499 #

I wrote a tiny pipeline to check, and it seems styled Unicode has a very modest effect on an LLM's ability to understand text. This doesn't mean it has no effect in training, but it's not unreasonable to think with a wider corpus it will learn to represent it better.

  ~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
    | uniq --count | (print $in; $in) | enumerate
    | each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
    | uplot bar -d ":"

  ╭───┬──────────┬───────╮
  │ # │  value   │ count │
  ├───┼──────────┼───────┤
  │ 0 │ Banana   │    57 │
  │ 1 │ banana   │     1 │
  │ 2 │ Pancake  │     1 │
  │ 3 │ Bananana │     1 │
  ╰───┴──────────┴───────╯
     ┌                                        ┐
   0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
   1 ┤■ 1.0
   2 ┤■ 1.0
   3 ┤■ 1.0
     └                                        ┘

Notably, when repeated for gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates to a model's ability to decode Base64.

  ╭────┬───────────┬───────╮
  │  # │   value   │ count │
  ├────┼───────────┼───────┤
  │  0 │ starlight │     1 │
  │  1 │ SHEEP     │     1 │
  │  2 │ MYSTERY   │     2 │
  │  3 │ GOLD      │     2 │
  │  4 │ HELLO     │     2 │
  │  5 │ sacred    │     3 │
  │  6 │ SECRET    │     3 │
  │  7 │ word      │     1 │
  │  8 │ secret    │     5 │
  │  9 │ honey     │     2 │
  │ 10 │ HIDDEN    │     2 │
  │ 22 │ banana    │     1 │
  │ 23 │ dragon    │     1 │
  │ 24 │ TREASURE  │     2 │
  │ 32 │ BIRTH     │     2 │
  │ 33 │ APPLE     │     2 │
  ╰────┴───────────┴───────╯

I removed most count = 1 samples to make the comment shorter.

There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0], finding some tokens play an important part in recall and obfuscating them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.

[0] https://arxiv.org/abs/2412.21123

replies(1): >>43371924 #

43. Cthulhu_ ◴[14 Mar 25 09:21 UTC] No.43360879{4}[source]▶

>>43360772 #

For social media / forum sites etc, they should definitely add this. Make a plain text / accessible (user) name mandatory and a display name optional. And give end users the choice to show canonical name or display name.

44. tasuki ◴[14 Mar 25 09:37 UTC] No.43360966[source]▶

>>43360319 #

Forget about the blind - what about those with perfect vision? Looking at that website, I wish I were unable to see it!

45. tasuki ◴[14 Mar 25 09:41 UTC] No.43360979[source]▶

>>43358999 (OP) #

𖢧ꛅꛈꕷ ꛈꕷ 𖤬𖢧𖦪𖣠ꛕꛈ𖣠ꚶꕷ. ᎽᎾᏬ ᏕᎻᎾᏬᏝᎠ ᏰᎬ ᏕᎻᎾᏖ.

46. chrismorgan ◴[14 Mar 25 10:07 UTC] No.43361083{3}[source]▶

>>43360722 #

<https://www.w3.org/TR/using-aria/#practical-support-aria-lab...>:

> • Don't use aria-label or aria-labelledby on any other non-interactive content such as p, legend, li, or ul, because it is ignored.

> • Don't use aria-label or aria-labelledby on a span or div unless its given a role. When aria-label or aria-labelledby are on interactive roles (such as a link or button) or an img role, they override the contents of the div or span. Other roles besides Landmarks (discussed above) are ignored.

replies(1): >>43377137 #

47. BaudouinVH ◴[14 Mar 25 12:07 UTC] No.43361825[source]▶

>>43358999 (OP) #

The web site appears to be down at the moment. (12 07 UTC)

48. pjerem ◴[14 Mar 25 14:55 UTC] No.43363216[source]▶

>>43360233 #

⁺₊⋆ wнαт α vαlυαвlε coммεɴт ! ⋆⁺₊

49. ◴[14 Mar 25 22:08 UTC] No.43367788[source]▶

>>43358999 (OP) #

50. Rendello ◴[15 Mar 25 00:22 UTC] No.43368727{3}[source]▶

>>43360328 #

In UTF-8, your standard Latin characters are encoded just as they are in ASCII (1 byte each), and all UTF-8 characters are 1-4 bytes. Rendering the characters requires a font that covers them (for example, Comic Sans doesn't have Chinese characters). A website can rely on the users' installed fonts or specify a font in its CSS that the client will download, but in any case that's orthogonal to the encoding.

https://en.wikipedia.org/wiki/UTF-8#Description
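A quick way to see the 1-4 byte range for yourself, sketched in Python (the sample characters are my own picks):

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

samples = [
    "A",           # U+0041, ASCII range
    "\u00e9",      # U+00E9, Latin small e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001D400",  # U+1D400, mathematical bold capital A
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

The styled "font" letters mostly live outside the Basic Multilingual Plane, so they tend to cost 4 bytes each instead of 1, which is a size difference but not a download one.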

51. Rendello ◴[15 Mar 25 00:27 UTC] No.43368768[source]▶

>>43358999 (OP) #

Has everyone lost their inner joy? Sites like this have existed forever, and have always been fun to hack together. Think of Zalgo text.

52. xg15 ◴[15 Mar 25 02:01 UTC] No.43369255{3}[source]▶

>>43360462 #

That's true, but at some point, intention and accessibility will start to clash.

Like, there used to be that fad/meme of adding as many diacritics and other Unicode appendages to a text as possible. ("Cursed text" or something I think)

The diacritics will stack and turn the characters into monstrosities that will break the page layout and generally make the text look alien and distorted.

It also makes the text hard to read, which is the entire point.

But a screen reader is in something of a dilemma here: If it ignores the diacritics and just reads the text normally, then the "weirdness" will be missing and the text will appear out of context. To convey that, the reader would have to intentionally read the text in a distorted voice - but this will make it hard to understand and could lead to unease and confusion if the distortion starts without warning.

There is also the question whether we want unexpected tone shifts at all. Like, it would be semantically correct to read all caps text in a shouting voice, but do we really want screen readers to randomly start shouting?

(Edit: oh right, it was Zalgo, not cursed text)
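For the "ignore the diacritics" option, here is a minimal Python sketch of one possible approach: cap the number of combining marks per base character rather than stripping them all, so ordinary accents survive. `limit_marks` and its `max_marks` default are hypothetical names and values chosen for illustration.

```python
import unicodedata

def limit_marks(text: str, max_marks: int = 1) -> str:
    """Keep at most max_marks combining marks after each base character."""
    out = []
    run = 0  # combining marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # drop the excess stacked mark
        else:
            run = 0
        out.append(ch)
    # Recompose so that ordinary accented letters come out unchanged.
    return unicodedata.normalize("NFC", "".join(out))

# "e" with three stacked acute accents collapses to a plain "é"
print(limit_marks("e\u0301\u0301\u0301"))
```

A cap of 1 is too strict for scripts that legitimately stack several marks on one base, so the threshold would need tuning per language.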

53. worthless-trash ◴[15 Mar 25 11:51 UTC] No.43371924{4}[source]▶

>>43360835 #

Thank you for the details. I certainly will take a look.

54. rexpop ◴[15 Mar 25 15:11 UTC] No.43373022[source]▶

>>43358999 (OP) #

A misappropriation of Amharic labeled "Tribal Font" is plainly racist. Amharic is not "tribal" just because it is African. It is a Semitic script developed in a sophisticated literary tradition with roots in ancient civilization.

55. usr1106 ◴[15 Mar 25 21:10 UTC] No.43375185{3}[source]▶

>>43360248 #

It's the default browser, a Firefox derivative. But I guess the reason is just that the OS does not have fonts covering enough code points.

56. notpushkin ◴[16 Mar 25 05:45 UTC] No.43377137{4}[source]▶

>>43361083 #

Good point. Any ideas how to implement it properly?

replies(1): >>43378193 #

57. chrismorgan ◴[16 Mar 25 11:27 UTC] No.43378193{5}[source]▶

>>43377137 #

I have often wanted to do exactly this, and was disappointed when I learned aria-label couldn’t be used to replace the value exposed for non-interactive content. I have hunted for other techniques a couple of times, and never been completely satisfied, though things have improved in the last year and a bit.

The basic technique is roughly this:

  <span>
      <span aria-hidden="true">displayed text</span>
      <span inert class="visually-hidden">accessibility text</span>
  </span>

  <style>
      .visually-hidden {
          position: absolute;
          width: 1px;
          height: 1px;
          margin: -1px;
          overflow: hidden;
          clip-path: rect(0 0 0 0);
          -webkit-user-select: none;
          user-select: none;
          pointer-events: none;
      }
  </style>

The `inert` attribute is a recent addition which may exclude the accessibility text from find-in-page (maybe desirable, maybe undesirable, depending on the situation). Firefox and Chromium shipped that refinement of its behaviour in the last year and a half, Safari hasn’t yet (and seems to have reservations about the whole idea <https://bugs.webkit.org/show_bug.cgi?id=269909>).

You can also play with putting the accessibility text in a pseudoelement’s content (e.g. <span data-a11y-text=…><span aria-hidden=true>…</span></span> and [data-a11y-text]::after { content: attr(data-a11y-text); … }), which should these days be exposed in the accessibility tree, but Firefox find-in-page now includes generated content (though you can’t bridge real and generated content), and it wouldn’t surprise me if Chromium eventually followed suit, so I’m not convinced it’s worth the bother, especially if you lose `inert` or have to add an element anyway. But keeping it as an attribute instead of a separate element has some appeal.