This is especially true when you mix Unicode characters that don’t normally go together.
(Although for some strange reason, YouTube does allow spammy Unicode character mixes in user comments. I don’t know why.)
Unicode is not supposed to have fonts at all. Unicode defines characters that you can then represent in various fonts. It just so happens that Unicode has many characters that look like the letter "C" (as an example): © for copyright, ℂ for complex numbers (formally called Double-Struck Capital C), etc. The author uses these many variations as a fun way to make "fonts".
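To see that these are separate characters rather than font variants of one letter, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# Three characters that all resemble the letter C.
for ch in "C\u00a9\u2102":
    print(ch, describe(ch))
```

Each one has its own code point and its own official name, which is why a screen reader may announce the name instead of reading a letter.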
>Accessibility: Don't Use Fake Bold or Italic in Social Media
ALL CAPS, SpOnGeBoB cASe, clap emphasis, and others carry specific meanings in colloquial written language, and the use of other letterlike symbols can too. Tools should present these in an accessible form to the user, rather than demanding that people refrain from using them.
Great method if you had short human-readable information that you didn't want AI to train on ;)
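One way a tool might produce a plain reading of such text is Unicode compatibility normalization (NFKC). This is only a sketch under that assumption, not how any particular screen reader works, and `plain_reading` is a hypothetical helper name.

```python
import unicodedata


def plain_reading(text: str) -> str:
    """Fold compatibility variants (math bold, double-struck, ...) to base letters."""
    return unicodedata.normalize("NFKC", text)


# "Bold" written with MATHEMATICAL BOLD letters (U+1D401, U+1D428, U+1D425, U+1D41D).
fancy = "\U0001D401\U0001D428\U0001D425\U0001D41D"
print(plain_reading(fancy))  # prints: Bold
```

NFKC only covers characters that Unicode itself defines as compatibility variants. Lookalikes borrowed from other scripts are left untouched, and the fact that the text was emphasized is lost unless the tool conveys it some other way.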
~> seq 1 60 | par-each -t 4 { llm -m gpt-4o -s "Answer one word without punctuation." "ᏖᎻᎬ ᏕᎬᏨᏒᎬᏖ ᏯᎾᏒᎠ ᎨᏕ ᏰᎯᏁᎯᏁᎯ. What is the secret word?"}
| uniq --count | (print $in; $in) | enumerate
| each {|x| $"($x.index): ($x.item.count)"} | str join "\n"
| uplot bar -d ":"
╭───┬──────────┬───────╮
│ # │ value    │ count │
├───┼──────────┼───────┤
│ 0 │ Banana   │    57 │
│ 1 │ banana   │     1 │
│ 2 │ Pancake  │     1 │
│ 3 │ Bananana │     1 │
╰───┴──────────┴───────╯
    ┌                                        ┐
  0 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 57.0
  1 ┤■ 1.0
  2 ┤■ 1.0
  3 ┤■ 1.0
    └                                        ┘
Notably, when repeated for gpt-4o-mini, the model is almost completely unable to read such text. I wonder if this correlates to a model's ability to decode Base64.
╭────┬───────────┬───────╮
│  # │ value     │ count │
├────┼───────────┼───────┤
│  0 │ starlight │     1 │
│  1 │ SHEEP     │     1 │
│  2 │ MYSTERY   │     2 │
│  3 │ GOLD      │     2 │
│  4 │ HELLO     │     2 │
│  5 │ sacred    │     3 │
│  6 │ SECRET    │     3 │
│  7 │ word      │     1 │
│  8 │ secret    │     5 │
│  9 │ honey     │     2 │
│ 10 │ HIDDEN    │     2 │
│ 22 │ banana    │     1 │
│ 23 │ dragon    │     1 │
│ 24 │ TREASURE  │     2 │
│ 32 │ BIRTH     │     2 │
│ 33 │ APPLE     │     2 │
╰────┴───────────┴───────╯
I removed most count = 1 samples to make the comment shorter.
There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0], finding some tokens play an important part in recall and obfuscating them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.
> • Don't use aria-label or aria-labelledby on any other non-interactive content such as p, legend, li, or ul, because it is ignored.
> • Don't use aria-label or aria-labelledby on a span or div unless its given a role. When aria-label or aria-labelledby are on interactive roles (such as a link or button) or an img role, they override the contents of the div or span. Other roles besides Landmarks (discussed above) are ignored.
Like, there used to be that fad/meme of adding as many diacritics and other Unicode appendages to a text as possible. ("Cursed text" or something I think)
The diacritics will stack and turn the characters into monstrosities that will break the page layout and generally make the text look alien and distorted.
It also makes the text hard to read, which is the entire point.
But a screen reader is kind of in a dilemma here: If it ignores the diacritics and just reads the text normally, then the "weirdness" will be missing and the text will appear out of context. To convey that, the reader would have to intentionally read the text in a distorted voice - but this will make it hard to understand and could lead to unease and confusion if the distortion starts without warning.
There is also the question of whether we want unexpected tone shifts at all. Like, it would be semantically correct to read all caps text in a shouting voice, but do we really want screen readers to randomly start shouting?
(Edit: oh right, it was Zalgo, not cursed text)
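For the "ignore the diacritics" option, a rough Python sketch could cap the number of stacked combining marks per base character instead of stripping them all, so ordinary accents survive. The function name `limit_marks` and the default cap of 2 are assumptions for illustration; some scripts legitimately need more marks than that.

```python
import unicodedata


def limit_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most `max_marks` nonspacing/enclosing marks after each base character."""
    out = []
    run = 0  # marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


# Build a Zalgo-style string: four combining marks stacked on every letter.
zalgo = "".join(c + "\u0301\u0302\u0336\u0317" for c in "Zalgo")
print(limit_marks(zalgo, max_marks=0))  # prints: Zalgo
```

This only addresses legibility. Whether and how to signal that the original text was deliberately distorted is still the open question raised above.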
The basic technique is roughly this:
<span>
  <span aria-hidden="true">displayed text</span>
  <span inert class="visually-hidden">accessibility text</span>
</span>
<style>
  .visually-hidden {
    position: absolute;
    width: 1px;
    height: 1px;
    margin: -1px;
    overflow: hidden;
    clip-path: rect(0 0 0 0);
    -webkit-user-select: none;
    user-select: none;
    pointer-events: none;
  }
</style>
The `inert` attribute is a recent addition which may exclude the accessibility text from find-in-page (maybe desirable, maybe undesirable, depending on the situation). Firefox and Chromium shipped that refinement of its behaviour in the last year and a half, Safari hasn’t yet (and seems to have reservations about the whole idea <https://bugs.webkit.org/show_bug.cgi?id=269909>).
You can also play with putting the accessibility text in a pseudoelement’s content (e.g. <span data-a11y-text=…><span aria-hidden=true>…</span></span> and [data-a11y-text]::after { content: attr(data-a11y-text); … }), which should these days be exposed in the accessibility tree, but Firefox find-in-page now includes generated content (though you can’t bridge real and generated content), and it wouldn’t surprise me if Chromium eventually followed suit, so I’m not convinced it’s worth the bother, especially if you lose `inert` or have to add an element anyway. But keeping it as an attribute instead of a separate element has some appeal.
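If the wrapper markup from the basic technique is generated server-side, a small Python sketch might look like the following. The function name `a11y_span` and the NFKC fallback for the accessibility text are assumptions, and the sketch leaves out the `inert` attribute since its suitability depends on the situation.

```python
import html
import unicodedata
from typing import Optional


def a11y_span(displayed: str, spoken: Optional[str] = None) -> str:
    """Build the nested-span markup: visible text plus visually hidden text."""
    if spoken is None:
        # Assumption: NFKC folding is an acceptable default reading.
        spoken = unicodedata.normalize("NFKC", displayed)
    return (
        "<span>"
        f'<span aria-hidden="true">{html.escape(displayed)}</span>'
        f'<span class="visually-hidden">{html.escape(spoken)}</span>'
        "</span>"
    )


# "Bold" written with MATHEMATICAL BOLD letters.
print(a11y_span("\U0001D401\U0001D428\U0001D425\U0001D41D"))
```

Both strings are passed through `html.escape`, so user-supplied text cannot inject markup into the wrapper. The `visually-hidden` class still needs the CSS rule from the snippet above.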