(dgl.cx)

178 points dgl | 1 comments | 24 Jun 25 02:13 UTC

duped ◴[24 Jun 25 06:35 UTC] No.44363404[source]▶

In my fever dreams of maintaining UTF-8-supporting text widgets that work and never need to be updated, there's a zero-width whitespace grapheme cluster that encodes the number of code points in the next grapheme cluster whenever that count differs from the previous one.

The situation today is basically the same as null-terminated C strings. Except worse, because the null-terminator problem can at least be defined and solved in linear time/space without needing to keep an up-to-date set of tables.

replies(3): >>44363439 #>>44363883 #>>44366323 #

account42 ◴[24 Jun 25 08:07 UTC] No.44363883[source]▶

>>44363404 #

This has nothing to do with UTF-8, which doesn't and shouldn't care about anything beyond mapping bytes to code points.

But even as an addition to Unicode, your proposal would make text stateful (even over long distances), which is a really bad idea.
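
For illustration only, here is a minimal Python sketch of the layering described above: UTF-8 turns bytes into code points and stops there. The sample string and variable names are assumptions chosen for the example, not anything from the linked article.

```python
# Illustrative sample: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
text = "e\u0301"

# Layer 1: UTF-8 maps code points to bytes and back. Nothing more.
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

# Layer 2: the code points themselves, as seen after decoding.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(len(utf8_bytes))  # number of bytes
print(code_points)      # code points recovered from those bytes
```

Whether those two code points are displayed as one user-perceived character is decided above the encoding layer, by Unicode segmentation rules and their data tables, which the Python standard library does not ship.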

replies(2): >>44363971 #>>44366527 #

1. CamouflagedKiwi ◴[24 Jun 25 08:23 UTC] No.44363971[source]▶

>>44363883 #

Combining characters have already made Unicode text stateful.

I do agree, though, that encoding length hints into the text seems like a bad idea: it creates an opportunity for the encoding to disagree with the reality of the text. You need _some_ way of handling it if it says that the next grapheme cluster is four characters long but it's actually only three.
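
A rough Python sketch of both points, under loud assumptions: `crude_clusters` is a hypothetical helper that only attaches combining marks to the preceding base code point (it is not full Unicode grapheme segmentation), and `hint_matches` is a hypothetical check for the kind of length hint proposed upthread.

```python
import unicodedata

def crude_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Crude approximation only: real grapheme segmentation (UAX #29)
    handles many more cases, such as emoji ZWJ sequences.
    """
    clusters = []
    for ch in text:
        # A nonzero combining class means this code point modifies what
        # came before it, so its meaning depends on earlier state.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def hint_matches(text, hinted_len):
    """Hypothetical check: does a length hint agree with the first cluster?"""
    clusters = crude_clusters(text)
    return bool(clusters) and len(clusters[0]) == hinted_len

# 'a' + COMBINING DIAERESIS + COMBINING DOT BELOW, then a plain 'b'.
sample = "a\u0308\u0323b"

print([len(c) for c in crude_clusters(sample)])
print(hint_matches(sample, 4))  # the hint claims four code points
print(hint_matches(sample, 3))  # the cluster actually has three
```

A renderer that trusted the hint would still need a policy for the mismatch case, for example ignoring the hint and falling back to its own segmentation.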

Can your terminal do emojis? How big?