331 points willm | 14 comments
1. rivo ◴[] No.41216673[source]
It's funny how every TUI developer eventually stumbles over Unicode, and then handling international characters and emojis correctly turns into its own project of about the same scope as (or even bigger than) the original TUI project. It happened to me on rivo/tview and, through the resulting rivo/uniseg package, I learned that all other TUI library maintainers deal with the same issues. In the end, everyone invents their own unique solution to the problem because character width is not standardized and terminals are messy, as noted in the article. OP simply supports Unicode 9 only (Unicode is at version 15.1 at the moment). Sooner or later, however, users will complain that certain emojis or international characters are not rendered correctly. So I'm not sure that this is a great solution.
replies(2): >>41217049 #>>41218008 #
2. roland35 ◴[] No.41217049[source]
Are there any libraries in place which can normalize all emojis down to a single symbol?
replies(2): >>41217636 #>>41220089 #
3. kevindamm ◴[] No.41217636[source]
It's a design decision. On one end, if I'm reading your question correctly, you could use 0xFFFD (the replacement character) for anything not recognized as language-specific characters in the BMP and SMPs. This can be done within practically all existing Unicode libraries by filtering on character class, but it will inadvertently filter some non-emoji symbols and doesn't really convey any information. It can even look unprofessional; it reminds me a lot of the early web during the pre-Unicode growing pains of poorly implemented i18n/l10n.
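
A minimal sketch of that filtering approach, using Python's standard `unicodedata` module. The set of general categories to keep and the name `replace_unrecognized` are assumptions made for illustration, not something prescribed by any particular library:

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Assumption: keep letters (L*), marks (M*), numbers (N*), punctuation (P*)
# and space separators (Zs). Everything else becomes U+FFFD.
KEEP_PREFIXES = ("L", "M", "N", "P", "Zs")


def replace_unrecognized(text: str) -> str:
    """Replace every character outside the kept categories with U+FFFD."""
    out = []
    for ch in text:
        # ASCII is always kept so that plain symbols like "+" survive.
        if ch.isascii() or unicodedata.category(ch).startswith(KEEP_PREFIXES):
            out.append(ch)
        else:
            out.append(REPLACEMENT)
    return "".join(out)
```

As the comment warns, this also swallows non-emoji symbols: the copyright sign has general category `So`, the same as most emoji, so it is replaced too.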

There are libraries like Unidecode[0py] [0go] [0js] which convert from unicode to ASCII text that might be easiest to include in a TUI. All the ones I looked at will convert emoji to `[?]` but many other characters are converted to that, too, including unknowns.

On the other end you can keep a running list of what you mean by emoji[1] and pattern match on those characters, then substitute for a representative emoji. But it will still pose some difficulty around what to choose for the representative symbol and how to make it fit nicely within a TUI. An example of a library for pattern-matching on emoji is emoji-test-regex-pattern[2] but you can see it is based on a txt file that needs to be updated to correspond with additions to Unicode.

[0py]: https://github.com/avian2/unidecode

[0go]: (actually there are a few of these) https://pkg.go.dev/github.com/gosimple/unidecode

[0js]: https://github.com/xen0n/jsunidecode

[1]: these aren't really contiguous ranges, and opinions vary, see https://en.m.wikipedia.org/wiki/Emoji#Unicode_blocks

[2]: https://github.com/mathiasbynens/emoji-test-regex-pattern

replies(1): >>41217966 #
4. estebank ◴[] No.41217966{3}[source]
There's a "trick" that works somewhat well for some compound emoji like "family": replace ZWJs with whitespace. Emoji width is not standardized because it depends on platform, fonts available, shell and terminal emulator, but almost no terminal supports compound emoji correctly. Because of how they were designed, most terminals will print the emoji as its component parts. If you need to do something like underline a piece of text (like rustc has to) we decompose them ourselves, and then it is a more tractable problem to know what the width of a char is (0, 1 or 2, and var width for tabs, which we just transform to a hardcoded 4—incorrect but usable). This can still be incorrect, on specific terminals, but works well enough on most.
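
A rough sketch of that decomposition in Python, not rustc's actual implementation. The names `char_width` and `display_width` are hypothetical, and the width rule (East Asian Wide/Fullwidth counts as 2, marks and format characters as 0) is a simplifying assumption built on the standard `unicodedata` module:

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
TAB_WIDTH = 4    # hardcoded, as described above: incorrect but usable


def char_width(ch: str) -> int:
    """Estimate the number of terminal cells for one code point."""
    if ch == "\t":
        return TAB_WIDTH
    # Combining marks and format characters take no cell of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Wide and Fullwidth characters (including most emoji) take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_width(text: str) -> int:
    """Replace ZWJs with spaces, then sum the per-character widths."""
    decomposed = text.replace(ZWJ, " ")
    return sum(char_width(ch) for ch in decomposed)
```

With this rule, a man-ZWJ-woman-ZWJ-girl "family" sequence is counted as three wide emoji separated by two spaces, which matches what a terminal without compound emoji support would print.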
5. rockorager ◴[] No.41218008[source]
The state of the art here is to detect mode 2027, and enable it when supported. This lets you know the terminal will handle graphemes properly.

I maintain two TUI libraries which use this technique and emoji support has been (nearly) great. (One of which uses your uniseg library!)

https://mitchellh.com/writing/grapheme-clusters-in-terminals
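
For illustration, detection uses the standard DEC private mode query (DECRQM): the program writes `CSI ? 2027 $ p` and the terminal answers with `CSI ? 2027 ; Ps $ y`. The sketch below only builds the sequences and parses a reply string; reading the reply from the terminal (raw mode, timeouts) is left out, and the names `QUERY`, `ENABLE` and `mode_supported` are made up for this example:

```python
import re

MODE = 2027
QUERY = f"\x1b[?{MODE}$p"   # DECRQM: ask the terminal for the mode's state
ENABLE = f"\x1b[?{MODE}h"   # DECSET: turn the mode on once it is known to exist

# DECRPM reply: CSI ? <mode> ; <state> $ y
_REPORT = re.compile(r"\x1b\[\?(\d+);(\d)\$y")


def mode_supported(reply: str, mode: int = MODE) -> bool:
    """Return True if the reply says the mode is recognized and can be set."""
    m = _REPORT.search(reply)
    if m is None or int(m.group(1)) != mode:
        return False
    # State 0 = not recognized, 4 = permanently reset.
    # States 1 (set), 2 (reset) and 3 (permanently set) mean it is usable.
    return int(m.group(2)) in (1, 2, 3)
```

A program would write `QUERY`, read the reply, and write `ENABLE` only if `mode_supported(reply)` is true; a terminal that stays silent is treated as not supporting the mode.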

replies(3): >>41218432 #>>41220186 #>>41220385 #
6. sweeter ◴[] No.41218432[source]
Really great read, thanks. I'm a little disappointed that no terminal emulator implements both the Kitty image protocol and mode 2027. I wish there were a terminal project that would just pick the best standards we have at the moment. I'm not a fan of sixel for a lot of reasons. I'm looking forward to trying Ghostty, though.
replies(2): >>41218556 #>>41219418 #
7. rockorager ◴[] No.41218556{3}[source]
Ghostty has both mode 2027 and kitty image protocol
8. eNV25 ◴[] No.41219418{3}[source]
wezterm has both according to the above article
9. Joker_vD ◴[] No.41220089[source]
It doesn't matter; what matters is that both your (terminal-manipulating) program and the terminal emulator agree on the symbol widths. Considering that they usually won't (lots of terminal emulators have their own hand-crafted, statically linked wcwidth/wcswidth functions; the readline library also has them hard-coded, by the way), it's quite frustrating.
replies(1): >>41222538 #
10. Joker_vD ◴[] No.41220186[source]
I personally don't think this mode is all that useful, to be fair. First of all, grapheme clustering is not set in stone; it has changed from one version of the Unicode standard to another.

Second, and this is mostly because my personal use cases are very humble, a much, much simpler workaround to implement, for everyone involved, would be a couple of OSC sequences which would mark a part of the output text as the prompt (when the terminal is in canonical/cooked mode), so that a huge chunk of readline could simply be thrown away.

So your program could just print a prompt, and then simply read the cooked line. In the meantime, the terminal emulator would handle line editing, line-wrapping and asynchronous output: if you keep outputting text to the terminal while a prompt is active, the terminal would clear the prompt and the unfinished line, print the text, then re-display the prompt and the line. That is basically what all "async readline" libraries do already with rl_clear/rl_redisplay, but doing it in the terminal would take care of this properly, because the terminal definitely knows how wide it itself thinks all the symbols are. And tab completion could be supported by returning a <TAB>-terminated line to the program, instead of an <LF>-terminated line.

Unfortunately, I don't think something like this can actually become even moderately widely adopted.

Edit: Or, you know, maybe we could extend terminfo? Like, introduce a twcswidth() function that would take your string, plus the (somehow encoded) Unicode grapheme clustering data that the current terminal is actually using, which you could query from terminfo, and return the number of screen cells the string would take on this terminal.

replies(1): >>41220677 #
11. kevin_thibedeau ◴[] No.41220385[source]
That isn't sufficient. Codepoints with ambiguous width can't be detected in a standard way. A large number of pre-emoji symbols have been upgraded to have emoji presentation. Some systems default them to emojis with wide rendering, others maintain the text presentation with narrow rendering. Many systems ignore the presentation selectors if you want to force it.
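
A small sketch of what a program can and cannot learn from the Unicode data alone, using Python's standard `unicodedata` module. The helper names are hypothetical. East Asian Width class `A` only says "ambiguous", and the presence of a presentation selector says what was requested, not what the terminal will actually draw:

```python
import unicodedata

VS15 = "\ufe0e"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation


def is_ambiguous(ch: str) -> bool:
    """True if the code point has East Asian Width class 'A' (Ambiguous)."""
    return unicodedata.east_asian_width(ch) == "A"


def has_presentation_selector(text: str) -> bool:
    """True if the text contains VS15 or VS16 anywhere."""
    return VS15 in text or VS16 in text
```

Neither check tells you how many cells a given terminal will use; they only flag the characters where program and terminal are most likely to disagree.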
12. ori_b ◴[] No.41220677{3}[source]
At this point, why not just write a graphical program?
replies(1): >>41221086 #
13. Joker_vD ◴[] No.41221086{4}[source]
Unfortunately, servers are usually configured without X forwarding enabled. And the functionality I am describing already exists for terminal-based programs; it's been reimplemented multiple times (see e.g. [0][1]). It's just implemented with horrible hackery: manually driving the terminal, processing raw input and counting how many lines of text the terminal screen is probably displaying right now.

I just want to, e.g. write a simple Python program that has

    for line in streaming_response.lines():
        print(line)
in one thread, and

    while True:
        cmd = input('> ').strip()
        if cmd == 'q':
            break
        if cmd == 'stop':
            requests.post(...)
        ...
in another, and be able to input my commands without the echo of my input being torn up by the output. Erlang's shell can do that. Readline can be used to do that, but Python's bindings don't export the needed functions. Swapping out sys.stdout/sys.stdin with my custom interceptors to do this manually... barely works, and is slow, ugly as hell and complicated.

[0] https://github.com/thejoshwolfe/consoline

[1] https://github.com/erlang/otp/blob/90a48ae2bff26d5df67ceaa7e...

14. ku1ik ◴[] No.41222538{3}[source]
This.