    178 points dgl | 15 comments
    1. b0a04gl ◴[] No.44362767[source]
    emoji width bugs mostly come down to how terminals interpret Unicode's "grapheme clusters" vs "codepoints" vs "display cells". emoji isn't one codepoint - it's often multiple joined by zero-width joiners, variation selectors, skin tone modifiers. so the terminal asks wcwidth(), gets 1 or 2, but the actual glyph might render wider or combine into a single shape.

    some emoji even change width depending on font. family emoji is like 7 codepoints, shows up as one glyph. most terminals don't track that. they just count codepoints and pray.

    unless the terminal is using a grapheme-aware renderer and syncs with the font's shaping and rendering stack (like HarfBuzz with FreeType, or Core Text), it'll always guess wrong. wezterm and kitty kinda parse it right often.
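To make the codepoint vs. grapheme distinction concrete, here is a small Python sketch (standard library only) that takes a four-person family emoji apart. The particular sequence (man, woman, girl, boy) is an assumption picked for illustration; the comment above doesn't name one.

```python
import unicodedata

# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY: one user-perceived character.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per codepoint in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


if __name__ == "__main__":
    for entry in describe(FAMILY):
        print(entry)
    print("codepoints:", len(FAMILY))
    print("UTF-8 bytes:", len(FAMILY.encode("utf-8")))
```

Counting codepoints gives 7 here, while a font with a composed glyph draws a single shape, which is the gap a per-codepoint width lookup can't see.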

    replies(4): >>44362822 #>>44363348 #>>44363640 #>>44363828 #
    2. crackalamoo ◴[] No.44362822[source]
    Yeah, unfortunately I feel like despite all the advances in Unicode tech, my modern terminal (macOS) still bugs out badly with emojis and certain special characters.

    I'm not sure how/when codepoints matter for wcwidth: my terminal handles many characters with more than one codepoint in UTF-8, like é and even Arabic characters, just fine.
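One thing that may help here: UTF-8 byte count and codepoint count are different quantities. The precomposed é is a single codepoint (U+00E9) that happens to take two bytes in UTF-8; only its decomposed form is two codepoints. A quick check in Python, standard library only (the helper name `stats` is made up for this sketch):

```python
import unicodedata

composed = "\u00e9"                                  # é as one codepoint
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def stats(s: str) -> tuple[int, int]:
    """Return (number of codepoints, number of UTF-8 bytes)."""
    return len(s), len(s.encode("utf-8"))


if __name__ == "__main__":
    print("composed:  ", stats(composed))
    print("decomposed:", stats(decomposed))
```

`wcwidth` only ever sees codepoints, so the byte count never enters into it.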

    replies(1): >>44363380 #
    3. duped ◴[] No.44363348[source]
    Why do you need to sync with the shaping engine?

    TBH grapheme clusters are annoying but day 1 learning material for a text display widget that supports beyond ascii. It honestly irks me how many things just fuck it up, because it's not an intractably hard problem - just annoying enough to be intractable for people that are lazy (*).

    (*) the actually hard problem with grapheme clusters is that they're potentially unbounded in length and the standard is mutable, so your wcwidth() implementation needs to be updated along with standards to stay valid, particularly with emoji. This basically creates a software maintenance burden out of aether.

    replies(2): >>44363609 #>>44364807 #
    4. o11c ◴[] No.44363380[source]
    `wcwidth` works by assigning all codepoints (strictly, code units of whatever size `wchar_t` is on your system, but thankfully modern Unixen are sane) a width of -1 (error), 0 (combining), 1 (narrow), or 2 (wide).
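A rough sketch of that four-way classification, written against Python's `unicodedata` tables rather than the C library. This is an approximation for illustration only: `approx_wcwidth` is a made-up name, its rules are simplified assumptions, and its answers can differ from what your system's `wcwidth` returns.

```python
import unicodedata


def approx_wcwidth(ch: str) -> int:
    """Rough stand-in for wcwidth(): -1, 0, 1 or 2 for a single codepoint."""
    if ch == "\0":
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":                   # C0/C1 control characters: not printable
        return -1
    if category in ("Mn", "Me", "Cf"):     # nonspacing/enclosing marks, format chars such as ZWJ
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                           # wide or fullwidth
    return 1                               # everything else counts as narrow


if __name__ == "__main__":
    for ch in ("a", "\u4e2d", "\u0301", "\u200d", "\x07"):
        print(f"U+{ord(ch):04X}", approx_wcwidth(ch))
```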

    `wcswidth` could in theory work across multiple codepoints, but its API is braindead and cannot deal with partial errors.

    This is all from the perspective of what the application expects to output. What the terminal itself does might be something completely different - decomposed Hangul in particular tends to lead to rendering glitches in curses-based terminal programs.

    This is also different from what the (monospace) font expects to be rendered as. At least it has the excuse of not being able to call the system's `wcwidth`.

    Note that it is always a mistake to call an implementation of `wcwidth` other than the one provided by the OS, since that introduces additional mismatches, unless you are using a better API that calculates bounds rather than an exact width. I posted an oversimplified sketch (e.g. it doesn't include versioning) of that algorithm a while back ...

    https://news.ycombinator.com/item?id=43851532

    replies(1): >>44363629 #
    5. inetknght ◴[] No.44363609[source]
    > This basically creates a software maintenance burden out of aether.

    So... basically all modern software?

    6. PhilipRoman ◴[] No.44363629{3}[source]
    As a fallback, you can also just emit the character and see how far the cursor advanced via CSI 6n (try printf '\x1b[6n')
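A minimal sketch of that measurement in Python, assuming a Unix tty on stdin/stdout and a terminal that answers CSI 6n with `ESC [ row ; col R`. The function names are invented for this example, and it ignores line wrapping, has no timeout (it blocks if the terminal never replies), and does not separate the reply from keys the user types in the meantime.

```python
import os
import re
import sys

CPR_QUERY = b"\x1b[6n"                          # ask the terminal for the cursor position
CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")  # reply: ESC [ row ; col R (1-based)


def parse_cpr(reply: bytes) -> tuple[int, int]:
    """Extract (row, col) from a cursor position report."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError(f"no cursor position report in {reply!r}")
    return int(match.group(1)), int(match.group(2))


def read_cpr(fd: int) -> tuple[int, int]:
    """Send the query and read from fd until a full report has arrived."""
    os.write(sys.stdout.fileno(), CPR_QUERY)
    buf = b""
    while CPR_REPLY.search(buf) is None:
        chunk = os.read(fd, 1)
        if not chunk:
            raise EOFError("terminal closed before replying")
        buf += chunk
    return parse_cpr(buf)


def measured_width(text: str) -> int:
    """Print text and return how many columns the cursor moved (no wrapping assumed)."""
    import termios
    import tty

    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)                           # so the reply is not echoed or line-buffered
        _, before = read_cpr(fd)
        os.write(sys.stdout.fileno(), text.encode("utf-8"))
        _, after = read_cpr(fd)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return after - before


if __name__ == "__main__" and sys.stdin.isatty():
    width = measured_width("\u4e2d")
    print(f"\r\nterminal advanced {width} column(s)")
```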
    replies(1): >>44363692 #
    7. Joker_vD ◴[] No.44363640[source]
    The main problem is not even whether the terminal itself can track the grapheme width "correctly". It's a) the fonts suck; b) does the terminal user track the width correctly?

    About a): some fonts have the glyphs for e.g. the playing cards block that are 1.5 columns wide even though the code points themselves are defined to be Narrow. How do you render that properly? Then there are variation selectors: despite what some may think, they don't affect the East Asian Width of the preceding code point, so whether you print "\N{ALEMBIC}\N{VARIATION SELECTOR-15}" or "\N{ALEMBIC}\N{VARIATION SELECTOR-16}", it still, according to wcwidth(), takes 1 column; but fonts have glyphs that are, again, 1.5 and 2 cells wide.
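The variation selector point can be checked against the Unicode tables shipped with Python. This only shows the character properties (East Asian Width and general category); it says nothing about how any particular font draws the glyph. The helper name `properties` is made up for the sketch.

```python
import unicodedata

ALEMBIC = "\N{ALEMBIC}"
VS15 = "\N{VARIATION SELECTOR-15}"   # requests text presentation
VS16 = "\N{VARIATION SELECTOR-16}"   # requests emoji presentation


def properties(ch: str) -> tuple[str, str]:
    """Return (East Asian Width, general category) for one codepoint."""
    return unicodedata.east_asian_width(ch), unicodedata.category(ch)


if __name__ == "__main__":
    for label, ch in (("ALEMBIC", ALEMBIC), ("VS15", VS15), ("VS16", VS16)):
        print(label, properties(ch))
```

The base character is Neutral (so narrow), and both selectors are nonspacing marks that contribute zero width, so a per-codepoint sum comes out the same for either sequence.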

    And then there is the elephant in the room, problem b): management of the cursor position. The terminal and the app that uses the terminal need to have exactly the same idea of where the cursor is, or e.g. readline can't reliably function, or colorful grep output. You need to know how many lines of text you've output (to be able to erase them properly), and whether the cursor is at the leftmost column (because of \b semantics) or at the rightmost column (because xenl is a thing) or neither. And no, requesting the cursor position report from the terminal doesn't really work, it's way too slow and it's interspersed with the user input.
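To show what "the same idea of where the cursor is" involves on the application side, here is a toy cursor model in Python. It is a sketch under stated assumptions, not how any particular terminal or readline works: it models a deferred wrap at the right margin (one common flavour of xenl), assumes a wide character that doesn't fit wraps before it is drawn, and ignores scrolling, control characters, and tabs. `CursorModel` and `final_position` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CursorModel:
    cols: int                    # terminal width in cells
    row: int = 0
    col: int = 0                 # 0-based column of the cursor
    pending_wrap: bool = False   # set after the last column was filled (deferred wrap)

    def put(self, width: int) -> None:
        """Account for one printed character occupying `width` cells."""
        if width <= 0:
            return               # zero-width characters do not move the cursor
        if self.pending_wrap or self.col + width > self.cols:
            self.row += 1        # wrap to the start of the next line first
            self.col = 0
            self.pending_wrap = False
        self.col += width
        if self.col == self.cols:
            self.col = self.cols - 1   # cursor stays on the last cell...
            self.pending_wrap = True   # ...and wraps only when more text arrives


def final_position(widths: list[int], cols: int) -> tuple[int, int]:
    """Replay a list of per-character widths and return (row, col)."""
    model = CursorModel(cols=cols)
    for width in widths:
        model.put(width)
    return model.row, model.col


if __name__ == "__main__":
    # Same five characters, two different beliefs about their width.
    print("app assumes narrow:  ", final_position([1] * 5, cols=8))
    print("terminal draws wide: ", final_position([2] * 5, cols=8))
```

If the application replays its output with one set of widths and the terminal uses another, the two end up on different rows, which is exactly when erasing "the lines you printed" goes wrong.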

    The TUI paradigm really breaks down completely the moment the client is unsure how its output affects the cursor movement in the terminal. And terminals don't help much either! Turning off autowrap is mostly useless (the excess output is not discarded, it overwrites the rightmost column instead), the autobackwrap (to make \b go to the previous line from the leftmost column) is almost unsupported and has its own issues, there is no simple command/escape sequence to go to the rightmost column... Oh, and there is xenl behaviour, which has many different subtle variations, and which the original VT100 didn't even properly have despite what the terminfo manual page may tell you. You can try it with the terminal emulator mentioned in TFA for yourself: go to setup, press 4, 5, move with right arrow to the block 3 and turn the second bit in it on by pressing 6 so it looks like "3 0100", exit setup (what you did is put the terminal into the local mode so you can input text to it from your keyboard and turned the autowrap on), then do ESC, print "[1;79Hab", do LINE-FEED, print "cd". You'll see that there is an empty line which shouldn't really be there, and it is not there if you do e.g. printf "\033[1;1Hxx\033[1;79Hab\ncd" on xterm (ironic, given how xterm's maintainer prides themself on being very faithful to original VT100 behaviour) or any other modern terminal.

    8. o11c ◴[] No.44363692{4}[source]
    Doing that adds a lot of round trips, so you still really need to do the initial estimate.

    (also, probing for whether the terminal actually supports various features is nontrivial. At startup you can send the basic "identify the terminal" sequences (there are 2) and check the result with a timeout; subsequently you can make a request then follow it with the basic terminal id to see if you actually get what you requested. But remember you can get arbitrary normal input interspersed.)
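The "request, then follow it with the terminal id" trick can be sketched as a pure parsing step in Python. Assumptions, all mine: the feature being probed is the cursor position report (CSI 6n), the sentinel is Primary Device Attributes (CSI c, answered as `ESC [ ? ... c`), and `classify_probe` is a hypothetical helper. Writing `PROBE` to the terminal and reading with a timeout are left out.

```python
import re

CPR_QUERY = b"\x1b[6n"          # the feature being probed: cursor position report
DA1_QUERY = b"\x1b[c"           # Primary Device Attributes, used as the sentinel
PROBE = CPR_QUERY + DA1_QUERY   # write both to the terminal in one go

CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")
DA1_REPLY = re.compile(rb"\x1b\[\?[\d;]*c")


def classify_probe(stream: bytes):
    """Interpret the bytes read back after sending PROBE.

    Returns None while the sentinel reply has not arrived (keep reading, or
    give up on timeout). Otherwise returns (position, typed): position is
    (row, col) if the probed feature answered before the sentinel, else None;
    typed is whatever else was in the stream, e.g. keys the user pressed.
    """
    da1 = DA1_REPLY.search(stream)
    if da1 is None:
        return None
    cpr = CPR_REPLY.search(stream, 0, da1.start())
    if cpr is None:
        return None, stream[:da1.start()] + stream[da1.end():]
    position = (int(cpr.group(1)), int(cpr.group(2)))
    typed = stream[:cpr.start()] + stream[cpr.end():da1.start()] + stream[da1.end():]
    return position, typed


if __name__ == "__main__":
    print(classify_probe(b"x\x1b[5;10Ry\x1b[?62;1c"))
    print(classify_probe(b"abc\x1b[?1;2c"))
```

The point of the sentinel is that once its reply shows up, a missing answer to the first query means "unsupported" rather than "still in flight".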

    9. account42 ◴[] No.44363828[source]
    It's more down to whatever monospace font the terminal uses not having those emojis and the (likely proportional) font they come from giving them a different width.
    10. zarzavat ◴[] No.44364807[source]
    > Why do you need to sync with the shaping engine?

    GP explained already. Grapheme clusters ≠ glyphs. To find the number of glyphs you need the font.

    An emoji can render as one or two or three or more glyphs depending on what font the user has installed, because many emoji are formed by joining two or more emoji with a ZWJ.

    (Also even in a monospace font not all glyphs are of ﷽ equal width)
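A small Python sketch of why the font matters for ZWJ sequences: the same cluster can come out as one fused glyph or as its separate component emoji. The two-cells-per-emoji figure and the helper names are assumptions for illustration, not something any terminal guarantees.

```python
ZWJ = "\u200d"


def zwj_components(cluster: str) -> list[str]:
    """Split a ZWJ sequence into the pieces a font may fall back to drawing."""
    return [part for part in cluster.split(ZWJ) if part]


def cell_bounds(cluster: str, cells_per_emoji: int = 2) -> tuple[int, int]:
    """Return (min, max) cells: fused into one glyph vs. drawn as separate emoji."""
    count = len(zwj_components(cluster))
    return cells_per_emoji, cells_per_emoji * count


if __name__ == "__main__":
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man, woman, girl
    print(len(zwj_components(family)), "components")
    print("cells (min, max):", cell_bounds(family))
```

Without knowing which of the two the renderer picked, an application can only bound the width, not state it.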

    replies(3): >>44365038 #>>44366271 #>>44366577 #
    11. layer8 ◴[] No.44365038{3}[source]
    It's not the font that is deciding how emoji sequences are rendered. The renderer may decide based on which characters exist in the available fonts, but it doesn't have to. Same for glyph width in terminals. It wasn’t uncommon for non-double-width-aware terminals to only draw half an emoji in a regular-width cell.
    replies(2): >>44365376 #>>44365586 #
    12. zarzavat ◴[] No.44365376{4}[source]
    How else are you going to render a sequence such as Emoji ZWJ Emoji other than as two glyphs, if no composed glyph is defined in the user's font? That's how it's supposed to be rendered, for backwards compatibility.
    13. kccqzy ◴[] No.44365586{4}[source]
    > It wasn’t uncommon for non-double-width-aware terminals to only draw half an emoji in a regular-width cell.

    And you are just describing bugs. This is not just an emoji issue: it will also fail to render CJK characters.

    14. Joker_vD ◴[] No.44366271{3}[source]
    > An emoji can render as one or two or three or more glyphs depending on what font the user has installed,

    And how should the program that prints such emojis deal with this? Like, how should e.g. readline handle the user pressing the Backspace key after inputting such an emoji after being prompted for input? It needs to know precisely how many lines and columns the user's input takes: a huge chunk of code in that library is devoted precisely to this, because simply emitting "\b \b" doesn't work.

    And if the user opens the terminal emulator's settings and changes the font, should the program be sent some signal to redraw the window, as happens when the window size changes? E.g. that emoji was in a 10-column-wide edit field and so the characters after it fit when that emoji was 1 column wide, but now it's 2 columns wide, so ncurses should now trim the last character in that field.

    Or try this funny little experiment, for instance: resize your terminal to something like 30 cols by 5 rows and run "script -c bash temrinal_log.txt". Now hold "a" key until you enter enough "a"s that the shell prompt is no longer visible. Now hold Backspace until you've erased all "a" and cursor no longer moves. What do you see on the screen? Now press Ctrl-D to exit the "script" session, and study the transcript in temrinal_log.txt in a hex editor. Ponder on the mechanisms that bash (readline inside it, really) uses to implement line-editing.

    15. mminer237 ◴[] No.44366577{3}[source]
    Some monospace fonts do fit ﷽ into one normal-sized character by squeezing it into three mini lines.