←back to thread

239 points ivankra | 1 comments | | HN request time: 0s | source
Show context
bcardarella ◴[] No.45945551[source]
Just a small comparison, compiled for release:

Boa: 23M Brimstone: 6.3M

I don't know if closing the gap on features with Boa and hardening for production use will also bloat the compilation size. Regardless, for passing 97% of the spec at this size is pretty impressive.

replies(3): >>45945705 #>>45945748 #>>45950091 #
jerf ◴[] No.45945748[source]
It looks like Boa has Unicode tables compiled inside of itself: https://github.com/boa-dev/boa/tree/main/core/icu_provider

Brimstone does not appear to.

That covers the vast bulk of the difference. The ICU data is about 10.7MB in the source (boa/core/icu_provider) and may grow or shrink by some amount in the compiling.

I'm not saying it's all the difference, just the bulk.

There's a few reasons why svelte little executables with small library backings aren't possible anymore, and it isn't just ambient undefined "bloat". Unicode is a big one. Correct handling of unicode involves megabytes of tables and data that have to live somewhere, whether it's a linked library, compiled in, tables on disks, whatever. If a program touches text and it needs to handle it correctly rather than just passing it through, there's a minimum size for that now.

replies(6): >>45945844 #>>45945976 #>>45946210 #>>45947165 #>>45947379 #>>45951653 #
rixed ◴[] No.45945976[source]
I was currious to see what that data consisted of and aparently that's a lot of translations, like the name of all possible calendar formats in all possible languages, etc. This seems useless in the vast majority of use cases, including that of a JS interpreter. Looks to me like the typical output of a comitee that's looking too hard to extend its domain.

Disclaimer: I never liked unicode specs.

replies(1): >>45946669 #
necovek ◴[] No.45946669[source]
Unicode is an attempt to encode the world's languages: there is not much to like or dislike about it, it only represents the reality. Sure, it has a number of weird details, butnif anything, it's due to the desire to simplify it (like Han unification or normal forms).

Any language runtime wanting to provide date/time and string parsing functions needs access to the Unicode database (or something of comparable complexity and size).

Saying "I don't like Unicode" is like saying "I don't like the linguistic diversity in the world": I mean sure, OK, but it's still there and it exists.

Though note that date-time, currency, number, street etc. formatting is not "Unicode" even if provided by ICU: this is similarly defined by POSIX as "locales", anf GNU libc probably has the richest collection of locales outside of ICU.

There are also many non-Unicode collation tables (think phonebook ordering that's different for each country and language): so no good sort() without those either.

replies(5): >>45947061 #>>45947293 #>>45947427 #>>45952195 #>>45952348 #
xeonmc ◴[] No.45947293[source]
Does that include emojis?
replies(1): >>45947489 #
jcranmer ◴[] No.45947489[source]
Emojis are complicated from a font rendering perspective. But from a string processing perspective, they're generally going to be among the simplest characters: they don't have a lot of complex properties with a lot of variation between individual characters. Compare something like the basic Latin characters, where the mappings for precomposed characters are going to vary wildly from 'a' to 'b' to 'c', etc., whereas the list of precomposed characters for the emoji blocks amounts to "none."
replies(2): >>45947902 #>>45950667 #
nine_k ◴[] No.45950667{5}[source]
Sorry, what? You mean, emoji composition rules are simpler than combining diacritics? https://blog.codepoints.net/emojis-under-the-hood.html
replies(1): >>45951489 #
1. necovek ◴[] No.45951489{6}[source]
I was unaware of this: thanks for pointing it out!