239 points ivankra | 18 comments
bcardarella ◴[] No.45945551[source]
Just a small comparison, compiled for release:

Boa: 23M
Brimstone: 6.3M

I don't know if closing the feature gap with Boa and hardening for production use will also bloat the binary size. Regardless, passing 97% of the spec at this size is pretty impressive.

replies(3): >>45945705 #>>45945748 #>>45950091 #
jerf ◴[] No.45945748[source]
It looks like Boa has Unicode tables compiled inside of itself: https://github.com/boa-dev/boa/tree/main/core/icu_provider

Brimstone does not appear to.

That covers the vast bulk of the difference. The ICU data is about 10.7MB in the source (boa/core/icu_provider) and may grow or shrink somewhat when compiled.

I'm not saying it's all the difference, just the bulk.

There are a few reasons why svelte little executables with small library backings aren't possible anymore, and it isn't just ambient undefined "bloat". Unicode is a big one. Correct handling of Unicode involves megabytes of tables and data that have to live somewhere, whether it's a linked library, compiled in, tables on disk, whatever. If a program touches text and needs to handle it correctly rather than just passing it through, there's a minimum size for that now.
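
You can see those tables at work from any language runtime that bundles them. A small sketch, assuming a Node.js build with full ICU (small-icu builds ship English-only data):

```javascript
// Locale-aware month names come out of the ICU/CLDR tables, not code.
const d = new Date(Date.UTC(2024, 0, 15));
const opts = { month: "long", timeZone: "UTC" };

const en = new Intl.DateTimeFormat("en-US", opts).format(d);
const de = new Intl.DateTimeFormat("de-DE", opts).format(d);

console.log(en); // "January"
console.log(de); // "Januar"
```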

replies(6): >>45945844 #>>45945976 #>>45946210 #>>45947165 #>>45947379 #>>45951653 #
rixed ◴[] No.45945976[source]
I was curious to see what that data consisted of, and apparently it's a lot of translations, like the names of all possible calendar formats in all possible languages, etc. This seems useless in the vast majority of use cases, including that of a JS interpreter. Looks to me like the typical output of a committee that's trying too hard to extend its domain.

Disclaimer: I never liked unicode specs.

replies(1): >>45946669 #
1. necovek ◴[] No.45946669[source]
Unicode is an attempt to encode the world's languages: there is not much to like or dislike about it, it only represents reality. Sure, it has a number of weird details, but if anything, that's due to the desire to simplify it (like Han unification or normal forms).

Any language runtime wanting to provide date/time and string parsing functions needs access to the Unicode database (or something of comparable complexity and size).

Saying "I don't like Unicode" is like saying "I don't like the linguistic diversity in the world": I mean sure, OK, but it's still there and it exists.

Though note that date-time, currency, number, street-address, etc. formatting is not "Unicode" even if provided by ICU: this is similarly defined by POSIX as "locales", and GNU libc probably has the richest collection of locales outside of ICU.
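
For instance, even plain number formatting differs per locale. A sketch, again assuming Node.js with full ICU:

```javascript
// Grouping and decimal separators are locale data (CLDR), not
// Unicode character properties.
const n = 1234567.89;
const us = new Intl.NumberFormat("en-US").format(n);
const deNum = new Intl.NumberFormat("de-DE").format(n);

console.log(us);    // "1,234,567.89"
console.log(deNum); // "1.234.567,89"
```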

There are also many non-Unicode collation tables (think phonebook ordering that's different for each country and language): so no good sort() without those either.
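
To illustrate, the same letters sort differently depending on which collation tables apply (sketch, assuming Node.js with full ICU):

```javascript
// German treats "ä" as a variant of "a"; the Swedish alphabet
// puts it after "z".
const words = ["z", "ä", "a"];
const sortedDe = [...words].sort(new Intl.Collator("de").compare);
const sortedSv = [...words].sort(new Intl.Collator("sv").compare);

console.log(sortedDe); // [ "a", "ä", "z" ]
console.log(sortedSv); // [ "a", "z", "ä" ]
```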

replies(5): >>45947061 #>>45947293 #>>45947427 #>>45952195 #>>45952348 #
2. ◴[] No.45947061[source]
3. xeonmc ◴[] No.45947293[source]
Does that include emojis?
replies(1): >>45947489 #
4. jcranmer ◴[] No.45947489[source]
Emojis are complicated from a font rendering perspective. But from a string processing perspective, they're generally going to be among the simplest characters: they don't have a lot of complex properties with a lot of variation between individual characters. Compare something like the basic Latin characters, where the mappings for precomposed characters are going to vary wildly from 'a' to 'b' to 'c', etc., whereas the list of precomposed characters for the emoji blocks amounts to "none."
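
The difference in decomposition mappings is easy to check directly (String.prototype.normalize is standard ECMAScript):

```javascript
// "é" carries a canonical decomposition; an emoji carries none.
const eAcute = "\u00e9"; // precomposed "é"
console.log(eAcute.normalize("NFD").length); // 2: "e" + combining acute
console.log("😀".normalize("NFD") === "😀"); // true: nothing to decompose
```
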
replies(2): >>45947902 #>>45950667 #
5. necovek ◴[] No.45947902{3}[source]
Agreed!

FWIW, they are not even "complicated" from a font rendering perspective: they're simple non-combining characters and they are probably never used in ligatures either (though nothing really stops you; just like you can have locale-specific variants with locl tables). It's basically "draw whatever is in a font at this codepoint".

Yes, if you want to look them up by their Unicode names, you need to have them in the database, and there are many of them, so a font needs to cover them all. But really, they are among the simplest characters Unicode could have.

replies(3): >>45948028 #>>45948245 #>>45951571 #
6. nicoburns ◴[] No.45948028{4}[source]
"draw whatever is in a font at this codepoint" is doing quite a lot of work there. Some emoji fonts just embed a PNG which is easy. But COLRv1 fonts define an entire vector graphics imaging model which is similar what you need to render an SVG.
replies(1): >>45948157 #
7. kibwen ◴[] No.45948157{5}[source]
Yes, but at this point we're completely outside the scope of Unicode, which has nothing to do with how anything actually gets drawn to the screen.
8. overfeed ◴[] No.45948245{4}[source]
> they're simple non-combining characters

Skin-tone emojis are combined characters: base emoji + tone.

replies(1): >>45951462 #
9. nine_k ◴[] No.45950667{3}[source]
Sorry, what? You mean, emoji composition rules are simpler than combining diacritics? https://blog.codepoints.net/emojis-under-the-hood.html
replies(1): >>45951489 #
10. necovek ◴[] No.45951462{5}[source]
TIL, thanks for pointing it out.
11. necovek ◴[] No.45951489{4}[source]
I was unaware of this: thanks for pointing it out!
12. SkiFire13 ◴[] No.45951571{4}[source]
To add to the skintone emojis example, country flags emojis are combined characters using two letter characters corresponding to the country code. The various "family" emojis are also combined characters of individual person emojis, and so on.
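
A quick illustration of the flag case (assuming Node.js 16+ for Intl.Segmenter):

```javascript
// A flag is two regional-indicator code points; grapheme segmentation
// (driven by Unicode data) recovers "one visible character".
const flag = "🇺🇸"; // U+1F1FA U+1F1F8
console.log(flag.length); // 4 UTF-16 code units

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(flag)].length); // 1 grapheme cluster
```
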
13. rixed ◴[] No.45952195[source]
I am not questioning the goal of representing all the fine details of every possible language, currency, and calendar in use anywhere at any time in the universe; that's a respectable achievement. I'm discussing the process that led to a programming language interpreter needing, according to the comment I was replying to, to embed that trove of data.

Most of us are not using computers to represent subtle variants of those cultural artifacts, and therefore they should be left to specialized libraries.

Computers are symbolic machines, after all, and many times we would do just as well using only 16 symbols and typing our code on a keyboard with just that many keys. We can't have anything but 64-bit floats in JS, but somehow we absolutely need to be able to distinguish between the "peso lourd argentin (1970–1983)" and the "peso argentin (1881–1970)"? And to know that to display a chemical concentration in millimoles per liter in German one has to write "mmol/l"?

I get it: the symbolic machines need to communicate with humans, who use natural languages written in all kinds of ways, so it's very nice to have a good way to input and output text. We wanted that way to not favor any particular culture, and I can understand that. But how you get from there to the amount of arcane, minutely specialized detail in the ICU dataset is questionable.

replies(1): >>45976775 #
14. animuchan ◴[] No.45952348[source]
> Saying "I don't like Unicode" is like saying "I don't like the linguistic diversity in the world": I mean sure, OK, but it's still there and it exists.

Respectfully disagree: linguistic diversity isn't, by definition, impossible to build a good abstraction on top of. I think this is more a failure of this particular attempt.

replies(1): >>45952779 #
15. necovek ◴[] No.45952779[source]
Care to point out a -- by your definition -- successful attempt to do it?
16. necovek ◴[] No.45976775[source]
You bring up numbers, but you ignore strings, another fundamental data type in all programming languages.

Without this trove of data, you can't do something as simple as length(str) or uppercase(str) — even in a CLI if you want to line text up.
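
Concretely (a sketch, assuming Node.js with full ICU):

```javascript
// Case mapping and "length" both depend on Unicode tables.
console.log("ß".toUpperCase());           // "SS": one character becomes two
console.log("i".toLocaleUpperCase("tr")); // "İ": Turkish dotted capital I
console.log("👍".length);                 // 2 code units, 1 visible character
```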

So yes, this database has a big chunk of rarely useful data like you mention. But the majority of it is still generally useful.

replies(1): >>45977089 #
17. rixed ◴[] No.45977089{3}[source]
I may be wrong, but a cursory look at the data gave me the impression that the majority of it is not actually related to commonplace string manipulation. Other than that, we probably agree.
replies(1): >>45985428 #
18. necovek ◴[] No.45985428{4}[source]
The big ones that are often ignored are collation tables: while there's a default in ISO 10646 IIRC, each region-language combo might have its specific overrides (imagine "ss" being sorted as a separate letter in German, not between "sr" and "st" as usual, so it would go sa..., sb..., sr..., st..., ssa..., ssb..., etc.); and then Austrian German might have a different phonebook ordering.