Boa: 23M Brimstone: 6.3M
I don't know whether closing the feature gap with Boa and hardening for production use will also bloat the compiled size. Regardless, passing 97% of the spec at this size is pretty impressive.
Brimstone does not appear to include the ICU data.
That covers the vast bulk of the difference. The ICU data is about 10.7MB in the source (boa/core/icu_provider) and may grow or shrink somewhat when compiled.
I'm not saying it's all the difference, just the bulk.
There are a few reasons why svelte little executables with small library backings aren't possible anymore, and it isn't just vague, ambient "bloat". Unicode is a big one. Handling Unicode correctly involves megabytes of tables and data that have to live somewhere, whether that's a linked library, data compiled in, or tables on disk. If a program touches text and needs to handle it correctly rather than just pass it through, there's a minimum size for that now.
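To make that concrete, here's a minimal TypeScript sketch (the behavior shown is standard ECMAScript, runnable in Node or Deno): even "simple" case mapping is table-driven and locale-sensitive, which is why those tables have to ship somewhere.

    // Case mapping is table-driven: it can change string length
    // and depends on locale-specific rules.
    console.log("straße".toUpperCase());
    // "STRASSE" -> the German sharp s expands to two characters

    console.log("İstanbul".toLowerCase());
    // "i̇stanbul" -> default rules: dotted capital I becomes i + combining dot
    console.log("İstanbul".toLocaleLowerCase("tr"));
    // "istanbul" -> Turkish tailoring maps İ to a plain i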
Disclaimer: I never liked the Unicode specs.
Any language runtime wanting to provide date/time and string parsing functions needs access to the Unicode database (or something of comparable complexity and size).
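For instance, a quick sketch using the standard Intl.DateTimeFormat (exact output may vary slightly with the runtime's CLDR version):

    // Month and weekday names, ordering, and punctuation all come
    // from locale data shipped with the runtime.
    const d = new Date(Date.UTC(2024, 0, 15));

    console.log(new Intl.DateTimeFormat("en-US",
        { dateStyle: "full", timeZone: "UTC" }).format(d));
    // "Monday, January 15, 2024"

    console.log(new Intl.DateTimeFormat("de-DE",
        { dateStyle: "full", timeZone: "UTC" }).format(d));
    // "Montag, 15. Januar 2024"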
Saying "I don't like Unicode" is like saying "I don't like the linguistic diversity in the world": I mean sure, OK, but it's still there and it exists.
Though note that date-time, currency, number, street-address, etc. formatting is not "Unicode" even if provided by ICU: this is similarly defined by POSIX as "locales", and GNU libc probably has the richest collection of locales outside of ICU.
There are also many non-Unicode collation tables (think phonebook ordering, which differs per country and language): no good sort() without those either (see the sketch below).
Unfortunately, for a long time POSIX systems were uncommon on desktops, and most Unices do not provide a clean way to extend the locale set from userland (though I believe GNU libc does).
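A small sketch of the collation point above, using the standard Intl.Collator (ECMA-402):

    // The same three words sort differently per locale: code-point
    // order alone can't produce either result.
    const words = ["zebra", "ärger", "apfel"];

    console.log([...words].sort(new Intl.Collator("de").compare));
    // ["apfel", "ärger", "zebra"] -> German treats ä like a
    console.log([...words].sort(new Intl.Collator("sv").compare));
    // ["apfel", "zebra", "ärger"] -> Swedish sorts ä after z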
Brimstone does try to use the minimal set of Unicode data needed for the language itself. But I imagine much of the difference with Boa is because of Boa's support for the ECMA-402 Internationalization API (https://tc39.es/ecma402/).
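That split is visible in the language itself: core ECMA-262 already needs some Unicode tables (e.g. for regex property escapes), while ECMA-402 layers full CLDR locale data on top. A quick sketch:

    // Core language: \p{...} escapes are defined against the
    // Unicode Character Database, so those tables must ship.
    console.log(/\p{Script=Greek}/u.test("π")); // true
    console.log(/\p{Letter}/u.test("字"));      // true

    // ECMA-402: locale-aware formatting needs CLDR data on top.
    console.log(new Intl.NumberFormat("de-DE").format(1234567.89));
    // "1.234.567,89"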
FWIW, they are not even "complicated" from a font-rendering perspective: they're simple non-combining characters, and they are probably never used in ligatures either (though nothing really stops you, just as you can have locale-specific variants via locl tables). It's basically "draw whatever is in the font at this code point".
Yes, if you want to refer to them by their Unicode names you need them in the database, and there are many of them, so a font needs to cover them all. But really, they are the simplest characters Unicode could have.
This is something I notice in small few-person or one-person projects: they don't have the resources to build complex architectures, so the code ends up smaller, cleaner, and easier to maintain.
The other way to look at it is that cooperation has an overhead.[0]
[0]: The famous 80:20 rule. Or the related claim that each additional 9 of reliability (and presumably of other qualities) takes the same amount of work.
I have no issue with my system using an extra 10 MB so that Ancient Egyptian capitalization works correctly. Having every single program bundle those rules would be a lot more wasteful.
Most of us are not using computers to represent subtle variants of those cultural artifacts, so they should be left to specialized libraries.
Computers are symbolic machines, after all, and much of the time we would do just as well with only 16 symbols and a keyboard with that many keys. We can't have anything but 64-bit floats in JS, but somehow we absolutely need to distinguish between the "Argentine peso ley (1970–1983)" and the "Argentine peso (1881–1970)"? And to know that displaying a chemical concentration in millimoles per liter in German requires writing "mmol/l"?
I get it: the symbolic machines need to communicate with humans, who use natural languages written in all kinds of ways, so it's very nice to have a good way to input and output text. We wanted that not to favor any particular culture, and I can understand that. But how you get from there to the amount of arcane, specialized minutiae in the ICU dataset is questionable.
Respectfully disagree: linguistic diversity isn't by definition impossible to build a good abstraction on top of; I think this is more a failure of this particular attempt.
Without this trove of data you can't correctly do something as simple as length(str) or uppercase(str), even in a CLI where you just want to line text up (see the sketch below).
So yes, this database has a big chunk of rarely useful data, as you mention. But the majority of it is still generally useful.
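For example, counting what a user perceives as one character takes grapheme segmentation, which needs exactly this kind of data. A sketch, assuming a runtime that implements Intl.Segmenter (e.g. Node 16+):

    // A family emoji is three code points joined by zero-width
    // joiners; a naive length() can't line this up in a column.
    const family = "👩‍👩‍👧";

    console.log(family.length); // 8 -> UTF-16 code units
    const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
    console.log([...seg.segment(family)].length); // 1 -> what the user sees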