←back to thread

268 points aapoalas | 1 comments | | HN request time: 0.224s | source

We're building a different kind of JavaScript engine, based on data-oriented design and willingness to try something quite out of left field. This is most concretely visible in our major architectural choices:

1. All data allocated on the JavaScript heap is placed into a type-specific vector. Numbers go into the numbers vector, strings into the strings vector, and so on.

2. All heap references are type-discriminated indexes: A heap number is identified by its discriminant value and the index to which it points to in the numbers vector.

3. Objects are also split up into object kind -specific vectors. Ordinary objects go into one vector, Arrays go into another, DataViews into yet another, and so on.

4. Unordinary objects' heap data does not contain ordinary object data but instead they contain an optional index to the ordinary objects vector.

5. Objects are aggressively split into parts to avoid common use-cases having to reading parts that are known to be unused.

If this sounds interesting, I've written a few blog posts on the internals of Nova over in our blog, you can jump into that here: https://trynova.dev/blog/what-is-the-nova-javascript-engine

Show context
ianbicking ◴[] No.42174161[source]
I'm coming at this with no real JavaScript implementation knowledge, so these comments might not even make sense...

The data sorting seems quite cleanly at first, but as I think more about it I don't quite get it. I guess you are saving a bit of space by segmenting by type... in another approach you might have the type on the pointer, and the pointer can point to anything, and so it's potentially a bit longer than having a type and pointer(/index) that points into a smaller portion of memory specific to that type. But enough to matter?

"No, pointers we do not want and cannot have, so the only real option is to use indexes. Indexes have a lot of benefits: They are small, work exceedingly well together with our heap vectors, enable using the same value to index into multiple heap vectors (or slices of the same heap vector), perform a form of pointer compression automatically, and offer great protection from safety vulnerabilities as reinterpreting an index as a different type changes both the type and the memory it indexes into."

That all just sounds like a pointer to me? The last case also seems like a security hole, not protection.

"Not all objects are the same: They differ in their usage and their capabilities. An object-oriented reading of JavaScript objects' capabilities and the ECMAScript specification would give you a clear and simple inheritance graph where the ordinary object is the base object class, and Arrays, DataViews, Maps, and others inherit from that. Not all objects are the same: They differ in their usage and their capabilities. An object-oriented reading of JavaScript objects' capabilities and the ECMAScript specification would give you a clear and simple inheritance graph where the ordinary object is the base object class, and Arrays, DataViews, Maps, and others inherit from that."

It seems like you are special-casing a specific set of object types (like Array), which is very justifiable. So sure.

"This is somewhat more of an aim for the future instead of current reality, but allow me to give some easy examples: The ArrayBuffer object in ECMAScript supports allocating up to 2^53 bytes of data. Most engines only allow a tad bit over 2^32 bytes but nevertheless, the fact of the matter is that you need more than 4 bytes to store that byte value. As a result, ArrayBuffer itself but also DataView and all the various TypedArray variants like Uint8Array must carry within them 8 byte data fields for byte offset, byte length, and even array length. Now ask yourself, how often do you deal with ArrayBuffers larger than 4 GiB? Not very often, obviously."

I'm guessing this is leading to a decision many languages have made about numbers and strings, where there's special types for small numbers and short strings (exposed only in the implementation). Or even more special types, where the pointers become values.

Also I can see a benefit to keeping track of "normal" Arrays and whatnot, so some of JavaScripts weird-but-not-usually-used behavior can be isolated, and normal behavior fast-tracked.

"In Nova we aim to split objects into parts to ensure that computationally unconnected parts are also stored separately in memory"

But this I don't get. If you are splitting things by type, how can you cluster them by how they are related? An object like {a: 1, b: 2} is an object with two strings and two numbers, presumably spread out over three different type-specific heaps?

replies(1): >>42175066 #
1. aapoalas ◴[] No.42175066[source]
Hey, thank you for the comment! I'll try to answer as best I can.

A pointer is 64 bits, though carrying much less useful payload than that. A JavaScript engine only rarely deals with more than 4 GiB of memory, so a 32 bit integer would be enough to index the entire memory needed. If you turn that though into indexes, a 32 bit index can speak of 4 billion separate items: Most programs never have that many distinct heap items alive at the same time. Note that this index doesn't now really correspond to indexable memory so we're no longer bound by the 4 GiB limit.

We actually do keep the 64 bit Value though! We just use the massive amounts of data to store a lot of data on the stack, avoiding heap allocations altogether.

> That just sounds like a pointer.

A pointer points to one place and one place only: An index can points to as many places as there are "parallel vectors" associated with it. eg. Think of a table: A row index refers to as many cells as there are columns, whereas a cell pointer only identifies one cell.

> The last case also seems like a security hole, not protection.

Usually JS engines don't consider the JS-accessible contents of the JS heap itself part of the threat model: Any object in the heap is liable to be leaked by the JS code running in the engine anyway. eg. V8's object placement is fairly static and easy to exploit. The important thing for safety is to avoid type confusion which can be used to create read/write primitives to punch out of the sandbox. So; an attacker can freely read through the heap data by creating heap indexes out of thin air but they cannot use that to reinterpret one type of data as another type and then feed that back to the engine to cause it to misbehave.

> But this I don't get. If you are splitting things by type, how can you cluster them by how they are related? An object like {a: 1, b: 2} is an object with two strings and two numbers, presumably spread out over three different type-specific heaps?

Yes, this would split into the ordinary object vector, and the object property vector. If the keys were longer they'd end up in the strings vector and if the values were heap allocated doubles then they'd end up in yet another vector. Looking at it one thing at a time, it is split here and there.

That being said, this doesn't really much change from how traditional engines do it: Strings are not going to be near the objects that use them as keys, nor are heap numbers, and (added) properties also go into a separate backing store which is likely not next to the object. Worst of all, even if all of these were next to the object, they'd span multiple cache lines and wouldn't really benefit from being close to each other as they're pointer chased and thus wouldn't get much guarantees of prefetching.

When you look at multiple objects, however, then you'll see that Nova's object data is still found in those 4 vectors, whereas the traditional engine design... It may have tried it's best to keep the data together but it's probably still spread out here and there. And you're loading all unnecessary stuff like the elements pointer (for indexed properties) and any other inline properties etc. together with the properties that you actually wanted to read.

Sorry, this ended up a bit disjointed. Let me know if you have more questions! Thanks.