Most of it involves exploiting data structure properties (and limits) with Zig comptime to derive functions that either compute offsets relative to existing pointers or, when a relative offset isn't possible, use pre-computed offset tables. That reduces function size further without giving up the ability to take full advantage of SIMD.
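To make the offset idea concrete, here is a minimal sketch of how such a table could be derived at compile time. Everything in it is an assumption for illustration: the component types, the `capacity` limit, and the names `offsets`, `pos_to_vel` and `byteOffset` are hypothetical, and the syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Hypothetical components and a hard, compile-time entity limit.
const Position = struct { x: f32, y: f32 };
const Velocity = struct { dx: f32, dy: f32 };
const Health = struct { hp: u16 };

const capacity = 1024;
const components = [_]type{ Position, Velocity, Health };

// Byte offset of each component column inside one contiguous block,
// plus the total block size in the last slot. Container-level constants
// are evaluated at compile time, so this loop never runs at runtime.
const offsets = blk: {
    var table: [components.len + 1]usize = undefined;
    var cursor: usize = 0;
    for (components, 0..) |C, i| {
        const a = @alignOf(C);
        cursor = (cursor + a - 1) / a * a; // round up to the column's alignment
        table[i] = cursor;
        cursor += @sizeOf(C) * capacity;
    }
    table[components.len] = cursor;
    break :blk table;
};

// The "relative" case: Position and Velocity have the same size here, so the
// Velocity of an entity sits a constant distance from its Position.
const pos_to_vel = offsets[1] - offsets[0];

// The "table" case: columns with different element sizes need the entity index.
fn byteOffset(comptime column: usize, entity: usize) usize {
    std.debug.assert(entity < capacity);
    return offsets[column] + entity * @sizeOf(components[column]);
}

pub fn main() void {
    std.debug.print("columns at {any}, pos->vel delta {d}\n", .{ offsets, pos_to_vel });
    std.debug.print("Health of entity 3 at byte {d}\n", .{byteOffset(2, 3)});
}
```

Because each column is a dense, aligned array of one component type, a loop over it stays friendly to SIMD, and every offset the loop needs is a constant baked into the call site.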
One of the next tasks is statically computing update graphs for archetypes so that a multi-threaded runtime can mix strategies to speed up the world update loop on larger targets while remaining lock-free. For example, nodes that require all of their dependencies to be complete carry an atomic counter, and the last thread to reach such a node broadcasts the new work it unblocks; starved threads steal work from others, etc.
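The "last thread to reach a node" rule can be sketched roughly like this. It is only an illustration under assumptions: the graph, `edges`, `initial_pending`, `resetFrame` and `complete` are hypothetical names, the demo runs on a single thread, the work queues and stealing are left out, and it assumes Zig 0.12 or later for `std.atomic.Value` and the lowercase memory orderings.

```zig
const std = @import("std");

// Hypothetical update graph: node 2 needs 0 and 1, node 3 needs 2.
const Edge = struct { from: usize, to: usize };
const node_count = 4;
const edges = [_]Edge{
    .{ .from = 0, .to = 2 },
    .{ .from = 1, .to = 2 },
    .{ .from = 2, .to = 3 },
};

// Number of dependencies per node, derived at compile time from `edges`.
const initial_pending = blk: {
    var counts = [_]u32{0} ** node_count;
    for (edges) |e| counts[e.to] += 1;
    break :blk counts;
};

var pending: [node_count]std.atomic.Value(u32) = undefined;

// Re-arm every counter at the start of a world update.
fn resetFrame() void {
    for (&pending, initial_pending) |*p, n| p.* = std.atomic.Value(u32).init(n);
}

// Called by whichever thread finishes `node`. Writes the nodes this call
// unblocked into `out` and returns how many there are.
fn complete(node: usize, out: *[node_count]usize) usize {
    var n: usize = 0;
    for (edges) |e| {
        if (e.from != node) continue;
        // fetchSub returns the previous value, so seeing 1 means this thread
        // delivered the last outstanding dependency and owns the hand-off.
        if (pending[e.to].fetchSub(1, .acq_rel) == 1) {
            out[n] = e.to;
            n += 1;
        }
    }
    return n;
}

pub fn main() void {
    resetFrame();
    var ready: [node_count]usize = undefined;
    std.debug.print("after node 0: {d} unblocked\n", .{complete(0, &ready)});
    std.debug.print("after node 1: {d} unblocked\n", .{complete(1, &ready)});
    std.debug.print("after node 2: {d} unblocked\n", .{complete(2, &ready)});
}
```

Exactly one thread observes the counter dropping from 1 to 0, so no lock is needed to decide who publishes the newly runnable work.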
It's fun to explore how far one can go by statically declaring all limits upfront and managing even larger targets (Steam Deck, servers) as if they were embedded applications.