The biggest performance impact came from moving to dot products.
Regrettably, the sheer size of the embeddings index made it impractical to achieve the desired results.
1. https://gist.github.com/schappim/d4a6f0b29c1ef4279543f6b6740...
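For anyone curious, the dot-product trick is roughly this (a simplified sketch, not the gist's exact code): L2-normalize every embedding once when building the index, and each comparison then collapses to a single dot product.

```ts
// Simplified sketch (not the gist's exact code): normalize once at index
// time, then cosine similarity per comparison is just a dot product.
function normalize(v: number[]): number[] {
  const mag = Math.hypot(...v); // sqrt of the sum of squares
  return v.map((x) => x / mag);
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// For unit-length vectors, the dot product equals the cosine similarity.
const index = [[1, 2, 3], [2, 3, 4]].map(normalize);
const query = normalize([3, 2, 1]);
const scores = index.map((vec) => dot(query, vec));
console.log(scores);
```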
This whole thing looks like unreviewed AI. Stylish but fundamentally broken. I haven't had a chance to dig into the meat of the article yet, but unfortunately this is distracting enough that I'm not sure I will.
Edit: I'm digging into the meat, and it's better! Fortunately, it appears accurate. Unfortunately, it's rather repetitive. There are two paragraphs discussing the meaning of -1, 0, and +1 interleaved with multiple paragraphs explaining how cosine similarity allows vectors to be compared regardless of magnitude. The motivation is spread throughout the whole thing and gets repetitive, and the real-world examples seem similar, though formatted just differently enough that it's hard to tell at a glance.
To try to offer suggestions instead of just complaining... here are my recommended edits:

* Move the simple English explanation to the top, right after the intro, then remove everything but the diagrams until you reach the example.
* Completely remove the explanation of vectors unless you're going to include an explanation of dot products.
* I really like the functional approach, but I feel like you could combine it with the `Math.hypot` example (leave the full formula as a comment, the rest is identical) and with the full example (which is missing the `Math.hypot` optimization); there's a sketch of what I mean below.
* I feel like you could get away with just one real web programming example, though I don't know which one I'd choose.
* The last section about using OpenAI for embeddings and its disclaimer is already great!
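Roughly what I have in mind for that combined `Math.hypot` version (my own sketch, not code from the article):

```ts
// My own sketch of the combined functional version (not the article's actual code).
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

// Magnitude: Math.sqrt(a[0] ** 2 + a[1] ** 2 + ... + a[n - 1] ** 2)
const magnitude = (a: number[]): number => Math.hypot(...a);

const cosineSimilarity = (a: number[], b: number[]): number =>
  dot(a, b) / (magnitude(a) * magnitude(b));
```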
Similarly, if you plan to query those vectors in search, you should consider contiguous `TypedArray` storage and smaller scalars than the double-precision `number`.
I know very little about JS, but some of the amazing HackerNews community members have previously helped port SimSIMD to JavaScript (https://github.com/ashvardanian/SimSIMD), and I wrote a blog post covering some of those JS/TS-specifics, NumJS, and MathJS in 2023 (https://ashvardanian.com/posts/javascript-ai-vector-search/).
Hopefully, it should help unlock another 10-100x in performance.
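Even before reaching for SimSIMD, just switching the storage to a contiguous typed array looks something like this (a plain-TS sketch, not SimSIMD's accelerated kernels):

```ts
// Sketch: contiguous Float32Array storage instead of number[] halves the
// memory of float64 embeddings and keeps the data cache-friendly to scan.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / Math.sqrt(magA * magB);
}

const a = Float32Array.from([0.1, 0.2, 0.3]);
const b = Float32Array.from([0.2, 0.1, 0.4]);
console.log(cosineSimilarity(a, b));
```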
It looks super great now. What you have here leaves an entirely different impression, and a stylish one!
Two last suggestions:
* Now I'm thinking the "Why Cosine Similarity Matters for Modern Web Development" section belongs at the top, right after your intro.
* The angle indicator is still a bit wonky in the diagram. I might even take the direction-only mode out entirely since, as you point out, cosine similarity is invariant to changes in magnitude.
While doing that, the next logical step is precalculating the squared magnitude of each item; in my small test case of <1000 items that sped the calculation up almost 300 times. The gains are not asymptotic, but shaving that constant off every pair considered is not insignificant, especially on small data sets (of course, with large sets a table won't cut it due to memory limitations, and more sophisticated techniques are required).
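Roughly what I mean, as a toy sketch (the `Item` shape here is made up):

```ts
// Toy sketch: with all-pairs comparisons, compute each item's squared
// magnitude once (n values) instead of recomputing it for every pair (n^2).
type Item = { id: string; vec: number[] };

const squaredMagnitude = (v: number[]): number =>
  v.reduce((sum, x) => sum + x * x, 0);

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

function allPairsSimilarity(items: Item[]) {
  const magSq = items.map((it) => squaredMagnitude(it.vec)); // the lookup table
  const results: { a: string; b: string; score: number }[] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score =
        dot(items[i].vec, items[j].vec) / Math.sqrt(magSq[i] * magSq[j]);
      results.push({ a: items[i].id, b: items[j].id, score });
    }
  }
  return results;
}
```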
I went to the typescript tag and tried to read a few other articles and got 404 errors. Just wanted to let ya know.
Nice blog and good work!
I'm reminded of that old Einstein quote: "The most incomprehensible thing about the universe is that it is comprehensible."
It's much more interesting that almost any set of ML-adjacent vectors can be somewhat reasonably compared via cosine distance _even without_ explicitly constructing an optimal embedding. It's not at all intuitive to me that an autoencoder's interior layer should behave well with respect to cosine similarity and not have any knots or anything warping that (associated) metric's usefulness.
(a @ b.T)/(np.linalg.norm(a)*np.linalg.norm(b))
If you really wanted to be pretentious you could invent a Dimension Insensitive Euclidean Metric and call it DIEM to make it sound like you are putting some Latin into your papers.
I wonder, though: isn't classic TF-IDF scoring also a type of basic vector comparison? What is the actual difference between a "vector database" and something like Elasticsearch?
Tbh, I would argue that's also pretty surprising, as Euclidean distance is notoriously unintuitive[1] (and noisy) in higher dimensions. (I guess norming does help, so that's likely a good point.)
And I think today it is preferred to call it "similarity" instead of "distance" because it is not a true distance function (15 years ago we called it "cosine distance").
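Concretely, the usual "cosine distance" (one minus the similarity) breaks the triangle inequality, which is why it isn't a metric; a quick counterexample in the plane:

```latex
% "Cosine distance" d(x, y) = 1 - \cos\theta_{xy} fails the triangle inequality:
a = (1, 0), \qquad b = \tfrac{1}{\sqrt{2}}(1, 1), \qquad c = (0, 1) \\
d(a, b) = d(b, c) = 1 - \tfrac{1}{\sqrt{2}} \approx 0.293, \qquad d(a, c) = 1 - 0 = 1 \\
d(a, c) = 1 > 0.586 \approx d(a, b) + d(b, c)
```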
> in ML people renamed 'the angle between vectors' into something as fancy-feeling as 'cosine similarity'.
Since cosine similarity is a name derived from linear algebra, it's hardly an ML rebranding.
- There's a well-known result that says you can embed any finite, undirected, unweighted graph (with shortest-path as the metric of choice) in Euclidean space with at most sqrt(2) multiplicative error (~41% relative error). The proof of that, as I recall, basically shows that the worst case is any graph containing a 4-vertex cycle, from which the result immediately follows (a worked 4-cycle computation is sketched at the end of this comment).
- Construct the complete graph where each vertex is one of your (finitely many) vectors.
- Let G(0) denote that graph.
- Greedily construct a sequence of graphs G(1), G(2), ... by adding one extra vertex that splits some edge. The edge you split is the one that minimizes the skew between your metric's distances on the original set of vectors and the new distances defined using shortest paths. When more than one equivalent split exists, choose one arbitrarily.
- Note that G(N) consists of better and better rational approximations of the desired distances, approaching 0 skew as N approaches infinity.
- Using that well-known result, we can embed G(N) with at most ~41% relative error, and as N grows the extra relative error approaches 0, so multiplying those together you get ~41%.
- Delete the extra vertices from G(N) to obtain the embedding for your original vectors.
- This isn't strictly necessary, but you can (I'm 90% sure -- I'd have to work through these details very carefully to be positive) obtain sqrt(2) exactly (instead of just less than sqrt(2)+epsilon for every positive epsilon) in a few ways. The one I'm most familiar with falls back on logical compactness, basically taking all the math that exists for the hyperreals and using it to create a nonstandard space holding all of these graphs. The limit approaching 0 can be phrased as a first-order statement whose consequences allow you to conclude that there exists an embedding which actually attains the limit. Topological arguments of some sort would probably be a more common way to go about it.
- That last point (attaining sqrt(2)) also follows from Bolzano-Weierstrass. Just unroll each embedding into a single point in an n^2-dimensional space if you have n points; that sequence of points has a convergent subsequence (as a technical detail, center the embeddings on the origin to squeeze them all into a bounded space). The skew of an embedding is continuous, so the skew of the limit equals the limit of the skews, and there is some embedding attaining that bound.
I'm still not exactly sure how the result I'm basing this all on is proven.
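To at least make the ~41% figure from the first bullet concrete, here's the 4-cycle computation (just the example that attains sqrt(2), not a proof of the general result):

```latex
% C_4 with unit edges and the shortest-path metric:
d_G(v_i, v_{i+1}) = 1, \qquad d_G(v_1, v_3) = d_G(v_2, v_4) = 2 \\
% Embed the four vertices as the corners of a unit square in \mathbb{R}^2:
d_E(v_i, v_{i+1}) = 1, \qquad d_E(v_1, v_3) = d_E(v_2, v_4) = \sqrt{2} \\
% The edges are preserved exactly; each diagonal is off by a factor of
\frac{d_G(v_1, v_3)}{d_E(v_1, v_3)} = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41
```

As far as I can tell, the short-diagonals inequality in Euclidean space (the two squared diagonals sum to at most the four squared edge lengths) is what rules out doing better than sqrt(2) on the 4-cycle.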