
Embeddings are underrated (2024)

(technicalwriting.dev)
484 points by jxmorris12 | 37 comments
1. tyho ◴[] No.43964392[source]
> The 2D map analogy was a nice stepping stone for building intuition but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It’s impossible for us lowly 3-dimensional creatures to visualize what “distance” looks like in 1000 dimensions. Also, we don’t know what each dimension represents, hence the section heading “Very weird multi-dimensional space”. One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain a dimension with some notion of gender. And so on. Well Dude, we just don’t know.

nit. This suggests that the model contains a direction with some notion of gender, not a dimension. Direction and dimension appear to be inextricably linked by definition, but with some handwavy maths, you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n. This helps explain why spaces on the order of 1k dimensions can "fit" billions of concepts.
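
A quick numpy sketch of that handwavy maths, if you want to see it for yourself: sample random unit vectors in 3 vs. 1000 dimensions and look at the pairwise angles.

    import numpy as np

    rng = np.random.default_rng(0)

    def pairwise_angles(dim, n=200):
        # n random unit vectors in R^dim
        v = rng.normal(size=(n, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        cos = (v @ v.T)[np.triu_indices(n, k=1)]  # pairwise cosines
        return np.degrees(np.arccos(np.clip(cos, -1, 1)))

    for dim in (3, 1000):
        angles = pairwise_angles(dim)
        print(dim, angles.min().round(1), angles.max().round(1))

In 3 dimensions the angles land all over the place; in 1000 dimensions nearly every pair comes out within a few degrees of 90, which is the "nearly orthogonal" part.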

replies(12): >>43964509 #>>43964649 #>>43964659 #>>43964705 #>>43964934 #>>43965081 #>>43965183 #>>43965258 #>>43965725 #>>43965971 #>>43966531 #>>43967165 #
2. kaycebasques ◴[] No.43964509[source]
Oh yes, this makes a lot of sense, thank you for the "nit" (which doesn't feel like a nit to me, it feels like an important conceptual correction). When I was writing the post I definitely paused at that part, knowing that something was off about describing the model as having a dimension that maps to gender. As you said, since the models are general-purpose and work so well in so many domains, there's no way that there's a 1-to-1 correspondence between concepts and dimensions.

I think your comment is also clicking for me now because I previously did not really understand how cosine similarity worked, but then watched videos like this and understand it better now: https://youtu.be/e9U0QAFbfLI
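
(For anyone else catching up: cosine similarity is just the dot product of the two vectors divided by the product of their lengths, e.g. in numpy:)

    import numpy as np

    def cosine_similarity(a, b):
        # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))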

I will eventually update the post to correct this inaccuracy, thank you for improving my own wetware's conceptual model of embeddings

replies(3): >>43964692 #>>43965033 #>>43965713 #
3. aaronblohowiak ◴[] No.43964649[source]
>nearly orthogonal dimensions within n dimensional space

nit within a nit: I believe you intended to write "nearly orthogonal directions within n dimensional space" which is important as you are distinguishing direction from dimension in your post.

replies(1): >>43966937 #
4. PaulHoule ◴[] No.43964659[source]
Note you don't see arXiv papers where somebody feeds in 1000 male gendered words into a word embedding and gets 950 correct female gendered words. Statistically it does better than chance, but word embeddings don't do very well.

In

https://nlp.stanford.edu/projects/glove/

there are a number of graphs where they have about N=20 points that seem to fall in "the right place" but there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall. If you try experiments with N>100 words you go endlessly in circles and produce the kind of inconclusively negative results that people don't publish.

The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word. For instance, you can't really build a "part of speech" classifier that can tell you "red" is an adjective from the word alone, because it is also a noun, but give it the context and you can.

In the context of full text search, bringing in synonyms is a mixed bag because a word might have 2 or 3 meanings and the irrelevant synonyms are... irrelevant and will bring in irrelevant documents. Modern embeddings that recognize context not only bring in synonyms but will suppress usages of the word with different meanings, something the IR community has tried to figure out for about 50 years.
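
A rough illustration of that context sensitivity with the sentence-transformers package (the model name here is just a common default, not a claim about any particular search stack):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "river bank erosion"
    docs = [
        "The flood eroded the bank of the river.",    # relevant sense of "bank"
        "She opened a savings account at the bank.",  # same word, wrong sense
    ]
    q, d = model.encode(query), model.encode(docs)
    print(util.cos_sim(q, d))  # the river sentence should score higher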

replies(4): >>43965006 #>>43965085 #>>43965683 #>>43965720 #
5. OJFord ◴[] No.43964692[source]
I would think of it as the whole embedding concept again on a finer grained scale: you wouldn't say the model 'has a dimension of whether the input is king', instead the embedding expresses the idea of 'king' with fewer dimensions than would be needed to cover all ideas/words/tokens like that.

So the distinction between a direction and a dimension expressing 'gender' is that maybe gender isn't 'important' (or I guess high-information-density) enough to be an entire dimension, but rather is expressed by a linear combination of two (or more) yet more abstract dimensions.

6. osigurdson ◴[] No.43964705[source]
You can't visualize it but you can certainly compute the Euclidean distance. Tools like UMAP can be used to drop the dimensionality as well.
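
e.g. with numpy, distance in 1000 dimensions is no harder to compute than in 2:

    import numpy as np

    a = np.random.rand(1000)  # two made-up 1000-dim embeddings
    b = np.random.rand(1000)
    dist = np.linalg.norm(a - b)  # Euclidean distance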
replies(2): >>43964891 #>>43965121 #
7. aswanson ◴[] No.43964891[source]
Any good umap links?
replies(1): >>43965043 #
8. daxfohl ◴[] No.43964934[source]
Wait, but if gender was composed of say two dimensions, then there'd be no way to distinguish between "the gender is different" and "the components represented by each of those dimensions are individually different", right?
replies(1): >>43965171 #
9. minimaxir ◴[] No.43965006[source]
> The BERT-like and other transformer embeddings far outperform word vectors because they can take into account the context of the word.

In addition to being able to utilize attention mechanisms, modern embedding models use a form of tokenization such as BPE which a) includes punctuation, which is incredibly important for extracting semantic meaning, and b) preserves case without the memory requirements of a full cased word-level vocabulary.

The original BERT used an uncased WordPiece tokenizer, which is out of date nowadays.
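
You can see the difference with the transformers library (using the stock checkpoints):

    from transformers import AutoTokenizer

    wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # uncased WordPiece
    bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE

    print(wordpiece.tokenize("Hello, World!"))  # case lost: ['hello', ',', 'world', '!']
    print(bpe.tokenize("Hello, World!"))        # case kept: ['Hello', ',', 'ĠWorld', '!']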

replies(1): >>43965052 #
10. benatkin ◴[] No.43965033[source]
> Machine learning (ML) has the potential to advance the state of the art in technical writing. No, I’m not talking about text generation models like Claude, Gemini, LLaMa, GPT, etc. The ML technology that might end up having the biggest impact on technical writing is embeddings.

This is maybe showing some age as well, or maybe not. It seems that text generation will soon be writing top tier technical docs - the research done on the problem with sycophancy will likely result in something significantly better than what LLMs had before the regression to sycophancy. Either way, I take "having the biggest impact on technical writing" to mean in the near term. If having great search and organization tools (ambient findability and such) is going to steal the thunder from LLMs writing really good technical docs, it's going to need to happen fast.

replies(1): >>43965841 #
11. minimaxir ◴[] No.43965043{3}[source]
For small datasets, the original UMAP package is fine: https://umap-learn.readthedocs.io/en/latest/

For large datasets (the UMAP algorithm's compute cost grows steeply with dataset size), you will need to use the GPU-accelerated UMAP from cuML. https://docs.rapids.ai/api/cuml/stable/api/#umap
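
Minimal usage sketch, with random vectors standing in for real embeddings:

    import numpy as np
    import umap  # pip install umap-learn

    X = np.random.rand(5000, 768)  # stand-in for 5k embedding vectors
    reducer = umap.UMAP(n_components=2, metric="cosine")
    X_2d = reducer.fit_transform(X)  # shape (5000, 2), ready to scatter-plot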

12. PaulHoule ◴[] No.43965052{3}[source]
I was working at a startup that was trying to develop foundation models around that time, and BPE was such a huge improvement over everything else we'd tried. We had endless meetings where people proposed that we use various embeddings that would lose 100% of the information for out-of-dictionary words, and I'd point out that out-of-dictionary words (particularly from the viewpoint of the pretrained model) frequently meant something critical, and if we lost that information up front we couldn't get it back.

Little did I know that people were going to have a lot of tolerance for "short circuiting" of LLMs, that is getting the right answer by the wrong path, so I'd say now that my methodology of "predictive evaluation" that would put an upper bound on what a system could do was pessimistic. Still I don't like giving credit for "right answer by wrong means" since you can't count on it.

13. drc500free ◴[] No.43965081[source]
Is this because we can essentially treat each dimension like a binary digit, so we get 2^n directions we can encode? Or am I barking up totally the wrong tree?
replies(1): >>43965751 #
14. philipwhiuk ◴[] No.43965085[source]
> In https://nlp.stanford.edu/projects/glove/ there are a number of graphs where they have about N=20 points that seem to fall in "the right place" but there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them to fall.

Ramsey theory (or 'the Woolworths store alignment hypothesis')

15. minimaxir ◴[] No.43965121[source]
Speaking of UMAP, a new update to the cuML library (https://github.com/rapidsai/cuml) released last month allows UMAP to feasibly be used on big data without shenanigans/spending a lot of money. This opens up quite a few new opportunities and I'm getting very good results with it.
16. daxfohl ◴[] No.43965171[source]
Oh, so I think what it does is take a nearly infinite-dimensional nonlinear space, and transform it into "the N dimensional linear space that best preserves approximations of linear combinations of elements". That way, any two (or more) terms can combine to make others, so there isn't such a thing as "prime" terms (similar to real dictionaries, every word is defined in terms of other words). Though some, like gender, may have strong enough correlations so as to be approximately prime in a large enough space. Is that about right?
17. gweinberg ◴[] No.43965183[source]
It's not at all a nit. If one of the dimensions did indeed correspond to gender, you might find "king" and "queen" pretty much only differed in one dimension. More generally, if these dimensions individually refer to human-meaningful concepts, you can find out what these concepts are just by looking at words that pretty much differ only along one dimension.
replies(2): >>43965734 #>>43975861 #
18. rahimnathwani ◴[] No.43965258[source]
Nice article related to the last point (nearly orthogonal vectors):

https://transformer-circuits.pub/2022/toy_model/index.html

19. manmal ◴[] No.43965683[source]
Don’t the high end embedding services use a transformer with attention to compute embeddings? If so, I thought that would indeed capture the semantic meaning quite well, including the trait-is-described-by-direction-vector.
replies(1): >>43966866 #
20. manmal ◴[] No.43965713[source]
This video explains the direction-encodes-trait topic very well IMO: https://youtu.be/wjZofJX0v4M

It’s the first in a series of three that I can very highly recommend.

> there's no way that there's a 1-to-1 correspondence between concepts and dimensions.

I don’t know about that! Once you go very high dimensional, there are a lot of direction vectors that are almost perfectly perpendicular to each other (meaning they can cleanly encode a trait). Maybe they don’t even need to be perfectly perpendicular; the dot product just needs to be very close to zero.

21. yorwba ◴[] No.43965720[source]
> there are a lot of dimensions involved and with 50 dimensions to play with you can always find a projection that makes the 20 points fall exactly where you want them fall.

While it would certainly have been possible to choose a projection where the two groups of words are linearly separable, that isn't even the case for https://nlp.stanford.edu/projects/glove/images/man_woman.jpg : "woman" is inside the "nephew"-"man"-"earl" triangle, so there is no way to draw a line neatly dividing the masculine from the feminine words. But I think the graph wasn't intended to show individual words classified by gender, but rather to demonstrate that in pairs of related words, the difference between the feminine and masculine word vectors points in a consistent direction.

Of course that is hardly useful for anything (if you could compare unrelated words, at least you would've been able to use it to sort lists...) but I don't think the GloVe authors can be accused of having created unrealistic graphs when their graph actually very realistically shows a situation where the kind of simple linear classifier that people would've wanted doesn't exist.
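
That consistent-direction claim is easy to check yourself with any set of word vectors (a sketch, assuming you've loaded them into a dict called vecs):

    import numpy as np

    # vecs: dict mapping word -> embedding vector, however you loaded them
    pairs = [("man", "woman"), ("king", "queen"), ("uncle", "aunt")]
    diffs = [vecs[f] - vecs[m] for m, f in pairs]

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # if the gender offset is consistent, these cosines should all be high
    print(cos(diffs[0], diffs[1]), cos(diffs[0], diffs[2]))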

replies(1): >>43967021 #
22. alok-g ◴[] No.43965725[source]
>> The king - man + woman ≈ queen anecdote ...

>> nit. This suggests that the model contains a direction with some notion of gender ...

In fact, it is likely even more restrictive ...

Even if the said vector arithmetic were to be (approximately) honored by the gender-specific words, it only means there's a specific vector (with a specific direction and magnitude) for such gender translation. 'Woman' + ('king' - 'man') goes to 'queen'; however, p * ('king' - 'man') with p significantly different from one may be a different relation altogether.

The meaning of the vector 'King' - 'man' may be further restricted in that the vector added to a 'Queen' need not land onto some still more royal version of a queen! The networks can learn non-linear behaviors, so the meaning of the vector could be dependent on something about the starting position too.

... unless shown otherwise via experimental data or some reasoning.

23. otabdeveloper4 ◴[] No.43965734[source]
That's the layman intuition, but actual models can give surprising results.

You can test this hypothesis with some clever LLM prompting. When I did this I got "male monarch" for "king" but "British ruler" for "queen".

Oops!

replies(1): >>43966827 #
24. emaro ◴[] No.43965751[source]
Basically, but it gets even better. If you allow directions of 'meaning' to wiggle a little bit (say, between 89 and 91 degrees to all other directions), you get a lot more degrees of freedom. In 3 dimensions, you still only get 3 meaningful directions with that wiggle-freedom. However, in high-dimensional spaces, this small additional freedom allows you to fit a lot more almost-orthogonal directions than the number of strictly orthogonal ones. That means in a 1000-dimensional space you can fit a huge number >> 1000 of binary concepts.
25. kaycebasques ◴[] No.43965841{3}[source]
Realistically, it's probably the combination of both embeddings and text generation models. Embeddings are a crucial technology for making more progress on the intractable challenges of technical writing [1] but then text generation models are key for applying automated updates.

[1] https://technicalwriting.dev/strategy/challenges.html

26. pyinstallwoes ◴[] No.43965971[source]
I posit the fundamental foundation for logic is the recognition of the penis and vagina. From there follows spatial recognition and difference.
27. ohxh ◴[] No.43966531[source]
Johnson–Lindenstrauss lemma [1] for anyone curious. But you can only map down to k > 8(ln N)/ε^2 dimensions if you want to preserve distances within a factor of ε with a JL transform. This is tight up to a constant factor too.

I always wondered: if we want to preserve distances between a billion points within 10%, that would mean we need ~18k dimensions. 1% would be 1.8m. Is there a stronger version of the lemma for points that are well spread out? Or are embeddings really just fine with low precision for the distance?

[1] https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
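
Plugging in the numbers:

    import math

    N = 1_000_000_000
    for eps in (0.1, 0.01):
        k = 8 * math.log(N) / eps**2
        print(eps, round(k))  # ~16.6k for 10%, ~1.66M for 1%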

28. gweinberg ◴[] No.43966827{3}[source]
I'm sorry, I don't get your point at all, and have no idea what you mean by "did this". If you asked for an embedding, you would have gotten a 768 (or whatever) dimensional array right?
replies(1): >>43968957 #
29. realbenpope ◴[] No.43966866{3}[source]
You are correct. https://deepmind.google/research/publications/157741/
30. tyho ◴[] No.43966937[source]
FFS, it's too late for me to edit. You are of course correct.
31. avidiax ◴[] No.43967021{3}[source]
> the two groups of words are linearly separable

This is missing the point. What we have is two dimensions* of hundreds, but those two dimensions chosen show that the vector between a masculine word and its feminine counterpart is very nearly constant, at least across these words and excluding other dimensions.

What you're saying, a line/plane/hyper-plane that separates a dimension of gender into male and female, might also exist. But since gender-neutral terms also exist, we would expect that to be a plane on either side of which gender-neutral terms have a 50/50 chance of falling, and ideally nearby.

* Possibly a pseudo dimension that's a composite of multiple dimensions; IDK, I didn't read the paper.

replies(1): >>43967090 #
32. tomrod ◴[] No.43967090{4}[source]
Just needs to be a separating manifold if we use the kernel trick ;)
33. rdtsc ◴[] No.43967165[source]
> you find that the number of nearly orthogonal dimensions within n dimensional space is exponential with regards to n.

nit for the nit (micro nit!): Is it meant to be "a number of nearly orthogonal directions within n dimensional space"? Otherwise n dimensional space will have just n dimensions.

replies(1): >>43968922 #
34. kaycebasques ◴[] No.43968922[source]
Yes, confirmed here: https://news.ycombinator.com/item?id=43966937
replies(1): >>43969043 #
35. kaycebasques ◴[] No.43968957{4}[source]
For word2vec I know that there's a bunch of demos that let you do the king - man + woman computation, but I don't know how you do this with modern embeddings. https://turbomaze.github.io/word2vecjson/
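
With gensim and the classic pretrained word2vec vectors, the computation looks like this (big download on first use):

    import gensim.downloader

    model = gensim.downloader.load("word2vec-google-news-300")
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' typically comes out on top
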
36. rdtsc ◴[] No.43969043{3}[source]
Ah perfect! Thanks for sharing the article.
37. pletnes ◴[] No.43975861[source]
There’s absolutely no reason to believe that the coordinate system of the embeddings would be aligned along the directions of individual concepts, even if they were linear and one dimensional in the embedding space.