nxa:

I've been playing with embeddings and wanted to see what results word-by-word addition and subtraction produce, beyond the examples many videos and papers mention (like the obvious king - man + woman = queen). So I built something that doesn't just give the first answer, but ranks the matches by distance / cosine similarity. I polished it a bit so that others can try it out, too.

For now the dataset contains only nouns (and some proper nouns), and for homographs I pick the most common interpretation. Also, it's case-sensitive.
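For anyone who wants to poke at this locally, the core loop is only a few lines. Here's a rough sketch in Python (not my exact implementation; the toy random vectors stand in for a real embedding table like GloVe or word2vec, which you'd need for meaningful results):

    import numpy as np

    # Toy stand-in vocabulary; swap in real vectors (GloVe, word2vec, ...)
    # for results that actually mean something.
    rng = np.random.default_rng(0)
    words = ["king", "man", "woman", "queen", "boy", "girl"]
    vectors = {w: rng.normal(size=50) for w in words}

    def cosine(a, b):
        # cosine similarity: dot product of the two normalized vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_matches(query, vectors, exclude=(), top_k=5):
        # Score every vocabulary word against the query vector,
        # skipping the words used in the input expression.
        scored = [(w, cosine(query, v))
                  for w, v in vectors.items() if w not in exclude]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

    query = vectors["king"] - vectors["man"] + vectors["woman"]
    print(rank_matches(query, vectors, exclude={"king", "man", "woman"}))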

godelski:

  data + plural = number
  data - plural = research
  king - crown = (didn't work... crown gets circled in red)
  king - princess = emperor
  king - queen = kingdom
  queen - king = worker
  king + queen = queen + king = kingdom
  boy + age = (didn't work... boy gets circled in red)
  man - age = woman
  woman - age = newswoman
  woman + age = adult female body (tied with man)
  girl + age = female child
  girl + old = female child
The other suggestions are pretty similar to the results I got in most cases. But I think this helps illustrate the curse of dimensionality (i.e. distances are ill-defined in high-dimensional spaces). This is still largely an unsolved problem, and a critical one that doesn't get enough attention.
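To make the concentration effect concrete, here's a quick sketch (numpy + scipy, my choice of tools): sample random points and watch the gap between the nearest and farthest pair shrink as the dimension grows.

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        pts = rng.random((500, dim))   # 500 random points in [0,1]^dim
        d = pdist(pts)                 # all pairwise Euclidean distances
        # The min/max ratio creeps toward 1 as dim grows: everything becomes
        # roughly equidistant, so "nearest" loses its meaning.
        print(f"dim={dim:4d}  min/max distance ratio = {d.min() / d.max():.3f}")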
n2d4:
For fun, I pasted these into ChatGPT o4-mini-high and asked it for an opinion:

   data + plural    = datasets
   data - plural    = datum
   king - crown     = ruler
   king - princess  = man
   king - queen     = prince
   queen - king     = woman
   king + queen     = royalty
   boy + age        = man
   man - age        = boy
   woman - age      = girl
   woman + age      = elderly woman
   girl + age       = woman
   girl + old       = grandmother

The results are surprisingly good; I don't think I could've done better as a human. But keep in mind that this doesn't do embedding math like OP! Still, it does show how generic LLMs can solve some tasks better than traditional NLP methods.

The prompt I used:

> Remember those "semantic calculators" with AI embeddings? Like "king - man + woman = queen"? Pretend you're a semantic calculator, and give me the results for the following:
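If you want to reproduce this programmatically rather than in the chat UI, here's a minimal sketch with the OpenAI Python client (the model name and the word list are placeholders, and this is not an exact transcript of what I did):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        'Remember those "semantic calculators" with AI embeddings? '
        'Like "king - man + woman = queen"? Pretend you\'re a semantic '
        'calculator, and give me the results for the following:\n'
        'data + plural\ndata - plural\nking - crown'
    )
    resp = client.chat.completions.create(
        model="o4-mini",  # pick whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)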

godelski:

  > The results are surprisingly good, I don't think I could've done better as a human
I'm actually surprised that the performance is so poor; I would expect a human to do much better. The GPT model has an embedding layer PLUS a whole transformer model that can untangle the embedded structure.

To clarify some of the issues:

data is both singular and plural, being a mass noun[0,1]. "Datum" is something you'll find in the dictionary, but it's not common in use[2]; the dictionary lags actual usage. Words only mean what we collectively agree they mean (the dictionary definitely helps with that, but we also invent words all the time, e.g. slang). I can see how this one could trip up a human, who might feel the need to change the output and consult a dictionary, but I don't think that's a fair comparison here, as LLMs don't have these same biases.

king - crown really seems like it should be something like "man" or "person". The crown is the manifestation of the ruling power. We still use phrases like "heavy is the head that wears the crown" in reference to leaders in general, not just monarchs.

king - princess I honestly don't know what to expect. "Man" is technically gender-neutral, so I'll take this one.

king - queen I would expect outputs similar to the previous one. I don't quite agree here.

queen - king I get why it's removing royalty, but given the previous two results I think it's showing a weird gender bias. Remember that queen is something like (woman + crown) and king is akin to (man + crown), so queen - king = (woman + crown) - (man + crown) = woman - man.

The others I agree with. I actually ran these because I was quite surprised at the results and was thinking about the aforementioned gender bias.

  > But keep in mind that this doesn't do embedding math like OP!
I think you are misunderstanding the architecture of these models. The embedding sub-network is what translates (tokenized) text into numeric vectors. You'll find mention of the embedding sub-networks in both the GPT-3[3] and GPT-4[4] papers, though they're given less attention than other components. While much smaller than the main network, embedding networks are still quite large; for the smaller models they constitute a significant part of the total parameter count[5].

After the embedding sub-network comes your main transformer network. The purpose of this network is to perform embedding math! It's just that the goal is significantly more complicated math. Remember, these are learnable mappings (see Optimal Transport); we're just breaking the model down into its two main intermediate mappings. But the embeddings still end up being a bottleneck: they are your literal gateway from words to numbers.
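You can see this split directly in an open model. GPT-3/4 weights aren't public, so here's a sketch with GPT-2 small as a stand-in (Python, Hugging Face transformers; the roughly-one-third figure is specific to this model size):

    from transformers import GPT2Model  # pip install transformers torch

    model = GPT2Model.from_pretrained("gpt2")  # the 124M-parameter GPT-2
    emb = model.wte.weight                     # token-embedding matrix
    total = sum(p.numel() for p in model.parameters())
    print(tuple(emb.shape))                    # (50257, 768)
    # For a model this small, the embedding table is roughly a third
    # of all parameters; the share shrinks as models scale up.
    print(f"embedding share of parameters: {emb.numel() / total:.0%}")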

[0] https://en.wikipedia.org/wiki/Mass_noun

[1] https://www.merriam-webster.com/dictionary/data

[2] https://www.sciotoanalysis.com/news/2023/1/18/this-data-or-t...

[3] https://arxiv.org/abs/2005.14165

[4] https://arxiv.org/abs/2303.08774

[5] https://www.lesswrong.com/posts/3duR8CrvcHywrnhLo/how-does-g...

drabbiticus:
The specific cherry-picked examples from GP make sense to me.

   data + plural    = datasets 
   data - plural    = datum
If +/- plural can be taken to mean "make explicitly plural or singular", then this roughly works.

   king - crown     = ruler
Rearrange (because embeddings are just vector math), and you get "king = ruler + crown". Yes, a king is a ruler who has a crown.

   king - princess  = man
This isn't great, I'll grant, but there are many YA novels where someone (eventually) becomes king through marriage to a princess, or where there's intrigue for the princess's hand for reasons of kingly succession, so "king = man + princess" roughly works.

   king - queen     = prince
   queen - king     = woman
I agree it's hard to make sense of "king - queen = prince". "A queen is a woman king" is often how queens are described to young children; in Chinese, it's literally the breakdown of 女王 ("female king"). I also agree there's a gender bias, but LLMs and other AI trained on large human-generated data literally encode the biases in how we actually use language and in our thought patterns. It's one of the big concerns of those in the civil-liberties space. Search "llm discrimination" or similar for more on this.

Playing around with age/time-related words gives a lot of interesting results:

    adult + age = adulthood
    child + age = female child
    year + age = chronological age
    time + year = day
    child + old = today
    adult - old = adult body
    adult - age = powerhouse
    adult - year = man
I think a lot of words are hard to distill into a single embedding. A word may embed a number of conceptually distinct definitions, but my (incomplete) understanding is that these embeddings are not context-sensitive, right? So averaging those distinct definitions into one label is probably fraught with problems when trying to do meaningful vector math with them, problems which context/attention are able to help with.
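The contextual side is easy to check: run the same word through a transformer in two different sentences and compare the vectors it gets. A sketch with BERT via Hugging Face transformers (my choice of model; assumes the word survives tokenization as a single token):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def vec_for(sentence, word):
        # contextual vector for `word` inside `sentence` (first occurrence)
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
        return hidden[idx]

    a = vec_for("the queen ruled the country", "queen")
    b = vec_for("the queen bee left the hive", "queen")
    # Well below 1.0: same word, different vectors in different contexts,
    # unlike a static word2vec/GloVe lookup.
    print(torch.cosine_similarity(a, b, dim=0).item())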

[EDIT: formatting is hard without preview]