Embeddings are underrated (2024)

If embeddings are roughly the equivalent of a hash at least insofar as they transform a large input into some kind of "content-addressed distillation" (ignoring the major difference that a hash is opaque whereas an embedding has intrinsic meaning), has there been any research done on "cracking" them? That is, starting from an embedding and working backwards to generate a piece of text that is semantically close by?

I could imagine an LLM inference pipeline where the next token ranking includes its similarity to the target embedding, or perhaps instead the change in direction towards/away from the desired embedding that adding it would introduce.

Put another way, the author gives the example:

> embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")

What if you could do that but for whole bodies of text?

I'm imagining being able to do "semantic algebra" with whole paragraphs/articles/books. Instead of just prompting an LLM to "adjust the tone to be more friendly", you could have the core concept of "friendly" (or some more nuanced variant thereof) and "add" it to your existing text, etc.