
596 points by yunusabd | 1 comment

I feel like it was inevitable, with the recent buzz around NotebookLM. I'm just surprised it hadn't been done sooner.
tristenharr ◴[] No.41900647[source]
Would be cool to create embeddings for historical HN posts, then personalize story selection: average the embedding vectors of a user's favorite posts into a single vector, and do a cosine-similarity search against it to surface the stories most likely to interest that user.

It would be even better to use a user's like history, though I'm not sure if/how that can be accessed.
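
The mechanics would be something like the minimal sketch below. embed() here is just a character-histogram stand-in so the code runs end to end; you'd swap in a real sentence-embedding model in practice.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in embedding (character histogram) so the sketch is runnable;
        # replace with a real embedding model in practice.
        v = np.zeros(128)
        for ch in text.lower():
            v[ord(ch) % 128] += 1.0
        return v

    def rank_stories(favorite_titles: list[str], candidate_titles: list[str]) -> list[str]:
        # Average the user's favorites into a single "taste" vector...
        taste = np.stack([embed(t) for t in favorite_titles]).mean(axis=0)
        # ...then rank candidate stories by cosine similarity to it.
        cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        return sorted(candidate_titles, key=lambda c: cos(taste, embed(c)), reverse=True)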

Speaking of which, I'm curious how other folks use embeddings. I know you can average multiple embeddings together, but is anyone else doing other translations and having success? I'm thinking of King - Man + Woman = Queen. A lot of the time I see questions being used directly as inputs for semantic search/RAG. I wonder if it might make sense to create a large set of question-answer pairs, embed them, and determine the average translation that moves from "question space" to "answer space"; then, when you embed a question, you apply that translation to its embedding before performing RAG. Or maybe this would just add too much noise?
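
For what it's worth, the mechanics of that translation are simple; the open question is whether one average offset generalizes across topics. A rough sketch, again with a character-histogram stand-in for embed():

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in embedding; use a real model in practice.
        v = np.zeros(128)
        for ch in text.lower():
            v[ord(ch) % 128] += 1.0
        return v

    def learn_translation(qa_pairs: list[tuple[str, str]]) -> np.ndarray:
        # Average offset from question embeddings to answer embeddings.
        return np.mean([embed(a) - embed(q) for q, a in qa_pairs], axis=0)

    def to_answer_space(question: str, translation: np.ndarray) -> np.ndarray:
        # Shift a new question's embedding toward "answer space" before retrieval.
        return embed(question) + translation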

replies(3): >>41900871 #>>41901018 #>>41902430 #
1. vunderba ◴[] No.41900871[source]
Hmmm, I can't speak to people using word2vec-style arithmetic in conjunction with RAG, but the other use case is actually pretty common (in my experience you don't need to generate answers, though).

For each document intended for ingestion into a vector database:

- Use an LLM to generate a list of possible questions that the document is capable of answering (essentially equivalent to generating a quiz)

- Map these question embeddings back to the original documents

- Store the document, its chunks, question 1, question 2, etc. in the vector database

So now when a person queries your RAG, you have the direct link from user query -> doc chunk, but also the transitive link from user query -> similar generated question -> doc chunk.
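
A minimal end-to-end sketch of that scheme. embed() and generate_questions() are stand-ins for a real embedding model and the LLM "quiz generation" call, and a plain list stands in for the vector database:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in embedding (character histogram); use a real model in practice.
        v = np.zeros(128)
        for ch in text.lower():
            v[ord(ch) % 128] += 1.0
        return v

    def generate_questions(chunk: str) -> list[str]:
        # Placeholder for the LLM quiz-generation step described above.
        return [f"What does this passage say about {chunk.split()[0]}?"]

    index = []  # (embedding, chunk) pairs; a real setup would use a vector DB

    def ingest(chunks: list[str]) -> None:
        for chunk in chunks:
            index.append((embed(chunk), chunk))       # direct link
            for q in generate_questions(chunk):       # transitive link via questions
                index.append((embed(q), chunk))

    def query(user_query: str, k: int = 3) -> list[str]:
        qv = embed(user_query)
        cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        ranked = sorted(index, key=lambda p: cos(qv, p[0]), reverse=True)
        # Several index entries can point at the same chunk; dedupe in rank order.
        seen, out = set(), []
        for _, chunk in ranked:
            if chunk not in seen:
                seen.add(chunk)
                out.append(chunk)
        return out[:k]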