
Embeddings are underrated (2024)

(technicalwriting.dev)
484 points jxmorris12 | 8 comments
1. jacobr1 No.43964219
I may have missed it ... but were any direct applications to tech writers discussed in this article? Embeddings are fascinating and very important for things like LLMs or semantic search, but the author seems to imply more direct utility.
replies(4): >>43964349, >>43964388, >>43964584, >>43964664
2. PaulHoule No.43964349
Semantic search, classification, and clustering. For the first, there is a substantial breakthrough in IR every 10 years or so, so you take what you can get. (I got so depressed reading TREC proceedings, which seemed to prove that "every obvious idea to improve search relevance doesn't work," and it wasn't until I found a summary of the first ten years that I learned those years had turned up one useful result: BM25.)

As for classification, it is highly practical to run a text through an embedding model and then feed the resulting vector to a classical ML algorithm from

https://scikit-learn.org/stable/supervised_learning.html

This works so consistently that I'm considering not including a bag-of-words classifier in a text classification library I'm working on. People who hold court on Hugging Face forums tend to believe you can do better with a fine-tuned BERT, and I'd agree you can do better with that, but training time is 100x longer and maybe you won't.
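
To make that concrete, here's a minimal sketch of the embed-then-classify pipeline. The sentence-transformers package, the all-MiniLM-L6-v2 model, and the example texts are all illustrative assumptions, not something named above:

    # Sketch: embed texts, then fit a classical scikit-learn
    # classifier on the vectors. Model name and data are made up.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    texts = ["refund my order", "the app crashes on launch",
             "love the new design", "billed twice this month"]
    labels = ["billing", "bug", "praise", "billing"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    X = model.encode(texts)  # one dense vector per text

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(model.encode(["charged me two times?"])))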

20 years ago you could make bag-of-words vectors and put them through a clustering algorithm

https://scikit-learn.org/stable/modules/clustering.html

and it worked, technically, but the clusters were awful. With embeddings you can use a very simple and fast algorithm like

https://scikit-learn.org/stable/modules/clustering.html#k-me...

and get great clusters.
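
For instance, a k-means sketch along those lines (again assuming sentence-transformers for the embedding step; the texts and cluster count are illustrative):

    # Sketch: cluster embedding vectors with plain k-means.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    texts = ["how do I reset my password",
             "password reset link is broken",
             "pricing for the enterprise plan",
             "what does the business tier cost"]

    X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)  # e.g. [0 0 1 1]: password texts vs. pricing texts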

I'd disagree with the bit about it taking "a lot of linear algebra" to find nearby vectors; it can be done with a dot product, so I'd say it takes "a little linear algebra."
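
In NumPy terms, that little linear algebra is one line. A sketch, assuming the vectors are unit-normalized so the dot product equals cosine similarity:

    import numpy as np

    # Sketch: nearest neighbor via one dot product per candidate.
    corpus = np.random.randn(1000, 384)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize
    query = corpus[42] + 0.01 * np.random.randn(384)
    query /= np.linalg.norm(query)

    scores = corpus @ query   # cosine similarities, shape (1000,)
    print(scores.argmax())    # should print 42, the perturbed source vector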

3. podgietaru No.43964388
I built an RSS aggregator with semantic search using embeddings. The main use case was being able to categorise articles into any arbitrary, ad-hoc category you cared to invent.

https://github.com/aws-samples/rss-aggregator-using-cohere-e...

Unfortunately I no longer work at AWS, so the infrastructure that was running it is down.
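
The arbitrary-category trick boils down to embedding the category names themselves and assigning each article to the nearest one. A rough sketch of the idea, using sentence-transformers as a stand-in for the Cohere embeddings the linked sample actually uses (categories and headlines are made up):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Sketch: classify articles against ad-hoc categories by embedding both.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    categories = ["space exploration", "personal finance", "machine learning"]
    articles = ["NASA delays its next lunar landing",
                "Five index funds for first-time investors"]

    C = model.encode(categories, normalize_embeddings=True)
    A = model.encode(articles, normalize_embeddings=True)
    for article, sims in zip(articles, A @ C.T):
        print(article, "->", categories[int(np.argmax(sims))])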

4. kaycebasques No.43964584
> were any direct applications to tech writers discussed in this article

No, it was supposed to be a teaser post, followed up by more posts and projects exploring the different applications of embeddings in technical writing (TW). But alas, life happened, and I'm now a proud new papa with a 3-month-old baby :D

I do have other projects and embeddings-related posts in the pipeline. Suffice it to say, embeddings can help us make progress on all 3 of the "intractable" challenges of TW mentioned here: https://technicalwriting.dev/strategy/challenges.html

replies(2): >>43965295, >>43966786
5. sansseriff No.43964664
It would be great to semantically search through the literature with embeddings. At least one person I know of is trying to generate a vector database of all arXiv papers.

The big problem I see is attribution and citations. An embedding is just a vector; it doesn't carry a citation back to the source material, a modification date, or a certificate of authenticity. So when embeddings are used in RAG, they only serve to link back to a particular page of source material.
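
That said, nothing stops you from storing citation metadata alongside each vector and returning it with every match. A toy sketch (the record fields here are invented for illustration, not any standard):

    import numpy as np

    # Toy sketch: a vector "index" whose hits carry citation metadata.
    records = [
        {"vec": None, "title": "Paper A", "page": 3, "retrieved": "2024-05-01"},
        {"vec": None, "title": "Paper B", "page": 7, "retrieved": "2024-05-02"},
    ]
    rng = np.random.default_rng(0)
    for r in records:
        v = rng.standard_normal(384)
        r["vec"] = v / np.linalg.norm(v)

    def search(query_vec, k=1):
        query_vec = query_vec / np.linalg.norm(query_vec)
        ranked = sorted(records, key=lambda r: -float(r["vec"] @ query_vec))
        return [{key: r[key] for key in ("title", "page", "retrieved")}
                for r in ranked[:k]]

    print(search(records[0]["vec"]))  # [{'title': 'Paper A', 'page': 3, ...}]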

Using embeddings as links doesn't dramatically change the way citation and attribution are handled in technical writing. You still end up citing a whole paper or a page of a paper.

I think GraphRAG [1] is a more useful thing to build on for technical literature. There are ways to use graphs to cite a particular concept on a particular page of an academic paper, and for the 'citations' to act as bidirectional links between new and old scientific discourse. But I digress.

[1] https://microsoft.github.io/graphrag/
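
To illustrate the bidirectional-link idea, a tiny sketch with networkx; the node IDs and edge labels are invented for illustration, not GraphRAG's actual schema:

    import networkx as nx

    # Sketch: concept-level citations as a directed graph.
    g = nx.DiGraph()
    g.add_edge("new-paper-2024#page2", "concept:attention", relation="cites")
    g.add_edge("concept:attention", "old-paper-2017#page3", relation="defined_in")

    # Walk the graph in both directions from the concept node:
    print(list(g.predecessors("concept:attention")))  # newer work citing it
    print(list(g.successors("concept:attention")))    # where it is defined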

replies(1): >>43969021
6. jacobr1 No.43965295
Thanks for sharing regardless. It was a good overview for those less familiar with the material.
7. kaycebasques No.43966786
Also re: a direct application I forgot to mention this: https://www.tdcommons.org/dpubs_series/8057/

(It finally published last week after being in review purgatory for months)

8. kaycebasques No.43969021
IMO, for technical writing, citing a page or section within a page is usually good enough. I rarely need to cite a particular concept. But I've never even thought of the possibility of more granular concept-level citations and will definitely be pondering it more!