As for classification, it is highly practical to put a text through an embedding and then run the embedding through a classical ML algorithm out of
https://scikit-learn.org/stable/supervised_learning.html
This works so consistently that I'm considering not packing in a bag-of-words classifier in a text classification library I'm working on. People who hold court on Huggingface forums tends to believe you can do better with fine-tuned BERT, and I'd agree you can do better with that, but training time is 100x and maybe you won't.
20 years ago you could make bag-of-word vectors and put them through a clustering algorithm
https://scikit-learn.org/stable/modules/clustering.html
and it worked but you got awful results. With embeddings you can use a very simple and fast algorithm like
https://scikit-learn.org/stable/modules/clustering.html#k-me...
and get great clusters.
I'd disagree with the bit that it takes "a lot of linear algebra" to find nearby vectors, it can be done with a dot product so I'd say it is "a little linear algebra"
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
Unfortunately I no longer work at AWS so the infrastructure that was running it is down.
No, it was supposed to be a teaser post followed up by more posts and projects exploring the different applications of embeddings in technical writing (TW). But alas, life happened, and I'm now a proud new papa with a 3-month old baby :D
I do have other projects and embeddings-related posts in the pipeline. Suffice to say, embeddings can help us make progress on all 3 of the "intractable" challengs of TW mentioned here: https://technicalwriting.dev/strategy/challenges.html
The big problem I see is attribution and citations. An embedding is just a vector. It doesn't contain any citation back to the source material or modification date or certificate of authenticity. So when using embeddings in RAG, they only serve to link back to a particular page of source material.
Using embeddings as links doesn't dramatically change the way citation and attribution are handled in technical writing. You still end up citing a whole paper or a page of a paper.
I think GraphRAG [1] is a more useful thing to build on for technical literature. There's ways to use graphs to cite a particular concept of a particular page of an academic paper. And for the 'citations' to act as bidirectional links between new and old scientific discourse. But I digress
(It finally published last week after being in review purgatory for months)