
Embeddings are underrated (2024)

(technicalwriting.dev)
484 points by jxmorris12 | 1 comment
1. joaoli131 No.43967801
Embeddings are still underrated—even in RAG.

Legal text is deeply hierarchical and full of cross-reference pointers (“Art. 5 CF”, “see Art. 34”). One vector per article leaves too much on the table.

Things that moved the needle for us:

– *Multi-layer embeddings:* index a vector for every paragraph and for every structural level above it (chapter → book), so the retriever can pick the right granularity. (arXiv:2411.07739; first sketch below)

– *Propositional queries:* strip speech-act fluff (“could you please…”) from the query before embedding it. Both similarity scores and top-k recall jump. (arXiv:2503.10654; second sketch below)

– *Poly-vector retrieval:* two vectors per norm, one for the content and one for the label/nickname. Handles both “what does the CDC say?” and internal cross-references. (arXiv:2504.10508; third sketch below)
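
To make the first point concrete, here's a minimal sketch of multi-layer indexing, assuming a sentence-transformers encoder and a toy statute tree. The model choice, node schema, and texts are all illustrative, not what the paper ships:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # One entry per node at every structural level,
    # not just one vector per article.
    nodes = [
        {"level": "paragraph", "ref": "Art. 5, §1 CF",   "text": "Paragraph text ..."},
        {"level": "article",   "ref": "Art. 5 CF",       "text": "Full article text ..."},
        {"level": "chapter",   "ref": "Title II, Ch. I", "text": "Chapter-level text ..."},
    ]
    vecs = model.encode([n["text"] for n in nodes], normalize_embeddings=True)

    def retrieve(query, k=2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = vecs @ q  # cosine similarity, since vectors are normalized
        return [nodes[i] for i in np.argsort(-scores)[:k]]
    # The top hit's "level" tells you which granularity matched best.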
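
For the second point, the paper rewrites queries with an LLM; a crude regex pass is enough to show the idea of peeling speech-act framing off before embedding. The patterns below are mine, not theirs:

    import re

    # Speech-act openers that carry no propositional content.
    FLUFF = re.compile(
        r"^(could you( please)?|can you|please|tell me|"
        r"i('d| would) like to know)\s+",
        flags=re.IGNORECASE)

    def propositionalize(query):
        q = query.strip()
        while True:  # openers often stack ("could you please tell me ...")
            stripped = FLUFF.sub("", q)
            if stripped == q:
                break
            q = stripped
        return q.rstrip("?") or query

    propositionalize("Could you please tell me what Art. 5 CF guarantees?")
    # -> "what Art. 5 CF guarantees"  (this is what gets embedded)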
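
And for poly-vector retrieval, a minimal version keeps two parallel indexes and scores each norm by the better of its two vectors. The field names and the max-merge are my simplification of the paper's formulation:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    norms = [{
        "id": "lei-8078-1990",
        "label": "Lei 8.078/1990, Código de Defesa do Consumidor (CDC)",
        "content": "Full text of the consumer-protection code ...",
    }]
    label_vecs = model.encode([n["label"] for n in norms], normalize_embeddings=True)
    content_vecs = model.encode([n["content"] for n in norms], normalize_embeddings=True)

    def retrieve_norm(query, k=3):
        q = model.encode([query], normalize_embeddings=True)[0]
        # A hit on either the label ("CDC") or the content surfaces the norm.
        scores = np.maximum(label_vecs @ q, content_vecs @ q)
        return [norms[i] for i in np.argsort(-scores)[:k]]

    retrieve_norm("what does the CDC say about refunds?")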

*TL;DR* If your corpus has hierarchy or aliases, stop thinking “one doc = one embedding.” Plenty of juice to squeeze before heavier tricks.

[1] https://arxiv.org/abs/2411.07739
[2] https://arxiv.org/abs/2503.10654
[3] https://arxiv.org/abs/2504.10508