Show HN: Chonky – a neural approach for text semantic chunking

Training a splitter based on existing paragraph conventions is really cool. Actually, that's a task I run into frequently (trying to turn a YouTube auto-transcript blob into readable sentences). LLMs tend to rewrite the text a bit too much instead of just adding punctuation.

As for RAG, I haven't noticed LLMs struggling with poorly structured text (e.g. the YouTube wall of text blob can just be fed directly into LLMs), though I haven't measured this.

In fact, my own "webgrep" (convert the top 10 search results into text and run grep on them, optionally followed by an LLM summary) works at the byte level (I gave up on chunking by words, sentences and paragraphs entirely): I just shove the 1 KB before and after the match into the context. This works fine because LLMs just ignore the "mutilated" word parts at the beginning and end.
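
To make the byte-level idea concrete, here is a minimal sketch, not the actual webgrep code. The names `grep_with_context` and `to_text` are made up for illustration, and merging overlapping windows is my own assumption about how to avoid feeding the same bytes twice.

```python
import re


def grep_with_context(data: bytes, pattern: bytes, context: int = 1024) -> list[bytes]:
    """Return the bytes around each regex match, merging overlapping windows."""
    windows: list[tuple[int, int]] = []
    for m in re.finditer(pattern, data):
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        if windows and start <= windows[-1][1]:
            # Overlaps or touches the previous window: extend it instead
            # of emitting the same bytes twice (an assumption of this sketch).
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return [data[s:e] for s, e in windows]


def to_text(snippet: bytes) -> str:
    # A window can cut a multi-byte UTF-8 character in half;
    # errors="ignore" silently drops those broken bytes.
    return snippet.decode("utf-8", errors="ignore")
```

The windows cut straight through words (and occasionally through characters), which is exactly the "mutilated" edge text that the model then has to put up with.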

The only downside of this approach is that if I were the LLM, I would probably be unhappy with my job!

As for semantic chunking (in the sense of maximizing the relevance of what goes into the LLM, or indeed as a semantic search for the user), I haven't solved it yet, but I can share one amusing experiment: to find the relevant part of the text (having already retrieved a mostly-relevant big chunk of text), chop off one sentence at a time and re-run the similarity check! So you "distil" the text down to whatever is most relevant (according to the embedding model) to the user query.
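
Roughly, the experiment looks like the sketch below. Two loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and the greedy rule (drop whichever single sentence raises the similarity most, stop when nothing helps) is just one way to read "chop off one sentence at a time". All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_sentences(text: str) -> list[str]:
    # Naive splitter: assumes sentences end in ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def distil(text: str, query: str) -> str:
    q = embed(query)
    sentences = split_sentences(text)
    best = cosine(embed(" ".join(sentences)), q)
    while len(sentences) > 1:
        # Re-run the similarity check for every "one sentence removed" variant.
        scored = [
            (cosine(embed(" ".join(sentences[:i] + sentences[i + 1:])), q), i)
            for i in range(len(sentences))
        ]
        score, drop = max(scored)
        if score <= best:
            break  # no single removal improves relevance, so stop
        best = score
        del sentences[drop]
    return " ".join(sentences)
```

Each round re-embeds every candidate, so the number of embedding calls grows quadratically with the sentence count.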

This is very slow and stupid, especially in real time (though kinda fun to watch), but it kinda works for the "approximately one sentence answers my question" scenario. A much cheaper approximation here would be to just embed at the sentence level as well as the page/paragraph level.
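
A sketch of that cheaper variant, under the same assumption as before: `embed` here is a toy bag-of-words stand-in for a real embedding model, and `build_index` / `search` are hypothetical names. Everything is embedded once up front, at both granularities, so a query costs one embedding plus a ranking pass.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(pages: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Index every page twice: once as a whole, once per sentence."""
    index = []
    for name, text in pages.items():
        index.append((name, text, embed(text)))
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                index.append((name, sentence, embed(sentence)))
    return index


def search(index, query: str, k: int = 3) -> list[tuple[str, str]]:
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(entry[2], q), reverse=True)
    return [(name, text) for name, text, _ in ranked[:k]]
```

When a single sentence answers the question, the sentence-level entry tends to outrank its own page, because the page vector is diluted by everything else on it.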