←back to thread

169 points hessdalenlight | 1 comments | | HN request time: 0.201s | source

TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

The current text splitting approaches rely on heuristics (although one can use neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base distilbert model and trained it on a bookcorpus to split concatenated text paragraphs into original paragraphs. Basically it’s a token classification task. Model fine-tuning took day and a half on a 2x1080ti.

The library could be used as a text splitter module in a RAG system or for splitting transcripts for example.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.

The problem is that although in theory this should improve overall RAG pipeline performance I didn’t manage to measure it properly. Other limitations: the model only supports English for now and the output text is downcased.

Please give it a try. I'll appreciate a feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

1. andai ◴[] No.43672904[source]
Training a splitter based on existing paragraph conventions is really cool. Actually, that's a task I run into frequently (trying to turn YouTube auto-transcript blob of text into readable sentences). LLMs tend to rewrite the text a bit too much instead of just adding punctuation.

As for RAG, I haven't noticed LLMs struggling with poorly structured text (e.g. the YouTube wall of text blob can just be fed directly into LLMs), though I haven't measured this.

In fact my own "webgrep" (convert top 10 search results into text and run grep on them, optionally followed by LLM summary) works on the byte level (gave up chunking words, sentences and paragraphs entirely): I just shove the 1kb before and after the match into the context. This works fine because LLMs just ignore the "mutilated" word parts at the beginning and end.

The only downside of this approach is that if I was the LLM, I would probably be unhappy with my job!

As for semantic chunking (in the context of, maximize the relevance of stuff that goes into the LLM, or indeed as a semantic search for the user), I haven't solved it yet, but I can share one amusing experiment: to find the relevant part of the text (having already returned a mostly-relevant big chunk of text), chop off one sentence at a time and re-run the similarity check! So you "distil" the text down to that which is most relevant (according to the embedding model) to the user query.

This is very slow and stupid, especially in real-time (though kinda fun to watch), but kinda works for the "approximately one sentence answers my question" scenario. A much cheaper approximation here would just be to embed at the sentence level as well as the page/paragraph level.