The current text splitting approaches rely on heuristics (although one can use neural embedder to group semantically related sentences).
I propose a fully neural approach to semantic chunking.
I took the base distilbert model and trained it on a bookcorpus to split concatenated text paragraphs into original paragraphs. Basically it’s a token classification task. Model fine-tuning took day and a half on a 2x1080ti.
The library could be used as a text splitter module in a RAG system or for splitting transcripts for example.
The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
The problem is that although in theory this should improve overall RAG pipeline performance I didn’t manage to measure it properly. Other limitations: the model only supports English for now and the output text is downcased.
Please give it a try. I'll appreciate a feedback.
The Python library: https://github.com/mirth/chonky
The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...
In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.