
169 points by hessdalenlight | 2 comments

TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated text back into the original paragraphs. It's essentially a token classification task. Fine-tuning took a day and a half on 2x 1080 Ti.
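A minimal sketch of how such training examples might be built, as I read the setup (the label scheme and the lack of windowing are my assumptions, not necessarily the author's actual preprocessing):

```python
# Sketch: turn a list of paragraphs into one token-classification example.
# Assumption: the last token of each paragraph gets label 1 ("split here"),
# every other token gets label 0. Long-sequence windowing is omitted.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def make_example(paragraphs):
    input_ids, labels = [], []
    for para in paragraphs:
        ids = tokenizer(para, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend([0] * (len(ids) - 1) + [1])  # boundary at paragraph end
    return {"input_ids": input_ids, "labels": labels}

example = make_example([
    "First paragraph about one topic.",
    "Second paragraph that shifts to another topic.",
])
```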

The library could be used as the text-splitter module in a RAG system, or for splitting transcripts, for example.

The usage pattern I have in mind is the following: strip all markup tags to produce plain text and feed that text into the model, as in the sketch below.
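A hedged example of that pattern, using the generic Hugging Face token-classification pipeline rather than the library's own wrapper (the model id below is a placeholder, since the full name is truncated in the link; the boundary-slicing logic is likewise my assumption about how to consume the model's output):

```python
# Sketch: strip markup, then run the chunking model as a token classifier.
# MODEL_ID is a placeholder -- substitute the full id from the Hugging Face page.
from bs4 import BeautifulSoup
from transformers import pipeline

MODEL_ID = "mirth/chonky_distilbert_base_uncased_..."  # placeholder, see link

html = "<article><p>Some page.</p><p>With markup.</p></article>"
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

splitter = pipeline("token-classification", model=MODEL_ID,
                    aggregation_strategy="simple")
predictions = splitter(text)  # predicted paragraph-boundary spans

# Cut the original text at each predicted boundary's end offset.
chunks, start = [], 0
for p in predictions:
    chunks.append(text[start:p["end"]].strip())
    start = p["end"]
chunks.append(text[start:].strip())
print(chunks)
```

Slicing the original string by character offsets sidesteps the lowercased model output mentioned below, since the source text's casing is preserved in the chunks.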

One caveat: although in theory this should improve overall RAG pipeline performance, I haven't yet managed to measure that properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

1. oezi No.43671492
Just to understand: the model is trained to put paragraph breaks into text, and the training dataset is books (as opposed to, for instance, scientific articles or advertising flyers).

It shouldn't break sentences at commas, right?

2. hessdalenlight No.43672057
No, it shouldn't, but since it's a neural net there's a small chance it could.