
169 points by hessdalenlight | 1 comment

TL;DR: I've made a transformer model and a wrapper library that segment text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated text paragraphs back into the original paragraphs. Basically it's a token classification task. Fine-tuning took a day and a half on 2x1080Ti GPUs.
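As a rough illustration of that training setup (not my exact code; the tokenizer choice and the "last token of each paragraph gets label 1" scheme are assumptions for the sketch), concatenated paragraphs can be turned into token-level labels like this:

```python
# Sketch: build token-classification labels for paragraph-boundary detection.
# Assumes the distilbert-base-uncased tokenizer; label 1 = token ends a paragraph, 0 = otherwise.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def make_example(paragraphs):
    input_ids, labels = [], []
    for para in paragraphs:
        ids = tokenizer(para, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend([0] * (len(ids) - 1) + [1])  # last token of the paragraph marks a boundary
    return {"input_ids": input_ids, "labels": labels}

example = make_example([
    "The first paragraph of a book chapter.",
    "A second paragraph that the model should learn to split off.",
])
```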

The library could be used as a text-splitter module in a RAG system, or for splitting transcripts, for example.

The usage pattern I have in mind: strip all markup tags to produce plain text, then feed that text into the model.
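A minimal sketch of that pattern, going straight through the Hugging Face token-classification pipeline rather than the wrapper library (the model id below is a placeholder for the linked checkpoint, and treating each predicted span as a split point is my assumption about the label scheme, so check the README for the library's own interface):

```python
# Sketch: strip markup, then split plain text at tokens the model flags as paragraph boundaries.
import re
from transformers import pipeline

MODEL_ID = "mirth/chonky_distilbert_base_uncased_..."  # placeholder: full id is on the HF page

html = "<p>First thought about RAG pipelines.</p><p>Second, unrelated thought.</p>"
text = re.sub(r"<[^>]+>", " ", html)  # crude markup stripping, just for the example

tagger = pipeline("token-classification", model=MODEL_ID, aggregation_strategy="simple")

chunks, start = [], 0
for ent in tagger(text):
    chunks.append(text[start:ent["end"]].strip())  # cut after each predicted boundary span
    start = ent["end"]
chunks.append(text[start:].strip())

print(chunks)
```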

The problem is that, although in theory this should improve overall RAG pipeline performance, I haven't managed to measure it properly yet. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

mathis-l | No.43671297
You might want to take a look at https://github.com/segment-any-text/wtpsplit

It uses a similar approach, but the focus is on sentence/paragraph segmentation in general rather than specifically on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take Chonky next.
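For reference, a quick side-by-side could start from wtpsplit's documented SaT interface with one of its small published checkpoints (the model name and exact options below should be checked against its README):

```python
# Sketch: segment the same raw text with wtpsplit for comparison against Chonky's output.
from wtpsplit import SaT

sat = SaT("sat-3l-sm")  # small Segment-any-Text checkpoint; larger ones exist

text = (
    "the first paragraph of some transcript without punctuation cues "
    "the second paragraph that a good splitter should separate out"
)

segments = sat.split(text)  # list of predicted segment strings
print(segments)
```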

vunderba | No.43673879
This is the library that I use, mainly on very noisy IRC chat transcripts, and it works pretty well. OP, I'd love to see a paragraph-matching benchmark against wtpsplit to see how well Chonky stacks up.
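Such a benchmark could be as simple as boundary precision/recall over character offsets; a rough sketch of the metric (the offset-based formulation is my own choice, not something either library ships):

```python
# Sketch: boundary F1 between predicted and gold paragraph splits, measured at character offsets.
def boundary_offsets(chunks):
    """Cumulative end offsets of all chunks except the last (i.e. the split points)."""
    offsets, pos = set(), 0
    for chunk in chunks[:-1]:
        pos += len(chunk)
        offsets.add(pos)
    return offsets

def boundary_f1(predicted_chunks, gold_chunks):
    pred, gold = boundary_offsets(predicted_chunks), boundary_offsets(gold_chunks)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(boundary_f1(["abc", "defg", "hi"], ["abc", "de", "fghi"]))  # 0.5: one of two boundaries matches
```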