
169 points | hessdalenlight | 1 comment

TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated text paragraphs back into the original paragraphs. Basically it's a token classification task. Fine-tuning took a day and a half on 2×1080 Ti GPUs.
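To make the training setup concrete, here is a minimal sketch (not the author's actual code) of how paragraph splitting can be framed as token classification: paragraphs are concatenated, and the last token of each original paragraph is labeled as a boundary. The whitespace tokenizer and function name are simplifications; the real model works on DistilBERT subword tokens.

```python
# Illustrative sketch: build token-classification labels for paragraph
# boundary detection. Label 1 = "a paragraph ends at this token".

def make_boundary_labels(paragraphs):
    """Concatenate paragraphs and label the last token of each one."""
    tokens, labels = [], []
    for para in paragraphs:
        words = para.split()  # simplification; real model uses subword tokens
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])  # 1 marks a paragraph end
    return tokens, labels

tokens, labels = make_boundary_labels(["First paragraph here.", "Second one."])
# tokens: ['First', 'paragraph', 'here.', 'Second', 'one.']
# labels: [0, 0, 1, 0, 1]
```

At inference time the task is inverted: the model predicts the boundary labels, and the text is cut wherever a 1 is predicted.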

The library could be used as a text-splitter module in a RAG system, or for splitting transcripts, for example.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
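The markup-stripping step can be done with the standard library alone. This is a hedged sketch of that preprocessing (assuming HTML input), not part of the chonky library itself:

```python
# Minimal markup stripping with the stdlib HTML parser: collect only the
# text nodes and normalize whitespace, producing pure text for the model.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(html):
    extractor = TextExtractor()
    extractor.feed(html)
    # Join text nodes and collapse runs of whitespace.
    return " ".join(" ".join(extractor.parts).split())

strip_markup("<p>Hello <b>world</b>.</p>")  # -> 'Hello world .'
```

For production documents a dedicated extractor (e.g. one that preserves paragraph breaks) would be a better fit, but the idea is the same: the model sees plain text only.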

The problem is that, although in theory this should improve overall RAG pipeline performance, I haven't managed to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

michaelmarkell | No.43672718
It seems to me like chunking (or some higher order version of it like chunking into knowledge graphs) is the highest leverage thing someone can work on right now if trying to improve intelligence of AI systems like code completion, PDF understanding etc. I’m surprised more people aren’t working on this.
serjester | No.43672966
Chunking is less important in the long-context era, with most people just pulling in the top k ≈ 20 chunks. You obviously don't want to butcher it, but you've got a lot of room for error.
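The "pull in the top-k chunks" step the comment refers to can be sketched in a few lines. This is a toy, dependency-free illustration (real systems use a vector database and embedding model); the vectors and names here are made up:

```python
# Toy top-k retrieval: rank chunk embeddings by cosine similarity to a
# query embedding and keep the k best indices.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=20):
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

top_k([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]], k=2)  # -> [0, 2]
```

With a large k and a long-context model, sloppy chunk boundaries are mostly recoverable, which is the comment's point.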
lmeyerov | No.43673019
Yeah exactly

We still want chunking in practice to avoid LLM confusion and undifferentiated embeddings, and to handle large datasets at lower cost and higher volume. Large context means we can now tolerate multi-paragraph or multi-page chunks, so it's more like chunking by coherent section.
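"Chunk by coherent section" can be approximated by greedily merging model-predicted paragraphs into larger chunks up to a size budget. A minimal sketch, with a made-up word budget standing in for a real token budget:

```python
# Greedily merge paragraphs into multi-paragraph chunks without exceeding
# a word budget; each chunk stays a run of consecutive paragraphs.
def merge_paragraphs(paragraphs, max_words=200):
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # close the current chunk
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

merge_paragraphs(["a b c", "d e", "f g h i"], max_words=5)
# -> ['a b c d e', 'f g h i']
```

A fancier version would also compare embeddings of adjacent paragraphs and only merge semantically similar ones, but the greedy budget version already captures the "coherent section" behavior described above.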

In theory we could do an entire chapter or book, but those other concerns come into play, so I only see more niche tools or talk-to-your-PDF products doing that.

At the same time, embedding is often a significant cost in the above scenarios, so I'm curious about the semantic chunking overhead.