(github.com)

169 points hessdalenlight | 1 comments | 11 Apr 25 12:18 UTC

TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and fine-tuned it on BookCorpus to split concatenated paragraphs back into the original paragraphs. Basically, it’s a token classification task. Fine-tuning took a day and a half on 2x 1080 Ti GPUs.
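To make the token classification framing concrete, here is a minimal sketch of how training examples for such a task could be built. It is not the actual training code: the whitespace split stands in for a real subword tokenizer, the function names `build_example` and `split_by_labels` are hypothetical, and labeling the last token of every paragraph (including the final one) with 1 is an assumption about the labeling scheme.

```python
from typing import List, Tuple


def build_example(paragraphs: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate paragraphs into one token stream with boundary labels.

    Assumption: the last token of each paragraph gets label 1, all other
    tokens get label 0. A whitespace split stands in for a real tokenizer.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for paragraph in paragraphs:
        words = paragraph.split()
        if not words:
            continue  # skip empty paragraphs
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def split_by_labels(tokens: List[str], labels: List[int]) -> List[str]:
    """Rebuild chunks by cutting after every token labeled 1."""
    chunks: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a closing boundary
        chunks.append(" ".join(current))
    return chunks
```

At inference time the labels would come from the model's per-token predictions rather than from the known paragraph boundaries, and `split_by_labels` shows how those predictions map back to chunks.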

The library could be used as a text splitter module in a RAG system or for splitting transcripts, for example.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
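A rough sketch of that usage pattern, using only the Python standard library for the markup stripping. The names `strip_markup` and `chunk_document` are hypothetical, and `split_fn` is a placeholder for whatever callable wraps the model; the library's real interface is documented in its README and is not assumed here.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, List

# Tags that should introduce a whitespace break in the extracted text.
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Join the pieces and collapse runs of whitespace.
        return " ".join("".join(self._parts).split())


def strip_markup(html: str) -> str:
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_document(html: str, split_fn: Callable[[str], Iterable[str]]) -> List[str]:
    """Strip markup, then hand the plain text to a splitter callable."""
    return list(split_fn(strip_markup(html)))
```

Keeping the splitter as a plain callable makes it easy to swap the neural splitter for a heuristic one when comparing the two.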

The problem is that, although in theory this should improve overall RAG pipeline performance, I didn’t manage to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...


michaelmarkell ◴[13 Apr 25 13:37 UTC] No.43672718[source]▶

>>43652968 (OP) #

It seems to me like chunking (or some higher order version of it like chunking into knowledge graphs) is the highest leverage thing someone can work on right now if trying to improve intelligence of AI systems like code completion, PDF understanding etc. I’m surprised more people aren’t working on this.

replies(2): >>43672966 #>>43674419 #

serjester ◴[13 Apr 25 14:15 UTC] No.43672966[source]▶

>>43672718 #

Chunking is less important in the long-context era, with most people just pulling in top 20 K. You obviously don’t want to butcher it, but you’ve got a lot of room for error.

replies(4): >>43673019 #>>43673160 #>>43673912 #>>43674441 #

1. michaelmarkell ◴[13 Apr 25 14:43 UTC] No.43673160[source]▶

>>43672966 #

In our use case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline PDF tables. In an ideal world we’d be “compressing” those embedded tables into some text that says “there’s a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality.”

In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.
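The table "compression" idea above could look roughly like the following sketch, assuming the table has already been extracted from the PDF into a list of row dictionaries. The function name `summarize_table` and the default tool name `table_query` are hypothetical placeholders, and the choice of statistics (sum and mean for numeric columns, cardinality for everything else) is just one reading of the description.

```python
from statistics import mean
from typing import Dict, List, Union

Cell = Union[int, float, str]


def summarize_table(rows: List[Dict[str, Cell]], tool_name: str = "table_query") -> str:
    """Describe a table in a few lines of text instead of embedding its rows."""
    if not rows:
        return "Empty table."
    columns = list(rows[0].keys())
    lines = [
        f"Table with {len(rows)} rows and columns: {', '.join(columns)}.",
        f"Use the {tool_name} tool to analyze it in full.",
    ]
    for col in columns:
        values = [row[col] for row in rows if col in row]
        numeric = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numeric and len(numeric) == len(values):
            # Fully numeric column: report aggregate statistics.
            lines.append(f"{col}: sum={sum(numeric):.2f}, mean={mean(numeric):.2f}")
        else:
            # Categorical or mixed column: report the number of distinct values.
            lines.append(f"{col}: cardinality={len(set(values))}")
    return "\n".join(lines)
```

The resulting summary is what would be embedded and retrieved, so a query matches the table as a whole rather than a random handful of its line items.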


Show HN: Chonky – a neural approach for text semantic chunking