Show HN: Chonky – a neural approach for text semantic chunking

1. michaelmarkell ◴[13 Apr 25 13:37 UTC] No.43672718[source]▶

It seems to me like chunking (or some higher order version of it like chunking into knowledge graphs) is the highest leverage thing someone can work on right now if trying to improve intelligence of AI systems like code completion, PDF understanding etc. I’m surprised more people aren’t working on this.

replies(2): >>43672966 #>>43674419 #

2. serjester ◴[13 Apr 25 14:15 UTC] No.43672966[source]▶

>>43672718 (TP) #

Chunking is less important in the long context era with most people just pulling in top 20 K. You obviously don’t want to butcher it, but you’ve got a lot of room for error.

replies(4): >>43673019 #>>43673160 #>>43673912 #>>43674441 #

3. lmeyerov ◴[13 Apr 25 14:22 UTC] No.43673019[source]▶

>>43672966 #

Yeah exactly

We still want chunking in practice to avoid LLM confusion, undifferentiated embeddings, and handling large datasets at lower cost + large volumes. Large context means we can now tolerate multi-paragraph/page, so more like chunk by coherent section.

In theory we can do entire chapter/book, but those other concerns come in, so I only see more niche tools or talk-to-your-PDF do that.

At the same time, embedding is often a significant cost in above scenarios, so I'm curious about the semantic chunking overheads..

4. michaelmarkell ◴[13 Apr 25 14:43 UTC] No.43673160[source]▶

>>43672966 #

In our use-case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline pdf tables. In an ideal world we’d be “compressing” those embedded tables into some text that says “there’s a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality.”

In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.

5. DeveloperErrata ◴[13 Apr 25 16:23 UTC] No.43673912[source]▶

>>43672966 #

Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high throughput models, setting up your own infra for long context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.

6. J_Shelby_J ◴[13 Apr 25 17:34 UTC] No.43674419[source]▶

>>43672718 (TP) #

That makes me feel better about spending so much time implementing this balanced text chunker last year. https://github.com/ShelbyJenkins/llm_utils

It splits an input text into equal sized chunks using DFS and parallelization (rayon) to do so relatively quickly.

However, the goal for me is to use a n LLM to split text by topic. I’m thinking I will implement it as an API saas service on top of it being OSS. Do you think it’s a viable business? You send a library of text, and receive a library of single topic context chunks as output.

7. J_Shelby_J ◴[13 Apr 25 17:37 UTC] No.43674441[source]▶

>>43672966 #

“Performance is less important in an era of multi-core CPUs.”