Understanding the BM25 full text search algorithm

(emschwartz.me)

305 points rrampage | 2 comments | 20 Nov 24 03:43 UTC

jankovicsandras ◴[20 Nov 24 06:18 UTC] No.42191251[source]▶

>>42190650 (OP) #

Shameless plug:

https://github.com/jankovicsandras/plpgsql_bm25

https://github.com/jankovicsandras/bm25opt

replies(2): >>42192810 #>>42194312 #

mark_l_watson ◴[20 Nov 24 11:16 UTC] No.42192810[source]▶

>>42191251 #

Thanks, yesterday I was thinking of adding BM25 to a little side project, so a well timed plug!

Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch, but that seems too heavyweight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2, since SQLite's FTS5 extension ranks results with BM25. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.
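For reference, a minimal sketch of the SQLite route using only the standard-library sqlite3 module. It assumes the bundled SQLite was compiled with FTS5 (common, but not guaranteed); the table name, column names, and sample rows are made up for illustration.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    con = sqlite3.connect(":memory:")
    # Assumption: this SQLite build includes the FTS5 extension.
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    con.executemany("INSERT INTO docs(title, body) VALUES (?, ?)", docs)
    return con


def search(con, query, limit=5):
    """Return (title, score) rows for `query`, best match first."""
    # FTS5's bm25() gives better matches numerically lower values,
    # so a plain ascending ORDER BY puts the best match first.
    sql = (
        "SELECT title, bm25(docs) AS score "
        "FROM docs WHERE docs MATCH ? ORDER BY score LIMIT ?"
    )
    return con.execute(sql, (query, limit)).fetchall()


# Hypothetical sample documents.
con = build_index([
    ("intro", "an overview of the bm25 ranking function"),
    ("recipe", "cooking pasta at home"),
    ("notes", "vector stores and embeddings"),
    ("misc", "assorted thoughts on databases"),
])

for title, score in search(con, "bm25"):
    print(title, score)
```

Text extracted from PDFs would go into the `body` column the same way as any other string.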

replies(1): >>42199238 #

rogerbinns ◴[20 Nov 24 23:28 UTC] No.42199238{3}[source]▶

>>42192810 #

My project APSW may have exactly what you need. It wraps SQLite, providing a Python API, and that includes the FTS5 full text search functionality. https://rogerbinns.github.io/apsw/textsearch.html

You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
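To give a feel for what a pure Python BM25 ranking function involves, here is a rough standalone sketch. It is not APSW's implementation; it assumes documents are already tokenized into lists of strings, uses the common defaults k1=1.2 and b=0.75, and uses the IDF variant with +1 inside the logarithm so scores never go negative. The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    `docs` is a non-empty list of token lists. Higher score = better match.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # IDF with +1 inside the log (keeps the value non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical pre-tokenized documents.
sample_docs = [
    ["bm25", "ranking", "function"],
    ["cooking", "pasta"],
    ["search", "engine", "ranking"],
]
sample_scores = bm25_scores(["bm25"], sample_docs)
```

Only the first sample document contains the query term, so it is the only one with a non-zero score.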

You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
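As an illustration of what case folding and accent stripping mean in practice, here is a standard-library-only sketch. It is an assumption about the general technique, not the code behind APSW's Simplify tokenizer, and the function name is made up.

```python
import unicodedata


def simplify(token):
    """Case-fold a token and strip combining accent marks."""
    # casefold() is a more aggressive lower() meant for caseless matching.
    # NFKD splits accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(simplify("Caf\u00e9"))  # prints "cafe"
```

Applying the same function to both indexed text and queries is what lets "Café" and "cafe" match each other.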

There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html

replies(2): >>42199410 #>>42220253 #

1. radiator ◴[20 Nov 24 23:48 UTC] No.42199410{4}[source]▶

>>42199238 #

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?

replies(1): >>42206613 #

2. rogerbinns ◴[21 Nov 24 17:34 UTC] No.42206613[source]▶

>>42199410 (TP) #

SQLite itself is in C, so you can use the API directly: https://www.sqlite.org/fts5.html#custom_tokenizers

The text is in UTF-8 bytes, so any C code would have to deal with that and with mapping to Unicode codepoints, plus lots of other text processing, so some kind of library would also be needed. I don't know of any.
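The byte versus codepoint issue can be illustrated in a few lines of Python (not C). FTS5 tokenizers report token positions as byte offsets into the UTF-8 input, so offsets found on the decoded string have to be mapped back to bytes. This is a simplified sketch with a made-up function name; it uses a plain `\w+` regex where a real tokenizer would apply proper Unicode word segmentation.

```python
import re


def tokenize_utf8(data: bytes):
    """Return (byte_start, byte_end, token) tuples for UTF-8 input."""
    text = data.decode("utf-8")
    tokens = []
    for m in re.finditer(r"\w+", text):
        # Convert the codepoint offset to a byte offset by re-encoding
        # the prefix. Simple, but quadratic on long inputs.
        start = len(text[:m.start()].encode("utf-8"))
        end = start + len(m.group().encode("utf-8"))
        tokens.append((start, end, m.group()))
    return tokens


# "naïve café": the two accented letters each take two bytes in UTF-8.
sample = "na\u00efve caf\u00e9".encode("utf-8")
sample_tokens = tokenize_utf8(sample)
```

Here the second token starts at codepoint 6 but byte 7, which is the kind of mismatch C code would have to track by hand.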