Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch but that seems too heavy weight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2 since SQLite uses BM25. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.
You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html