I'm kinda allergic to writing "I did the thing" posts, so I can't help but tryhard and attempt to make them compelling somehow.
Writing in this manner is also very helpful in making sense of the work for myself. Takes a better understanding of the subject to thoroughly explain what you've built than to merely build it. Sometimes I've gone back and read through one of these updates to just get a refresher on what my thinking was when I built something.
I'm asking this as one of my projects is a link aggregator similar to old reddit (and HN to some extent) and I would like to be able to present to users a search box, but without having to implement document indexing and search. (I assume ad principio that the website is already aligned ethically and technologically with what Marginalia stands for :D)
Language detection and sentence splitting are the other two slow bits of processing.
When it works, one of the things I have in mind is making a site search-esque functionality available, as well as exposing it via the public API so that it can be white-labeled.
Small UI issue: on desktop, the left sidebar should be scrollable, because right now on Firefox I can't reach the "Language" menu item in the search results view unless I zoom out.
Some fun context: I was trying to find a scanned copy of the first 'correct' book on optics (written by https://en.wikipedia.org/wiki/Ibn_al-Haytham), possibly the first person to really use the scientific method, circa 1000 CE (!!). And I found this (https://cudl.lib.cam.ac.uk/view/MS-PETERHOUSE-00209/103), filled with interesting optical diagrams like something out of my high school physics notebooks. Anyway, I was also thinking about how they might index interesting doodles in the margins. So it was on my mind.
> Sentences are stemmed and POS-tagged. Sentences, with stemming and POS-tag data is fed into keyword extraction algorithms
IS AI, it's just old fashioned and bad AI. What he's trying will never work well, for the same reason rule-based machine translation never worked well: there are just too many rules and exceptions. Simplicity is great when you can have it, but with human language, simplicity was never on the table.
He's going to have to bite the bullet and use document embedding models sooner or later.
Likely I am totally not understanding what this search engine is for. I see this a lot on submissions here. I find something interesting sounding but I don’t understand the context. Maybe it’s just me, but it’s confusing.
If you read his about page, it is basically an anti-centralization anti-ad anti-spyware attempt at websearch. It is also "The project is independent in that it has no loans, no investors looking for a payday, no strings attached anywhere to pressure it into doing anything than providing as much and as good internet search as it is capable of."
It not indexing NYT seems precisely on brand.
It's not a google replacement, and if you already know what you're looking for then it's probably not the right tool.
Maybe you're looking for mechanical keyboard discussions; then a search for "mechanical keyboard" in the Blogs[1] or Forums[2] filters may provide results you are into.
It's also pretty good at unearthing weird stuff. Say you want to read up on Jack Parsons[3], that Jet Propulsion Lab guy who dabbled in occultism, fell in with Aleister Crowley, got scammed out of his wealth by L. Ron Hubbard, and finally blew himself up. That is the sort of topic Marginalia Search generally excels at.
[1] https://marginalia-search.com/search?query=mechanical+keyboa...
[2] https://marginalia-search.com/search?query=mechanical+keyboa...
[3] https://marginalia-search.com/search?query=Jack+Parsons&prof...
Where it particularly shines is finding highly specific results that get buried in other search engines. Some topics (particularly topics of high commercial interest) have become impossible to research on mainstream search engines. Marginalia will actually find informative articles about these topics rather than page after page of product results and spam.
It may not be useful to you if you’re not a researcher, writer, or someone who often needs to dig deeply into subjects beyond the level of common knowledge.
Though since the search engine doesn't really apply much in terms of domain authority, this doesn't rank very highly; the websites that talk about Ezra Klein rank higher.
[1] https://marginalia-search.com/search?query=site%3Anytimes.co...
I'm confused by this. TF-IDF incorporates the document frequency (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but also relies on document frequencies. What's the difference?
When searching with BM25, that information is a lot more accessible, since you already fetch it indirectly as part of looking up the document lists, and this is typically only done up to about a dozen times per query.
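To make the point about query-time lookups concrete, here is a minimal Python sketch. It is not Marginalia's code: the toy `index`, the `doc_lens` table, the function names, and the parameters `k1=1.2` and `b=0.75` are all assumptions for illustration, and the IDF formula is the commonly used Okapi/Lucene-style variant. It shows that the document frequency falls out of the posting list lookup for each query term, so nothing has to be looked up for every word in every document.

```python
import math

# Hypothetical toy inverted index: term -> list of (doc_id, term_frequency).
index = {
    "mechanical": [(0, 2), (1, 1), (3, 1)],
    "keyboard": [(0, 1), (3, 3)],
}
# Hypothetical document lengths in tokens, keyed by doc_id.
doc_lens = {0: 100, 1: 80, 2: 120, 3: 100}


def tfidf_idf(n_docs, doc_freq):
    """Classic TF-IDF inverse document frequency."""
    return math.log(n_docs / doc_freq)


def bm25_idf(n_docs, doc_freq):
    """BM25 inverse document frequency (smoothed, always non-negative)."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to a single document."""
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return bm25_idf(n_docs, doc_freq) * saturation


def search(query_terms):
    """Score documents for a query; one posting list lookup per query term."""
    n_docs = len(doc_lens)
    avg_doc_len = sum(doc_lens.values()) / n_docs
    scores = {}
    for term in query_terms:
        postings = index.get(term, [])
        doc_freq = len(postings)  # document frequency comes for free here
        for doc_id, tf in postings:
            scores[doc_id] = scores.get(doc_id, 0.0) + bm25_term_score(
                tf, doc_lens[doc_id], avg_doc_len, n_docs, doc_freq
            )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in search(["mechanical", "keyboard"]):
        print(doc_id, round(score, 3))
```

Both formulas need the same statistic (how many documents contain the term). The practical difference in this sketch is when it gets looked up: once per query term at search time, rather than once per distinct word while processing each document.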