Building a Simple Search Engine That Works

(karboosx.net)

279 points freediver | 4 comments | 17 Nov 25 03:52 UTC | HN request time: 0.001s | source

Show context

marginalia_nu ◴[17 Nov 25 09:44 UTC] No.45952174[source]▶

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. Probably perfectly fine if you're indexing your own website, but will likely choke on something the size of English wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/

replies(7): >>45952237 #>>45952734 #>>45952769 #>>45952991 #>>45953075 #>>45953286 #>>45954345 #

1. mapt ◴[17 Nov 25 12:42 UTC] No.45953075[source]▶

>>45952174 #

What is the order of magnitude of the largest document store that you can practically work from SQLite on a single thousand-dollar server run by some text-heavy business process? For text search, roughly how big of a corpus can we practically search if we're occupying... let's say five seconds per query, twelve queries per minute?

replies(1): >>45953703 #

2. marginalia_nu ◴[17 Nov 25 14:12 UTC] No.45953703[source]▶

>>45953075 (TP) #

If you held a gun to my head and forced me to make a guess I'd say you could push that approach to order of 100K, maybe 1M documents.

If sqlite had a generic "strictly ascending sequence of integers" type[1] and would optimize around that, you could probably push it farther in terms of implementing efficient inverted indexes.

[1] primary key tables aren't really useful here.

replies(2): >>45958564 #>>45959799 #

3. radiator ◴[17 Nov 25 21:31 UTC] No.45958564[source]▶

>>45953703 #

From my experience, SQLite's FTS5 is orders of magnitude more performant than that, i.e. for 100K documents, 7 queries/second on some of the cheapest 1 vCPU Virtual Machines.

But it is true that a specialized search engine using a more clever algorithm might be another order of magnitude faster.

4. luizfelberti ◴[17 Nov 25 23:47 UTC] No.45959799[source]▶

>>45953703 #

> If sqlite had a generic "strictly ascending sequence of integers" type

Is that not what WITHOUT ROWID does? My understanding is that it's precisely meant to physically cluster data in the underlying B-Tree

If that is not what you meant, could you elaborate on the "primary key tables aren't really useful here" footnote?

↑