Building a Simple Search Engine That Works

1. marginalia_nu ◴[17 Nov 25 09:44 UTC] No.45952174[source]▶

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/

replies(7): >>45952237 #>>45952734 #>>45952769 #>>45952991 #>>45953075 #>>45953286 #>>45954345 #

2. submeta ◴[17 Nov 25 09:55 UTC] No.45952237[source]▶

>>45952174 (TP) #

Thank you very much for the recommendation. I am in the process of building knowledge base bots, and am confronted with the task of creating various crawlers for the different sources the company has. And this book comes in very handy.

3. HelloUsername ◴[17 Nov 25 11:42 UTC] No.45952769[source]▶

>>45952174 (TP) #

I love your https://marginalia-search.com :)

replies(1): >>45952830 #

4. marginalia_nu ◴[17 Nov 25 11:53 UTC] No.45952830[source]▶

>>45952769 #

"Building A Complex Search Engine That Works Sometimes"

replies(1): >>45953299 #

5. gcanyon ◴[17 Nov 25 12:25 UTC] No.45952991[source]▶

>>45952174 (TP) #

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?

Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯
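To make the ranking question concrete, here is a minimal TF-IDF scoring sketch. The function name `rank`, the `{doc_id: text}` input shape, and the whitespace tokenizer are all assumptions made for illustration; no engine mentioned in this thread is claimed to work this way.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return doc ids sorted by a simple TF-IDF score, best first.

    docs is a hypothetical {doc_id: text} mapping; tokenization is
    naive lowercase whitespace splitting.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # Rare terms get a higher weight than common ones.
                idf = math.log(1 + n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

With a score like this, a short page that repeats "gilligan" three times outranks a longer, more informative page that mentions it once, which is the keyword-stuffing weakness in its simplest form.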

replies(2): >>45953018 #>>45953084 #

6. marginalia_nu ◴[17 Nov 25 12:29 UTC] No.45953018[source]▶

>>45952991 #

That would be the "handling underspecified queries" thing I mentioned.

7. mapt ◴[17 Nov 25 12:42 UTC] No.45953075[source]▶

>>45952174 (TP) #

What is the order of magnitude of the largest document store that you can practically work from SQLite on a single thousand-dollar server run by some text-heavy business process? For text search, roughly how big of a corpus can we practically search if we're occupying... let's say five seconds per query, twelve queries per minute?

replies(1): >>45953703 #

8. dumbfounder ◴[17 Nov 25 12:43 UTC] No.45953084[source]▶

>>45952991 #

Elastic and many others fail to solve this problem too. There are many different strategies and many of them require ingenuity and development.

replies(1): >>45953256 #

9. jonstewart ◴[17 Nov 25 13:11 UTC] No.45953256{3}[source]▶

>>45953084 #

It’s not like ElasticSearch lacks ranking algorithms and control thereof. But it can require tuning and adjustment for various domains. Relevancy is, after all, subjective.

10. djoldman ◴[17 Nov 25 13:17 UTC] No.45953286[source]▶

>>45952174 (TP) #

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

Large amounts of data seem obviously difficult.

For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.

marginalia search is great as a contrast to engines like google, in part because google chooses to display advertisements as the most relevant results.

Have you found any of the TREC papers helpful?

https://trec.nist.gov/

11. moffkalast ◴[17 Nov 25 13:19 UTC] No.45953299{3}[source]▶

>>45952830 #

15% of the time it works every time.

12. marginalia_nu ◴[17 Nov 25 14:12 UTC] No.45953703[source]▶

>>45953075 #

If you held a gun to my head and forced me to make a guess I'd say you could push that approach to the order of 100K, maybe 1M documents.

If sqlite had a generic "strictly ascending sequence of integers" type[1] and would optimize around that, you could probably push it farther in terms of implementing efficient inverted indexes.

[1] primary key tables aren't really useful here.
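For context on why a strictly ascending sequence of integers matters: an inverted index maps each term to a sorted list of document ids, and an AND query is an intersection of those lists. The sketch below uses hypothetical function names and plain Python lists; it illustrates the general technique and is not a description of how any particular engine or SQLite stores its data.

```python
def intersect(a, b):
    """Intersect two strictly ascending lists of doc ids in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def delta_encode(ids):
    """Store the gaps between consecutive ids instead of the ids themselves."""
    prev = 0
    gaps = []
    for doc_id in ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild the ascending id list from its gaps."""
    total = 0
    ids = []
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

Because the ids only ever increase, the gaps are small positive numbers, which is what makes compact encodings of such lists possible.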

replies(2): >>45958564 #>>45959799 #

13. zipy124 ◴[17 Nov 25 15:19 UTC] No.45954345[source]▶

>>45952174 (TP) #

I think in today's world the harder problem is evading SEO spam. A search engine is in constant war with adversarial players, who need you to see their content for revenue, rather than the actual answer.

This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.

replies(3): >>45954581 #>>45954763 #>>45955477 #

14. zppln ◴[17 Nov 25 15:43 UTC] No.45954581[source]▶

>>45954345 #

I feel at this point you'd almost be better off hand-curating a set of domains and only crawl those.

replies(1): >>45956058 #

15. jayd16 ◴[17 Nov 25 15:58 UTC] No.45954763[source]▶

>>45954345 #

I wonder how hard it is when mice are not paying the cat to serve ads.

replies(1): >>45959378 #

16. HEmanZ ◴[17 Nov 25 16:59 UTC] No.45955477[source]▶

>>45954345 #

There are more kinds of search engines than just internet search engines. At this point I’m almost certain that the non-internet search engines of the world are much larger than internet search engines.

Edit: And I’m getting downvoted for this. If it’s because I am tangential to the original comment then that’s fair. If it’s because you think I’m wrong, I have worked on the two largest internet search engines in the world and one non-internet search engine that dwarfed both in size (although different in complexity).

replies(2): >>45956098 #>>45960053 #

17. skeeter2020 ◴[17 Nov 25 17:55 UTC] No.45956058{3}[source]▶

>>45954581 #

Not sure if this was intentional, but everything old is new again; back to OG Yahoo? Or Craigslist?

replies(1): >>45956327 #

18. ◴[17 Nov 25 17:59 UTC] No.45956098{3}[source]▶

>>45955477 #

19. graemep ◴[17 Nov 25 18:23 UTC] No.45956327{4}[source]▶

>>45956058 #

Not quite, in that you can curate domains but crawl all the urls on those domains.

I think SEO spam + AI slop is likely to lead us back to human curation.
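The "curate domains, crawl every URL on them" idea amounts to a host filter in front of the crawl queue. A minimal sketch, assuming a hand-maintained allowlist; `ALLOWED_DOMAINS` and `should_crawl` are hypothetical names and the domains are placeholders:

```python
from urllib.parse import urlsplit

ALLOWED_DOMAINS = {"example.org", "example.com"}  # hypothetical curated list

def should_crawl(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Match on a leading dot so "notexample.org" does not pass as "example.org".
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Any link discovered on a curated site would then be followed only if it stays on a curated domain.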

20. radiator ◴[17 Nov 25 21:31 UTC] No.45958564{3}[source]▶

>>45953703 #

From my experience, SQLite's FTS5 is orders of magnitude more performant than that, i.e. for 100K documents, 7 queries/second on some of the cheapest 1 vCPU Virtual Machines.

But it is true that a specialized search engine using a more clever algorithm might be another order of magnitude faster.
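For readers who have not used it, a minimal FTS5 setup from Python looks roughly like this. It assumes the SQLite library linked into your Python build was compiled with FTS5 (common, but not guaranteed); the table name `docs` and the helpers `build_index` and `search` are made up for the example, and nothing here reproduces the throughput figures quoted above.

```python
import sqlite3

def build_index(docs):
    """Create an in-memory FTS5 index from (title, body) pairs.

    Raises sqlite3.OperationalError if FTS5 is not compiled in.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    return conn

def search(conn, query, limit=10):
    """Return titles of matching rows, best BM25 match first."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]
```

`ORDER BY rank` uses FTS5's built-in BM25 ranking, so the relevance logic is whatever SQLite ships rather than something tuned for a given corpus.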

21. marginalia_nu ◴[17 Nov 25 22:55 UTC] No.45959378{3}[source]▶

>>45954763 #

It sure helps, though there's still a lot of adversarial content you need to deal with, so it's not a solved problem even if you remove the conflict of interest.

22. luizfelberti ◴[17 Nov 25 23:47 UTC] No.45959799{3}[source]▶

>>45953703 #

> If sqlite had a generic "strictly ascending sequence of integers" type

Is that not what WITHOUT ROWID does? My understanding is that it's precisely meant to physically cluster data in the underlying B-Tree

If that is not what you meant, could you elaborate on the "primary key tables aren't really useful here" footnote?
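To make the question concrete, this is roughly what a WITHOUT ROWID postings table looks like. The table and column names are hypothetical, and the sketch only shows the clustering behaviour being asked about; it does not settle whether that is sufficient for the inverted-index use case in the parent comment.

```python
import sqlite3

def make_postings(conn):
    """Create a postings table clustered on (term, doc_id)."""
    # In a WITHOUT ROWID table, rows are stored in primary-key order, so
    # all doc ids for one term are adjacent in the B-tree and ascending.
    conn.execute(
        """
        CREATE TABLE postings (
            term   TEXT    NOT NULL,
            doc_id INTEGER NOT NULL,
            PRIMARY KEY (term, doc_id)
        ) WITHOUT ROWID
        """
    )

def doc_ids(conn, term):
    """Return the ascending doc ids recorded for a term."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE term = ? ORDER BY doc_id", (term,)
    )
    return [doc_id for (doc_id,) in rows]
```

Each posting here is still an individual table row rather than an element of a packed integer sequence, which may be the distinction the footnote is pointing at.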

23. dafelst ◴[18 Nov 25 00:28 UTC] No.45960053{3}[source]▶

>>45955477 #

What do you mean by a non-internet search engine, and what might be one that is bigger than Google/Bing?

replies(1): >>45963068 #

24. HEmanZ ◴[18 Nov 25 09:28 UTC] No.45963068{4}[source]▶

>>45960053 #

You’ve got to remember that Google/Bing do not index the entire internet. Part of their magic is selectively indexing only a tiny sliver and still being effective.

Other kinds of search systems have to index everything, which simplifies things but has its own scaling challenges.

Easiest way to think about it is that while the majority of webpages are never indexed, every blob of text in a social media post, private message in an app, email, document, etc in every major app in the world, including the ones with billions of users, is indexed in a search engine for that app:

- GSuite search (think of how many gmails are searchable in the world right now… and they are all indexed)

- the enterprise search powering ChatGPT, Claude (these may be there by now; if not, they are likely well on the way)

- The Microsoft 365 search (this is probably massive with so many corporate email systems and teams systems on it)

- Slack search

- X (Twitter) search

- TikTok search (this idk, I’ve never used TikTok but if every video and every comment is searchable then this is probably huge)

- Facebook search (especially since this is likely combined across its product suite)

These are probably all larger in effective size than Google or Bing.