
279 points freediver | 2 comments
marginalia_nu ◴[] No.45952174[source]
The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/
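
To illustrate how simple the core idea can be, here is a minimal sketch of an in-memory inverted index. It is not from the comment above: the whitespace tokenizer, the AND-only query semantics, and the names `tokenize`, `build_index`, and `search` are all assumptions made for illustration, and it ignores ranking and the scale problems the comment describes.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # .get avoids creating empty entries for unknown terms in the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus, purely for illustration.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}
index = build_index(docs)
```

Calling `search(index, "quick fox")` intersects the posting sets for both terms. Everything lives in one Python dict here, which is exactly the part that stops working once the corpus no longer fits in memory.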

replies(7): >>45952237 #>>45952734 #>>45952769 #>>45952991 #>>45953075 #>>45953286 #>>45954345 #
zipy124 ◴[] No.45954345[source]
I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players, who need you to see their content for revenue, rather than the actual answer.

This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.

replies(3): >>45954581 #>>45954763 #>>45955477 #
HEmanZ ◴[] No.45955477[source]
There are more kinds of search engines than just internet search engines. At this point I’m almost certain that the non-internet search engines of the world are much larger than internet search engines.

Edit: And I’m getting downvoted for this. If it’s because I am tangential to the original comment then that’s fair. If it’s because you think I’m wrong, I have worked on the two largest internet search engines in the world and one non-internet search engine that dwarfed both in size (although different in complexity).

replies(2): >>45956098 #>>45960053 #
1. dafelst ◴[] No.45960053[source]
What do you mean by a non-internet search engine, and what might be one that is bigger than Google/Bing?
replies(1): >>45963068 #
2. HEmanZ ◴[] No.45963068[source]
You’ve got to remember that Google/Bing do not index the entire internet. Part of their magic is selectively indexing only a tiny sliver and still being effective.

Other kinds of search systems have to index everything, which simplifies things but has its own scaling challenges.

Easiest way to think about it is that while the majority of webpages are never indexed, every blob of text in a social media post, private message in an app, email, document, etc in every major app in the world, including the ones with billions of users, is indexed in a search engine for that app:

- GSuite search (think of how many gmails are searchable in the world right now… and they are all indexed)

- the enterprise search powering ChatGPT, Claude (these may be there by now; if not, they are likely well on the way)

- The Microsoft 365 search (this is probably massive with so many corporate email systems and teams systems on it)

- Slack search

- X (Twitter) search

- TikTok search (this idk, I’ve never used TikTok but if every video and every comment is searchable then this is probably huge)

- Facebook search (especially since this is likely combined across its product suite)

These are probably all larger in effective size than Google or Bing.