Building a Simple Search Engine That Works

(karboosx.net)

279 points freediver | 2 comments | 17 Nov 25 03:52 UTC


marginalia_nu ◴[17 Nov 25 09:44 UTC] No.45952174[source]▶

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/
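To illustrate how simple the core idea is, here is a minimal sketch of an in-memory inverted index with AND queries. It is not taken from the linked article or the book; the function names (`tokenize`, `build_index`, `search`) and the sample documents are made up for illustration, and it ignores ranking, stemming, and everything that makes search hard at scale.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Return ids of documents containing every query term (boolean AND).
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus, for illustration only.
docs = {
    1: "simple search engine",
    2: "search at wikipedia scale",
    3: "simple database",
}
index = build_index(docs)
```

Calling `search(index, "simple search")` intersects the posting sets for both terms. Everything here lives in one process's memory, which is exactly the part that stops working once the corpus grows.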

replies(7): >>45952237 #>>45952734 #>>45952769 #>>45952991 #>>45953075 #>>45953286 #>>45954345 #

zipy124 ◴[17 Nov 25 15:19 UTC] No.45954345[source]▶

>>45952174 #

I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players, who need you to see their content for revenue, rather than the actual answer.

This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.

replies(3): >>45954581 #>>45954763 #>>45955477 #

zppln ◴[17 Nov 25 15:43 UTC] No.45954581[source]▶

>>45954345 #

I feel at this point you'd almost be better off hand-curating a set of domains and only crawling those.

replies(1): >>45956058 #

1. skeeter2020 ◴[17 Nov 25 17:55 UTC] No.45956058[source]▶

>>45954581 #

Not sure if this was intentional, but everything old is new again; back to OG Yahoo? Or Craigslist?

replies(1): >>45956327 #

2. graemep ◴[17 Nov 25 18:23 UTC] No.45956327[source]▶

>>45956058 (TP) #

Not quite, in that you can curate domains but crawl all the URLs on those domains.

I think SEO spam + AI slop is likely to lead us back to human curation.
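As a rough sketch of the "curate domains, crawl every URL on them" idea, a crawler could filter discovered links against a hand-maintained allowlist. The domain names and the helper names (`ALLOWED_DOMAINS`, `is_allowed`, `filter_links`) are hypothetical, and treating subdomains as part of a curated domain is an assumption, not something stated in the thread.

```python
from urllib.parse import urlparse

# Hypothetical hand-curated allowlist; a real one would be maintained by people.
ALLOWED_DOMAINS = {"example.org", "example.com"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    # Accept a URL if its host is a curated domain or a subdomain of one.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_links(links, allowed=ALLOWED_DOMAINS):
    # Keep only the links the crawler is permitted to follow.
    return [u for u in links if is_allowed(u, allowed)]
```

The check is on the host only, so any path on a curated domain is crawlable, while look-alike hosts such as `notexample.org` are rejected because the match requires a leading dot.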
