
The man who killed Google Search?

(www.wheresyoured.at)
1884 points | elorant | 6 comments
gregw134 ◴[] No.40136741[source]
Ex-Google search engineer here (2019-2023). I know a lot of the veteran engineers were upset when Ben Gomes got shunted off. Probably the bigger change, from what I've heard, was losing Amit Singhal, who led Search until 2016. Amit fought against creeping complexity. There is a semi-famous internal document he wrote where he argued against the other search leads that Google should use less machine learning, or at least contain it as much as possible, so that ranking stays debuggable and understandable by human search engineers. My impression is that since he left, complexity exploded, with every team launching as many deep learning projects as they can (just like every other large tech company has).

The problem, though, is that the older systems had obvious problems, while the newer systems have hidden bugs and conceptual issues which often don't show up in the metrics, and which compound over time as more complexity is layered on. For example: I found an off-by-one error deep in a formula from an old launch that has been reordering top results for 15% of queries since 2015. I handed it off when I left but have no idea whether anyone actually fixed it or not.

I wrote up all of the search bugs I was aware of in an internal document called "second page navboost", so if anyone working on search at Google reads this and needs a launch go check it out.

replies(11): >>40136833 #>>40136879 #>>40137570 #>>40137898 #>>40137957 #>>40138051 #>>40140388 #>>40140614 #>>40141596 #>>40146159 #>>40166064 #
mrkeen ◴[] No.40141596[source]
> There is a semi-famous internal document he wrote where he argued against the other search leads that Google should use less machine learning, or at least contain it as much as possible, so that ranking stays debuggable and understandable by human search engineers.

There's a lot of ML hate here, and I simply don't see the alternative.

To rank documents, you need to score them. Google uses hundreds of scoring factors (I've seen the number 200 thrown about, but it doesn't really matter if it's 5 or 1000). The point is you need to combine these weighted factors into a single number to find out if a result should be above or below another result.

So, if:

  - document A is 2Kb long, has 14 misspellings, matches 2 of your keywords exactly, matches a synonym of another of your keywords, and was published 18 months ago, and

  - document B is 3Kb long, has 7 misspellings, matches 1 of your keywords exactly, matches two more keywords by synonym, and was published 5 months ago

Are there any humans out there who want to write a traditional forward-algorithm to tell me which result is better?
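
Here is a minimal sketch of what such a forward-algorithm could look like, in Python. Every weight below is invented for illustration; nothing here is Google's actual formula:

  def score(doc):
      # Hand-tuned linear combination; every constant is a judgment
      # call a human engineer has to pick, defend, and maintain.
      s = 0.0
      s += 1.50 * doc["exact_matches"]
      s += 0.70 * doc["synonym_matches"]
      s -= 0.05 * doc["misspellings"]
      s -= 0.02 * doc["age_months"]
      s -= 0.10 * abs(doc["length_kb"] - 2.5)  # prefer mid-length pages
      return s

  doc_a = {"length_kb": 2, "misspellings": 14, "exact_matches": 2,
           "synonym_matches": 1, "age_months": 18}
  doc_b = {"length_kb": 3, "misspellings": 7, "exact_matches": 1,
           "synonym_matches": 2, "age_months": 5}
  print(score(doc_a), score(doc_b))  # whichever scores higher ranks first

Even with five factors there are already dozens of interactions to reason about; with hundreds, nobody can hold the trade-offs in their head.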
replies(4): >>40141644 #>>40141688 #>>40144593 #>>40165827 #
1. datadeft ◴[] No.40141688[source]
You do not need to. Counting how many links point to each document is sufficient if you know how long each link has existed (spammers' link creation time distribution is widely different from natural link creation times, and there are many other details you can use to filter out spammers).
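
As a toy illustration of that idea (the burstiness test and the thresholds are invented here, not taken from any real spam filter):

  import statistics

  def looks_spammy(link_ages_days, burst_window=7, burst_share=0.8):
      # Hypothetical heuristic: flag a page whose inbound links all
      # appeared in one short burst, which is more typical of a paid
      # link campaign than of organic linking over time.
      if len(link_ages_days) < 5:
          return False  # too few links to judge either way
      spread = statistics.pstdev(link_ages_days)
      recent = sum(1 for a in link_ages_days if a <= burst_window)
      return spread < burst_window or recent / len(link_ages_days) > burst_share

  print(looks_spammy([3, 40, 200, 400, 800, 1200]))  # False: organic spread
  print(looks_spammy([1, 2, 2, 3, 4, 5, 6]))         # True: one-week blast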
replies(2): >>40141733 #>>40142033 #
2. raincole ◴[] No.40141733[source]
> spammers' link creation time distribution is widely different from natural link creation times

Yes, this is a statistical method. Guess what machine learning is, and what it actually excels at?

3. mrkeen ◴[] No.40142033[source]
> You do not need to.

Ranking means deciding which document (A or B) is better to return to the user when queried.

Not writing a traditional forward-algorithm to rank these documents implies one of the following:

- You write a "backward" algorithm (ML, regression, statistics, whatever you want to call it); a toy version is sketched after this list.

- You don't use algorithms to solve it. An army of humans chooses the rankings in real time.

- You don't rank documents at all.
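
Here is that sketch of the first option: fit the weights to relevance labels instead of hand-picking them. The documents, factors, and labels below are fabricated toy values:

  # Four documents, four factors each: [exact, synonym, misspellings, age_months]
  features = [
      [2, 1, 14, 18],
      [1, 2,  7,  5],
      [3, 0,  1,  2],
      [0, 1, 30, 40],
  ]
  labels = [0.4, 0.6, 0.9, 0.1]  # human relevance judgments in [0, 1]

  w = [0.0] * 4
  lr = 0.0001
  for _ in range(20000):  # plain stochastic gradient descent on squared error
      for x, y in zip(features, labels):
          err = sum(wi * xi for wi, xi in zip(w, x)) - y
          w = [wi - lr * err * xi for wi, xi in zip(w, x)]
  print(w)  # the fitted weights a human would otherwise have to invent

The weights now come out of the data, which is exactly the debuggability trade-off: when a ranking looks wrong, there is no single constant you can point at and justify.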

> Counting how many links are pointing to each document is sufficient if you know how long that link existed

- Link-counting (e.g. PageRank) is query-independent evidence. If that's sufficient for you, you'll always return the same set of documents to each user, regardless of what they typed into the search box.

At best you've just added two more ranking factors to the mix:

  - document A
    qie:
        length: 2Kb
        misspellings: 14
        age: 18 months
      + in-links: 4
      + in-link-spamminess: 2.31E4
    qde:
        matches 2 of your keywords exactly
        matches a synonym of another of your keywords

  - document B
    qie:
        length: 3Kb
        misspellings: 7
        age: 5 months
      + in-links: 2
      + in-link-spamminess: 2.54E3
    qde:
        matches 1 of your keywords exactly
        matches 2 keywords by synonym

So I ask again:

- Which document matches your query better, A or B?

- How did you decide that, such that not only can you program a non-ML algorithm to perform the scoring, but you're certain enough of your decision that you can fix the algorithm when it disagrees with you? (>> debuggable and understandable by human search engineers)

replies(3): >>40142262 #>>40146216 #>>40155577 #
4. datadeft ◴[] No.40142262[source]
Statistical methods are debuggable. Is PageRank not debuggable? I am not sure where ML starts and statistics ends.
5. srean ◴[] No.40146216[source]
A few minor nitpicks. PageRank is not just link counting; who is linking to the page matters. Among the linking pages, those that are ranked higher matter more -- and how does one figure out their rank? By PageRank. It may sound a bit like chicken and egg, but it's fine: it's the fixed point of a self-referential definition.
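
That fixed point can be computed by power iteration. A toy sketch: the 4-page graph and iteration count are made up, and 0.85 is the damping factor from the original paper:

  # page -> pages it links to; a link from a high-ranked page counts more
  links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
  n, d = 4, 0.85
  rank = [1.0 / n] * n
  for _ in range(50):  # iterate toward the fixed point
      new = [(1 - d) / n] * n
      for src, outs in links.items():
          for dst in outs:
              new[dst] += d * rank[src] / len(outs)
      rank = new
  print(rank)  # page 3 has no in-links, so it ends up ranked lowest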

PageRank-based ranking will not return the same set of pages. It's true that the ranking is global in the vanilla version of PageRank, but what gets returned, in rank order, is the set of qualifying pages, and that set is very much query-sensitive. PageRank also depends on a seed set of initial pages, and these may also be chosen in a query-dependent way.

All this is a little moot now, because PageRank, even defined in this way, stopped being useful a long time ago.

6. hongsy ◴[] No.40155577[source]
What's qie and qde?