67 points | jacobobryant | 1 comment
vitus No.43665000
Hm, I was curious about the biased shuffle, since I was not expecting that particular shape.

It looks like we basically build up the list of bookmarks as follows [0]:

    import random

    def biased_shuffle(items, p):
        if not items:
            return []
        x = random.random()
        if x < p:
            i = 0                             # take the first element
        else:
            i = random.randrange(len(items))  # take a uniformly random element
        rest = items[:i] + items[i+1:]        # rest of list, original order
        return [items[i]] + biased_shuffle(rest, p)
In some sense, it's reminiscent of selection sort. Any particular reason for choosing this approach? One "obvious" downside is that this ends up being an O(n*k) algorithm because you end up building a new list each iteration. It's also harder for me to intuitively understand how shuffled the list is based on the parameter, other than at the two extremes.

[0] https://github.com/jacobobryant/yakread/blob/bff756c68d86a07...

I'm also curious about the interleaving approach described in the last section:

> If you’ve already scrolled past all your unread bookmarked items several times but you have a bunch of new subscription items, we should probably lean towards recommending the subscription items.

> I do this by comparing the two lists pairwise and selecting an item via weighted random choice based on how many times they’ve been previously skipped (i.e. scrolled past in the For You feed). e.g. if the first bookmark item has been skipped twice and the first subscription item has been skipped once, then there’ll be a 40% chance we select the subscription item and a 60% chance we select the bookmark item.

I thought we wanted to prefer the item that hadn't been skipped as many times (so, prefer the subscription item)?

replies(1): >>43666170 #
jacobobryant No.43666170
For the interleaving, yes we want to prefer the item that's been skipped fewer times. I got the wording backwards in the article; I'll fix that.
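
Something like this, for example (a sketch, not the actual yakread code; weighting each head item by 1/(skips + 1) is just one way to make fewer skips win more often, and it reproduces the article's example as 60% subscription / 40% bookmark):

    import random

    def interleave(bookmarks, subs, skips):
        # skips: item -> times previously scrolled past (hypothetical helper)
        out = []
        while bookmarks and subs:
            wb = 1 / (skips[bookmarks[0]] + 1)  # fewer skips => larger weight
            ws = 1 / (skips[subs[0]] + 1)
            source = bookmarks if random.random() < wb / (wb + ws) else subs
            out.append(source.pop(0))
        return out + bookmarks + subs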

For shuffling, I was trying to come up with an approach that would recommend the top k items roughly the same amount regardless of how many total items are in the list. E.g. say you have 10 subscriptions that you really like: I want those to be a reasonable portion of your recommendations whether you've subscribed to 100 other subs or 1,000.

Contrast that with a weighted random shuffle where each subscription's weight is its affinity score and we sample based on weight without regard to the original order. That approach is much more influenced by the size of the total list; in my experience, the handful of subscriptions I really liked was always drowned out by all the other "speculative" subscriptions I had accumulated in my account.
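
For reference, that kind of order-agnostic weighted shuffle is usually implemented with the Efraimidis-Spirakis trick (a sketch; `affinity` as a dict of positive scores is an assumption):

    import random

    def weighted_shuffle(items, affinity):
        # Each item draws the key u**(1/weight) for uniform u; sorting by key
        # descending samples without replacement in proportion to affinity.
        return sorted(items,
                      key=lambda item: random.random() ** (1 / affinity[item]),
                      reverse=True)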

The computational complexity ends up being OK because we generally don't need to shuffle the whole list: I recommend items in batches of 30, so we just need that many items and then we can abort the shuffle. There's probably a more efficient way to implement this, though.
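
E.g. if the shuffle is exposed as a lazy generator (a hypothetical generator version, not the actual implementation), the batch case is just:

    from itertools import islice

    # Pull the first batch of 30 picks; the rest of the shuffle never runs.
    batch = list(islice(biased_shuffle(items, p=0.1), 30))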

During implementation I was actually thinking of this more as "sampling" than "shuffling", and I just ended up describing it as the latter when I wrote the post.

replies(2): >>43666700 #>>43672029 #
jacobobryant No.43666700
Also, I think the code you have isn't /quite/ correct. If x is greater than p, we don't immediately take a random element from the list; rather, we move on to the next element, generate a new x, and repeat. I.e. with p=0.1, there's a 10% chance we immediately take the first item; if we don't, there's a 10% chance we immediately take the second item, and so on. We only pick a completely random item as a fallback if we get to the end of the list without picking anything.
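
Sketched as a lazy generator (just the logic described above, not the actual implementation), which also makes the batch early-abort cheap:

    import random

    def biased_shuffle(items, p):
        remaining = list(items)
        while remaining:
            for i in range(len(remaining)):
                if random.random() < p:
                    break  # take item i with probability p
            else:
                i = random.randrange(len(remaining))  # fallback: uniform pick
            yield remaining.pop(i)

With p=0.1, the first remaining item is taken with probability 0.1, the second with roughly 0.9 * 0.1, and so on, so the front of the list decays geometrically before the uniform fallback kicks in.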
replies(1): >>43671891 #
vitus No.43671891
Ah, I overlooked the `some`. Interesting, that's even more skewed than I initially thought.