
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 1 comment
tmoertel No.41899091
Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
indoordin0saur No.41904092
Thank you! The thing I find tricky here is choosing the weight. One obvious way you might want to weight samples is by recency. E.g., if I have a table of user login events, I care more about seeing the ones that happened recently but still want to see some of the older ones. Would the algorithm still work if I converted a `created_at` timestamp to epoch time and used that? Or would I want to normalize it in some way?
tmoertel No.41905441
All of your `created_at` epoch times are going to be of similar magnitude, since they’re all about the same relative distance from the start of the epoch (usually 1970-01-01T00:00:00). So if you use them as weights directly, you’re in effect taking a uniform sample – every row has about the same chance of being chosen.
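To see how close to uniform this is, here is a small Python sketch (the timestamps are invented for illustration): two logins a full year apart differ in weight by less than two percent when raw epoch seconds are used as weights.

```python
# Hypothetical example: two login events a full year apart.
new_login = 1_700_000_000           # epoch seconds (roughly late 2023)
old_login = new_login - 31_536_000  # exactly one year (365 days) earlier

# Used directly as weights, their selection odds are nearly identical:
# the year-old row is only about 2% less likely to be chosen.
ratio = new_login / old_login
print(f"{ratio:.4f}")
```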

Based on what I glean about your scenario, I suspect you’d be better served by some kind of exponentially decaying weight based on age (i.e., age = current_timestamp - created_at). For example, if you wanted to make your rows half as likely to be selected every time they grew another hour older, you could use `POW(2.0, -age / 3600)` as your weight, where age is given in seconds.

    duckdb> WITH Ages AS (
              SELECT UNNEST([0, 3600, 7200]) AS age
            )
            SELECT age, POW(2.0, -age / 3600) AS w FROM Ages;
    ┌──────┬──────┐
    │ age  │ w    │
    ├──────┼──────┤
    │    0 │  1.0 │
    │ 3600 │  0.5 │
    │ 7200 │ 0.25 │
    └──────┴──────┘
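To show how such a weight plugs into weighted sampling, here is a Python sketch of one common order-by-random-key formulation (the Efraimidis–Spirakis key `ln(u) / w`, which I believe is in the spirit of the post's SQL, though this is not the post's exact query). The `logins` table and the one-hour half-life are made up for illustration.

```python
import math
import random

def weighted_sample(rows, n, weight, rng=random):
    """Weighted sample without replacement: give each row the random
    key ln(u) / w and keep the n rows with the largest keys."""
    # 1.0 - rng.random() is in (0, 1], so log() never sees zero.
    keyed = [(math.log(1.0 - rng.random()) / weight(r), r) for r in rows]
    keyed.sort(key=lambda kr: kr[0], reverse=True)
    return [r for _, r in keyed[:n]]

# Hypothetical login events, one every 30 minutes going back in time.
logins = [{"user": i, "age": i * 1800} for i in range(20)]

# Exponential decay with a one-hour half-life, as in the comment above.
def recency_weight(row):
    return 2.0 ** (-row["age"] / 3600)

random.seed(42)
sample = weighted_sample(logins, 5, recency_weight)
```

With this weight, recent logins dominate the sample, but every row keeps a nonzero chance of appearing, which matches the "mostly recent, some old" goal.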