
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 1 comment
tmoertel No.41899091
Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
indoordin0saur No.41904092
Thank you! The thing I find tricky here is choosing the weight. One obvious way you might want to weight samples is by recency. E.g., if I have a table of user login events, I care more about seeing the ones that happened recently but still want to see some of the older ones. Would the algorithm still work if I converted a `created_at` timestamp to epoch time and used that? Or would I want to normalize it in some way?
tmoertel No.41905441
All of your `created_at` epoch times are going to be of similar magnitude, since they’re all about the same relative distance from the start of the epoch (usually 1970-01-01T00:00:00). So if you use them as weights directly, you’re in effect taking a uniform sample – every row has about the same chance of being chosen.
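To see how close to uniform this is, here is a small Python sketch (the timestamps are invented for illustration): two logins a full year apart differ in weight by less than two percent when raw epoch seconds are used as weights.

```python
# Hypothetical example: two login events a full year apart.
new_login = 1_700_000_000           # epoch seconds (roughly late 2023)
old_login = new_login - 31_536_000  # exactly one year (365 days) earlier

# Used directly as weights, their selection odds are nearly identical:
# the year-old row is only about 2% less likely to be chosen.
ratio = new_login / old_login
print(f"{ratio:.4f}")
```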

Based on what I glean about your scenario, I suspect you’d be better served by some kind of exponentially decaying weight based on age (i.e., age = current_timestamp - created_at). For example, if you wanted to make your rows half as likely to be selected every time they grew another hour older, you could use `POW(2.0, -age / 3600)` as your weight, where age is given in seconds.

    duckdb> WITH Ages AS (
              SELECT UNNEST([0, 3600, 7200]) AS age
            )
            SELECT age, POW(2.0, -age / 3600) AS w FROM Ages;
    ┌──────┬──────┐
    │ age  │ w    │
    ├──────┼──────┤
    │    0 │  1.0 │
    │ 3600 │  0.5 │
    │ 7200 │ 0.25 │
    └──────┴──────┘
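To show how such a weight plugs into weighted sampling, here is a Python sketch of one common order-by-random-key formulation (the Efraimidis–Spirakis key `ln(u) / w`, which I believe is in the spirit of the post's SQL, though this is not the post's exact query). The `logins` table and the one-hour half-life are made up for illustration.

```python
import math
import random

def weighted_sample(rows, n, weight, rng=random):
    """Weighted sample without replacement: give each row the random
    key ln(u) / w and keep the n rows with the largest keys."""
    # 1.0 - rng.random() is in (0, 1], so log() never sees zero.
    keyed = [(math.log(1.0 - rng.random()) / weight(r), r) for r in rows]
    keyed.sort(key=lambda kr: kr[0], reverse=True)
    return [r for _, r in keyed[:n]]

# Hypothetical login events, one every 30 minutes going back in time.
logins = [{"user": i, "age": i * 1800} for i in range(20)]

# Exponential decay with a one-hour half-life, as in the comment above.
def recency_weight(row):
    return 2.0 ** (-row["age"] / 3600)

random.seed(42)
sample = weighted_sample(logins, 5, recency_weight)
```

With this weight, recent logins dominate the sample, but every row keeps a nonzero chance of appearing, which matches the "mostly recent, some old" goal.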