
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 7 comments
1. tmoertel ◴[] No.41899091[source]
Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
replies(2): >>41904092 #>>41904525 #
2. indoordin0saur ◴[] No.41904092[source]
Thank you! The thing I find tricky here is choosing the weight. I think maybe one obvious way you would want to weight samples would be for recency. E.g. if I have a table of user login events then I care about seeing more of the ones that happened recently but still want to see some of the older ones. Would the algorithm still work if I converted a `created_at` timestamp to epoch time and used that? Or would I want to normalize it in some way?
replies(2): >>41904836 #>>41905441 #
3. swasheck ◴[] No.41904525[source]
this is really fascinating and interesting to me. i’ve been using sql to analyze large data sets recently, since sql is my primary skillset, and having new methods and algorithms is quite handy.

i do have a question of clarification. this is built on weighted sampling, with the weights already in the dataset. does this indicate some sort of preprocessing to arrive at the weights?

replies(1): >>41904938 #
4. swasheck ◴[] No.41904836[source]
depending on what you’re analyzing, an EMA (exponential moving average) could give you a good weighting method
5. tmoertel ◴[] No.41904938[source]
The weights just indicate how important each row is to whatever you’re trying to estimate. Sometimes, your datasets will have numeric columns that naturally suggest themselves as weights. Other times, you may have to create weights.

For example, say you run a website like Wikipedia and want to estimate the percentage of page views that go to pages that are dangerously misleading. You have a dataset that contains all of your site’s pages, and you have logs that indicate which pages users have viewed. What you don’t have is a flag for each page that indicates whether it’s “dangerously misleading”; you’ll have to create it. And that’s likely to require a panel of expert human reviewers. Since it’s not practical to have your experts review every single page, you’ll want to review only a sample of pages. And since each page spreads misleading information only to the degree it’s viewed by users, you’ll want to weight each page by its view count. To get those counts, you’ll have to process the site logs, and count how many times each page was viewed. Then you can take a sample of pages weighted by those counts.
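
Once you have those per-page view counts, drawing the weighted sample itself can be done in a single query. Here’s a rough sketch (the `PageViews` log table, its columns, and the sample size of 500 are all made up), using the order-by-random-key trick for weighted sampling without replacement:

    -- Weight each page by its view count, then keep the 500 pages with the
    -- smallest randomized keys -LN(RANDOM()) / weight. That yields a
    -- weighted sample without replacement (Efraimidis–Spirakis style).
    WITH PageWeights AS (
      SELECT page_id, COUNT(*) AS views
      FROM PageViews  -- hypothetical log: one row per page view
      GROUP BY page_id
    )
    SELECT page_id, views
    FROM PageWeights
    ORDER BY -LN(RANDOM()) / views
    LIMIT 500;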

That’s a pretty typical scenario. You think about the question you’re trying to answer. You think about the data you have and the data you’ll need. Then you figure out how to get the data you need, hopefully without too much trouble.

replies(1): >>41906352 #
6. tmoertel ◴[] No.41905441[source]
All of your `created_at` epoch times are going to be of similar magnitude, since they’re all about the same relative distance from the start of the epoch (usually 1970-01-01T00:00:00). So if you use them as weights directly, you’re in effect taking a uniform sample – every row has about the same chance of being chosen.

Based on what I glean about your scenario, I suspect you’d be better served by some kind of exponentially decaying weight based on age (i.e., age = current_timestamp - created_at). For example, if you wanted to make your rows half as likely to be selected every time they grew another hour older, you could use `POW(2.0, -age / 3600)` as your weight, where age is given in seconds.

    duckdb> WITH Ages AS (
              SELECT UNNEST([0, 3600, 7200]) AS age
            )
            SELECT age, POW(2.0, -age / 3600) AS w FROM Ages;
    ┌──────┬──────┐
    │ age  │ w    │
    ├──────┼──────┤
    │    0 │  1.0 │
    │ 3600 │  0.5 │
    │ 7200 │ 0.25 │
    └──────┴──────┘
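
To actually draw the recency-weighted sample, you could fold that weight into an order-by-random-key query, which gives a weighted sample without replacement. A sketch only (the `logins` table, its `created_at` column, and the sample size of 1000 are assumptions):

    -- Weight each row by 2^(-age_in_hours), so a row’s chance of selection
    -- halves with every hour of age, then order by the randomized key
    -- -LN(RANDOM()) / w and keep the first 1000 rows.
    WITH Weighted AS (
      SELECT
        *,
        POW(2.0, -date_diff('second', created_at, CURRENT_TIMESTAMP) / 3600) AS w
      FROM logins  -- hypothetical table of user login events
    )
    SELECT *
    FROM Weighted
    ORDER BY -LN(RANDOM()) / w
    LIMIT 1000;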
7. swasheck ◴[] No.41906352{3}[source]
so that’s a “yes, in many cases it requires preprocessing which would occur prior to the scope of this article.”

thank you!