
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 2 comments
tmoertel ◴[] No.41899091[source]
Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
replies(2): >>41904092 #>>41904525 #
swasheck ◴[] No.41904525[source]
this is really fascinating and interesting to me. i’ve been using sql to analyze large data sets recently, since sql is my primary skill set, and having new methods and algorithms is quite handy and interesting.

i do have a clarifying question. this is built on weighted sampling, with the weights already in the dataset. does that imply some sort of preprocessing to arrive at the weights?

replies(1): >>41904938 #
1. tmoertel ◴[] No.41904938[source]
The weights just indicate how important each row is to whatever you’re trying to estimate. Sometimes, your datasets will have numeric columns that naturally suggest themselves as weights. Other times, you may have to create weights.

For example, say you run a website like Wikipedia and want to estimate the percentage of page views that go to pages that are dangerously misleading. You have a dataset that contains all of your site’s pages, and you have logs that indicate which pages users have viewed. What you don’t have is a flag for each page that indicates whether it’s “dangerously misleading”; you’ll have to create it. And that’s likely to require a panel of expert human reviewers. Since it’s not practical to have your experts review every single page, you’ll want to review only a sample of pages. And since each page spreads misleading information only to the degree it’s viewed by users, you’ll want to weight each page by its view count. To get those counts, you’ll have to process the site logs, and count how many times each page was viewed. Then you can take a sample of pages weighted by those counts.
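To make that scenario concrete, here is a minimal sketch using Python's sqlite3. The table and column names are hypothetical, and the sampling step uses the exponential-key trick (give each row the key -ln(U)/weight for uniform U and keep the smallest keys), which is one standard way to draw a weighted sample with a plain ORDER BY ... LIMIT. The key function is registered as a UDF because SQLite's built-in ln() isn't always compiled in:

```python
import math
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")

# Hypothetical schema: a table of pages and a raw view log.
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE page_views (page_id INTEGER);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [(1, "Popular"), (2, "Niche"), (3, "Obscure")])
# Page 1 gets almost all of the traffic.
views = [1] * 90 + [2] * 9 + [3] * 1
conn.executemany("INSERT INTO page_views VALUES (?)", [(v,) for v in views])

# Step 1 -- the "preprocessing": derive each page's weight
# by counting how many times it was viewed.
conn.execute("""
    CREATE TABLE page_weights AS
    SELECT page_id, COUNT(*) AS weight
    FROM page_views
    GROUP BY page_id
""")

# Step 2 -- weighted sample of 2 pages: each row gets the key
# -ln(U)/weight, and the rows with the smallest keys are kept.
conn.create_function(
    "sample_key", 1,
    lambda w: -math.log(1.0 - random.random()) / w)  # 1-U avoids log(0)

sample = conn.execute("""
    SELECT p.page_id, p.title, w.weight
    FROM pages p JOIN page_weights w USING (page_id)
    ORDER BY sample_key(w.weight)
    LIMIT 2
""").fetchall()
print(sample)
```

The sample is random, but because the keys shrink as weights grow, the heavily viewed page is far more likely to be selected, which is exactly the behavior you want when review capacity is limited.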

That’s a pretty typical scenario. You think about the question you’re trying to answer. You think about the data you have and the data you’ll need. Then you figure out how to get the data you need, hopefully without too much trouble.

replies(1): >>41906352 #
2. swasheck ◴[] No.41906352[source]
so that’s a “yes, in many cases it requires preprocessing, which would happen before the scope of this article begins.”

thank you!