
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 2 comments
tmoertel ◴[] No.41899091[source]
Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.
replies(2): >>41904092 #>>41904525 #
swasheck ◴[] No.41904525[source]
this is really fascinating and interesting to me. i’ve been using sql to analyze large data sets recently, since sql is my primary skill set, and having new methods and algorithms is quite handy and interesting.

i do have a clarifying question. this is built on weighted sampling, with the weights already in the dataset. does that imply some sort of preprocessing to arrive at the weights?

replies(1): >>41904938 #
1. tmoertel ◴[] No.41904938[source]
The weights just indicate how important each row is to whatever you’re trying to estimate. Sometimes, your datasets will have numeric columns that naturally suggest themselves as weights. Other times, you may have to create weights.

For example, say you run a website like Wikipedia and want to estimate the percentage of page views that go to pages that are dangerously misleading. You have a dataset that contains all of your site’s pages, and you have logs that indicate which pages users have viewed. What you don’t have is a flag for each page that indicates whether it’s “dangerously misleading”; you’ll have to create it. And that’s likely to require a panel of expert human reviewers. Since it’s not practical to have your experts review every single page, you’ll want to review only a sample of pages. And since each page spreads misleading information only to the degree it’s viewed by users, you’ll want to weight each page by its view count. To get those counts, you’ll have to process the site logs, and count how many times each page was viewed. Then you can take a sample of pages weighted by those counts.
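To make that scenario concrete, here is a minimal sketch using Python's sqlite3. The table and column names are hypothetical, and the sampling step uses the exponential-key trick (give each row the key -ln(U)/weight for uniform U and keep the smallest keys), which is one standard way to draw a weighted sample with a plain ORDER BY ... LIMIT. The key function is registered as a UDF because SQLite's built-in ln() isn't always compiled in:

```python
import math
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")

# Hypothetical schema: a table of pages and a raw view log.
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE page_views (page_id INTEGER);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [(1, "Popular"), (2, "Niche"), (3, "Obscure")])
# Page 1 gets almost all of the traffic.
views = [1] * 90 + [2] * 9 + [3] * 1
conn.executemany("INSERT INTO page_views VALUES (?)", [(v,) for v in views])

# Step 1 -- the "preprocessing": derive each page's weight
# by counting how many times it was viewed.
conn.execute("""
    CREATE TABLE page_weights AS
    SELECT page_id, COUNT(*) AS weight
    FROM page_views
    GROUP BY page_id
""")

# Step 2 -- weighted sample of 2 pages: each row gets the key
# -ln(U)/weight, and the rows with the smallest keys are kept.
conn.create_function(
    "sample_key", 1,
    lambda w: -math.log(1.0 - random.random()) / w)  # 1-U avoids log(0)

sample = conn.execute("""
    SELECT p.page_id, p.title, w.weight
    FROM pages p JOIN page_weights w USING (page_id)
    ORDER BY sample_key(w.weight)
    LIMIT 2
""").fetchall()
print(sample)
```

The sample is random, but because the keys shrink as weights grow, the heavily viewed page is far more likely to be selected, which is exactly the behavior you want when review capacity is limited.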

That’s a pretty typical scenario. You think about the question you’re trying to answer. You think about the data you have and the data you’ll need. Then you figure out how to get the data you need, hopefully without too much trouble.

replies(1): >>41906352 #
2. swasheck ◴[] No.41906352[source]
so that’s a “yes, in many cases it requires preprocessing, which would happen before the scope of this article begins.”

thank you!