Sampling with SQL

(blog.moertel.com)
175 points | thunderbong | 1 comment
rhymer ◴[] No.41900725[source]
Be careful, the weight of Algorithm A by Efraimidis and Spirakis cannot be interpreted as the inclusion probability, and thus cannot be used in survey sampling to construct the Horvitz–Thompson estimator.

See "Remarks on some misconceptions about unequal probability sampling without replacement" by Yves Tillé.

Quoted from Tillé's conclusion: "There is a multitude of correct and fast method of sampling... there is no reason to use an incorrect method like weighted random sampling where we do not control the inclusion probabilities"

It's not clear to me how easy it is to implement the "multitude of correct and fast methods" in SQL, though. Would love to see some reference implementations.
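To make the caveat concrete: taking the top n keys in Algorithm A is (via its exponential-key formulation) equivalent to drawing items one at a time with probability proportional to the remaining weights, and the inclusion probabilities that result are generally not proportional to the weights. A small exact computation with a toy population (weights 1, 2, 4, sample size 2; my own illustration, not from Tillé's paper):

```python
from fractions import Fraction

def inclusion_probs(weights):
    """Exact inclusion probabilities for 2 draws of successive
    (draw-by-draw) PPS sampling without replacement, the scheme
    Algorithm A's top-k selection corresponds to."""
    W = sum(weights)
    probs = []
    for i, wi in enumerate(weights):
        # P(i drawn first) + sum over j of P(j first, then i)
        p = Fraction(wi, W)
        for j, wj in enumerate(weights):
            if j != i:
                p += Fraction(wj, W) * Fraction(wi, W - wj)
        probs.append(p)
    return probs

weights = [1, 2, 4]
W = sum(weights)
actual = inclusion_probs(weights)                # 41/105, 5/7, 94/105
naive = [Fraction(2 * w, W) for w in weights]    # 2w/W: 2/7, 4/7, 8/7
```

Note that the "naive" value for the heaviest item is 8/7 > 1, so no sampling scheme could achieve inclusion probabilities proportional to these weights; the actual probabilities necessarily differ.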

replies(1): >>41902816 #
apwheele ◴[] No.41902816[source]
Care to elaborate on this? So this post does not save the resulting weight, so you don't use that in any subsequent calculations. You would just treat the result as a simple random sample. So it is unclear why this critique matters.
replies(1): >>41904496 #
tmoertel ◴[] No.41904496[source]
Yes, rhymer has hit on the primary tradeoff behind Algorithm A. While Algorithm A is fast, easy, and makes it possible to sample any dataset you can access with SQL, including massive distributed datasets, its draws from the population are not independent and identically distributed (each draw makes the remaining population one item smaller). Nor does it let you compute the inclusion probabilities that would let you use the most common reweighting methods, such as a Horvitz–Thompson estimator, to produce unbiased estimates from potentially biased samples.
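For readers who want the shape of Algorithm A, here is a sketch in Python rather than SQL (my own illustration, not the blog's implementation): each row gets the key u**(1/w) with u uniform on (0,1), and the sample is the n rows with the largest keys. A common SQL analogue sorts by the equivalent exponential key, `ORDER BY -LN(RANDOM()) / weight LIMIT n`.

```python
import heapq
import random

def algorithm_a_sample(rows, n, weight):
    """Efraimidis-Spirakis Algorithm A: assign each row the key
    u ** (1/w), u ~ Uniform(0,1), and keep the n largest keys.
    `weight` is a function returning a row's positive weight."""
    keyed = ((random.random() ** (1.0 / weight(r)), r) for r in rows)
    return [r for _, r in heapq.nlargest(n, keyed, key=lambda kr: kr[0])]
```

Because the keys are independent per row, the top-n selection can be done in one streaming pass (or one distributed ORDER BY ... LIMIT), which is what makes the method practical on massive datasets.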

In practice, however, when your samples are small, your populations are large, and your populations' weights are not concentrated in a small number of members, you can use Algorithm A and just pretend that you have a sample of i.i.d. draws. In these circumstances, the potential for bias is generally not worth worrying about.

But when you cannot play so fast and loose, you can produce unbiased estimates from an Algorithm A sample by using an ordered estimator, such as Des Raj's ordered estimator [1].
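As a sketch of the formulas in the linked chapter (my own illustration; it assumes the ys are listed in the order drawn and each p is the unit's single-draw selection probability, i.e., its weight over the total population weight):

```python
def des_raj_total(ys, ps):
    """Des Raj's ordered estimator of the population total.
    t_1 = y_1/p_1, and for i > 1:
    t_i = y_1 + ... + y_{i-1} + (y_i/p_i) * (1 - p_1 - ... - p_{i-1}).
    The estimate is the mean of the t_i."""
    ts, y_seen, p_seen = [], 0.0, 0.0
    for y, p in zip(ys, ps):
        ts.append(y_seen + (y / p) * (1.0 - p_seen))
        y_seen += y
        p_seen += p
    return sum(ts) / len(ts)
```

Each t_i is individually unbiased for the total under successive PPS sampling, which is why averaging them gives an unbiased estimate even though the draws are not i.i.d.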

You could alternatively use a different sampling method, one that does produce inclusion probabilities. But these tend to be implemented only in more sophisticated statistical systems (e.g., R's "survey" package [2]) and thus not useful unless you can fit your dataset into those systems.

For very large datasets, then, I end up using Algorithm A to take samples and (when it matters) Des Raj's ordered estimator to make estimates.

(I was planning a follow-up post to my blog on this subject.)

[1] The original papers are hard to come by online, so I suggest starting with an introduction like https://home.iitk.ac.in/~shalab/sampling/chapter7-sampling-v..., starting at page 11.

[2] https://cran.r-project.org/web/packages/survey/survey.pdf