Sampling with SQL

(blog.moertel.com)
175 points by thunderbong
Horffupolde No.41903010
ORDERing by RANDOM() on a large table requires a full scan, even with a LIMIT. For tables that don't fit in memory this is impractical, and I/O shoots to 100%.
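For concreteness, this is the pattern being criticized; the `events` table and the sample size are hypothetical:

    -- Every row must be read and assigned a random key before the LIMIT
    -- can apply, so even a 100-row sample costs a full table scan.
    SELECT *
    FROM events
    ORDER BY random()
    LIMIT 100;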
replies(4): >>41904118 >>41904953 >>41906210 >>41906816
mbb70 No.41904953
How would you randomly sample a table without a full table scan? If each row must be equally likely to be sampled, you must touch each row.

Unless you know the id range and can pick N ids ahead of time, then use an index to pull those rows. But I think the article assumes you can't do that.
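A rough sketch of that workaround in Postgres-flavored SQL; the `events` table, its integer `id` column, and the sample/oversample sizes are all assumptions, and it only works well when ids are roughly dense over the known range:

    -- Pick candidate ids up front, then fetch them through the primary-key
    -- index instead of scanning the whole table.
    WITH bounds AS (
      SELECT max(id) AS max_id FROM events
    ),
    candidate_ids AS (
      SELECT DISTINCT (1 + floor(random() * max_id))::bigint AS id
      FROM bounds, generate_series(1, 200)  -- oversample to cover gaps and duplicates
    )
    SELECT e.*
    FROM events e
    JOIN candidate_ids USING (id)
    LIMIT 100;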

replies(2): >>41906268 >>41907022
tmoertel No.41907022
As I wrote in an earlier comment [1], you can take the sample in two parts, and only the first part needs a scan. Even then, if you index your weights (or are using a column store like Parquet), that scan can be very lightweight: it reads nothing but the weights and primary keys.
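Roughly, the two parts might look like this. The table and column names are assumed; the ordering key -ln(random())/weight is the weighted-sampling trick from the article, which reduces to a uniform sample when every weight is 1:

    -- Part 1: choose the sampled keys. This scan can read only (id, weight),
    -- which is cheap with a covering index or a columnar store.
    WITH sample_keys AS (
      SELECT id
      FROM events
      ORDER BY -ln(random()) / weight   -- smallest keys win; guard against
                                        -- random() = 0 if your engine allows it
      LIMIT 100
    )
    -- Part 2: fetch the full rows for just those keys via the primary-key index.
    SELECT e.*
    FROM events e
    JOIN sample_keys USING (id);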

[1] https://news.ycombinator.com/item?id=41906816