
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 1 comment | source
Horffupolde | No.41903010
ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.
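For reference, the pattern being criticized looks roughly like this (a minimal sketch in Postgres/SQLite-style syntax; `events` is a hypothetical large table). The engine has to attach a random sort key to every row and then order or heap-select over all of them, so the LIMIT does not avoid reading the whole table:

    -- Naive uniform sample: every row gets a random key, the whole
    -- table is read and ordered, and only then are 10 rows kept.
    SELECT *
    FROM events
    ORDER BY random()
    LIMIT 10;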
replies(4): >>41904118, >>41904953, >>41906210, >>41906816
mbb70 | No.41904953
How would you randomly sample a table without a full table scan? If each row must be equally likely to be sampled, you must touch each row.

Unless you know the id range and can select N ids ahead of time, then use an index to pull those. But I think the assumption of the article is that you can't do that.
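A minimal sketch of that index-assisted idea (Postgres syntax; assumes a hypothetical table `events` with a mostly contiguous integer primary key `id`). The min/max come straight off the primary-key index, candidate ids are generated up front, and each hit can be satisfied by an index lookup; ids that fall into gaps left by deleted rows simply return nothing, which is why the sketch over-samples:

    WITH bounds AS (
        SELECT min(id) AS lo, max(id) AS hi FROM events
    ),
    picks AS (
        -- over-sample so that candidates landing in deleted-id gaps
        -- still leave enough survivors for the LIMIT below
        SELECT (lo + floor(random() * (hi - lo + 1)))::bigint AS id
        FROM bounds, generate_series(1, 100)
    )
    SELECT e.*
    FROM events e
    WHERE e.id IN (SELECT id FROM picks)
    LIMIT 10;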

replies(2): >>41906268, >>41907022
crazygringo | No.41906268
In a production database where queries must be fast, indeed you'll have to rely on constraints such as knowing the ID range and that ID's are contiguous, and do exactly as you say -- randomly generate ID's in that range and pull them with an index.

And indeed, if your data doesn't naturally have contiguous ID's, you might create an AUTOINCREMENT primary key precisely to create those contiguous ID's. And then of course if you have a reasonable but not overwhelming number of deletions, you can get away with retrying a new random number every time you don't get a hit, up to a max number of retries where you just return an error and ask the user to attempt the operation again.
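One way to fold that retry loop into a single statement (a hedged sketch in Postgres syntax; the table `users` and the retry cap of 10 are made up for illustration): generate all the retries' random ids up front, keep the earliest candidate that still points at a live row, and let the application return the "try again" error if the query comes back empty:

    WITH bounds AS (
        SELECT min(id) AS lo, max(id) AS hi FROM users
    ),
    attempts AS (
        SELECT n,
               (lo + floor(random() * (hi - lo + 1)))::bigint AS candidate_id
        FROM bounds, generate_series(1, 10) AS n   -- 10 = max retries
    )
    SELECT u.*
    FROM attempts a
    JOIN users u ON u.id = a.candidate_id
    ORDER BY a.n        -- earliest attempt that found a live row wins
    LIMIT 1;

Because each attempt is an independent uniform draw over the id range, rejecting the misses leaves the surviving pick uniform over the rows that actually exist.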

replies(1): >>41906607