Sampling with SQL

(blog.moertel.com)

178 points thunderbong | 1 comments | 20 Oct 24 10:58 UTC


Horffupolde ◴[21 Oct 24 11:34 UTC] No.41903010[source]▶

ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.

replies(4): >>41904118 #>>41904953 #>>41906210 #>>41906816 #

tmoertel ◴[21 Oct 24 13:46 UTC] No.41904118[source]▶

>>41903010 #

Actually, on most modern SQL systems, especially those that deal with large datasets, the `ORDER/LIMIT` combination is implemented as a `TOP N` operation that consumes only O(N) memory.
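To illustrate the memory claim, here is a minimal Python sketch of how a `TOP N` operator can work. It is not taken from any particular database engine, and the name `top_n_smallest` is hypothetical. The sketch streams over every row once but never holds more than N rows, which is why `ORDER BY ... LIMIT N` does not need to sort the whole table in memory.

```python
import heapq
import random


def top_n_smallest(rows, n, key):
    """Return the n rows with the smallest key, holding at most n rows at a time."""
    if n <= 0:
        return []
    # Max-heap on the key, simulated by negating it. The index i breaks ties
    # so that rows themselves are never compared.
    heap = []
    for i, row in enumerate(rows):
        k = key(row)
        if len(heap) < n:
            heapq.heappush(heap, (-k, i, row))
        elif k < -heap[0][0]:
            # The new row beats the current worst of the kept rows: swap it in.
            heapq.heapreplace(heap, (-k, i, row))
    # Sorting by the negated key in descending order yields ascending keys.
    return [row for _, _, row in sorted(heap, reverse=True)]
```

Calling `top_n_smallest(rows, 10, key=lambda _: random.random())` mimics `ORDER BY RANDOM() LIMIT 10`. Every row is still visited, so the I/O cost of the scan remains; only the sort memory is bounded.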

replies(2): >>41904574 #>>41906229 #

swasheck ◴[21 Oct 24 14:30 UTC] No.41904574[source]▶

>>41904118 #

i don’t think the workspace memory is op’s point. if the initial set (not the intermediate or final projection) is too large, reads will come from disk, which also creates cpu churn. at least in mssql, if the data page is not in memory it is read into the buffer pool and then read from the buffer pool into the workspace for processing. if there isn’t enough buffer space to fit all of the data set pages in the pool, you’re going to be reading and flushing pages. i think pg operates similarly.

replies(1): >>41905132 #

1. tmoertel ◴[21 Oct 24 15:19 UTC] No.41905132[source]▶

>>41904574 #

Yes, to sample a population, you must consider every row in the population. There's no way around it. But it doesn't have to be as slow and expensive as reading the entire population table.

To determine which rows are in the sample, you need to consider only two columns: the weight and (a) primary key.

If your population lives in a traditional row-oriented store, then a plain table scan is going to have to read every row in full. But if you index your weights, you only need to scan the index, not the underlying population table, to identify the sample.
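As a rough sketch of the index idea, the Python snippet below uses SQLite through the standard `sqlite3` module. The table `population`, its `payload` column, the index name, and the helper `sample_ids` are all made up for illustration. The sampling key `-ln(1 - u) / weight` is an assumption too: it is one standard way to draw a weighted sample without replacement, and nothing in this thread says which method is in use. Whether the planner actually chooses an index-only scan depends on the engine.

```python
import heapq
import math
import random
import sqlite3


def sample_ids(conn, n, rng):
    """Pick n primary keys, weighted by `weight`, reading only (id, weight)."""
    # Only the two narrow columns are requested, so an engine can answer this
    # from an index on weight (which also carries the row id in SQLite)
    # without touching the wide payload column.
    cur = conn.execute("SELECT id, weight FROM population WHERE weight > 0")
    # Assumed method: give each row an exponential key and keep the n smallest.
    keyed = ((-math.log(1.0 - rng.random()) / w, pk) for pk, w in cur)
    return [pk for _, pk in heapq.nsmallest(n, keyed)]


# Hypothetical population: 1000 rows with a wide payload column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population ("
    "id INTEGER PRIMARY KEY, weight REAL NOT NULL, payload TEXT)"
)
conn.executemany(
    "INSERT INTO population (id, weight, payload) VALUES (?, ?, ?)",
    [(i, 1.0 + i % 5, "x" * 200) for i in range(1, 1001)],
)
conn.execute("CREATE INDEX idx_population_weight ON population (weight)")

sample = sample_ids(conn, 10, random.Random(0))
```

Once the sample's primary keys are known, only those few rows need to be fetched from the full table.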

If your population lives in a column-oriented store, identifying the sample is fast, again because you only need to read two small columns to do it. Column stores are optimized for this case.
