←back to thread

Sampling with SQL

(blog.moertel.com)
175 points thunderbong | 3 comments | | HN request time: 0.614s | source
Show context
Horffupolde ◴[] No.41903010[source]
ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.
replies(4): >>41904118 #>>41904953 #>>41906210 #>>41906816 #
tmoertel ◴[] No.41904118[source]
Actually, on most modern SQL systems, especially those that deal with large datasets, the `ORDER/LIMIT` combination is implemented as a `TOP N` operation that consumes only O(N) memory.
replies(2): >>41904574 #>>41906229 #
1. crazygringo ◴[] No.41906229[source]
But only when operating on fields that are indexed.

If you pull the highest 5 product ID's sorting by product ID, you're right it only reads those 5 entries from disk.

But when you ORDER BY RANDOM(), it's forced to do a full-table scan, because RANDOM() can't be indexed.

replies(1): >>41906464 #
2. tmoertel ◴[] No.41906464[source]
> But only when operating on fields that are indexed.

No, even without indexes, you only need O(N) memory to find the top N records under any ordering, including on random sort keys. You do have to examine every row, but if you are using a column store or have indexed the weight column, the scan will require only a small fraction of the work that a read-everything table scan would require.

replies(1): >>41906628 #
3. crazygringo ◴[] No.41906628[source]
Oh, sorry I read your comment quickly and hadn't realized you were talking about memory solely.

But that's because memory usage isn't the relevant bottleneck here -- it's disk IO. It's reading the entire table from disk, or only the entire column(s) if applicable.

That's not something you ever want to do in production. Not even if it's only a single column. That will destroy performance on any reasonably large table.

If a full-table scan of all columuns takes 180 seconds, then a "small fraction" to scan a single column might take 10 seconds, but queries running on a production database need to be measured in small numbers of milliseconds.