
Sampling with SQL

(blog.moertel.com)
175 points by thunderbong | 13 comments
1. Horffupolde ◴[] No.41903010[source]
ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.
replies(4): >>41904118 #>>41904953 #>>41906210 #>>41906816 #
2. tmoertel ◴[] No.41904118[source]
Actually, on most modern SQL systems, especially those that deal with large datasets, the `ORDER/LIMIT` combination is implemented as a `TOP N` operation that consumes only O(N) memory.
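
For instance, a minimal sketch in PostgreSQL (the table name is an assumption) -- the plan for a query of this shape typically reports a top-N heapsort, meaning only the LIMIT-ed rows are kept in memory even though every input row is examined:

    -- Sketch only (PostgreSQL syntax; the Population table is hypothetical).
    -- EXPLAIN ANALYZE typically shows "Sort Method: top-N heapsort" for a
    -- query of this shape: O(N) memory, though still a scan of the input.
    EXPLAIN ANALYZE
    SELECT *
    FROM Population
    ORDER BY RANDOM()
    LIMIT 1000;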
replies(2): >>41904574 #>>41906229 #
3. swasheck ◴[] No.41904574[source]
I don't think workspace memory is the OP's point. If the initial set (not the intermediate or final projection) is too large, reads will come from disk, which also creates CPU churn. At least in MSSQL, if a data page is not in memory, it is read into the buffer pool and then from the buffer pool into the workspace for processing. If there isn't enough buffer space to fit all of the data set's pages in the pool, you're going to be reading and flushing pages. I think Postgres operates similarly.
replies(1): >>41905132 #
4. mbb70 ◴[] No.41904953[source]
How would you randomly sample a table without a full table scan? If each row must be equally likely to be sampled, you must touch each row.

Unless you know the id range and select N ids ahead of time, then use an index to pull those. But I think the assumption of the article is you can't do that.
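
A minimal sketch of that id-range approach (PostgreSQL syntax; the table name, id column, and the 1..1000000 range are all assumptions):

    -- Pick roughly 1000 random ids from a known, gap-free range
    -- 1..1000000, then fetch just those rows via the primary-key index.
    -- No full-table scan, but it only works when the id range is known
    -- and contiguous.
    SELECT p.*
    FROM (
        SELECT DISTINCT 1 + floor(random() * 1000000)::int AS id
        FROM generate_series(1, 1000)
    ) AS picks
    JOIN Population AS p ON p.id = picks.id;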

replies(2): >>41906268 #>>41907022 #
5. tmoertel ◴[] No.41905132{3}[source]
Yes, to sample a population, you must consider every row in the population. There's no way around it. But it doesn't have to be as slow and expensive as reading the entire population table.

To determine which rows are in the sample, you need to consider only two columns: the weight and (a) primary key.

If your population lives in a traditional row-oriented store, then you're going to have to read every row. But if you index your weights, you only need to scan the index, not the underlying population table, to identify the sample.

If your population lives in a column-oriented store, identifying the sample is fast, again because you only need to read two small columns to do it. Column stores are optimized for this case.
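
To illustrate the column-store case, here is a hedged sketch using DuckDB over a Parquet file (the file name and column names are assumptions); it is the same ORDER/LIMIT idea, and the engine reads only the two columns the query touches:

    -- DuckDB reads only the columns a query references, so this scans
    -- just weight and pk from the Parquet file, not the whole table.
    SELECT pk
    FROM read_parquet('population.parquet')
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 1000;  -- sample size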

6. crazygringo ◴[] No.41906210[source]
Yup. For any kind of production app with a table that isn't tiny, you should absolutely never do an ORDER BY RANDOM().

You just need to make sure you're running these queries against a database used solely for occasional analytics, where it's no problem to be saturating the disk for two minutes or whatever, because it won't bother anybody else.

7. crazygringo ◴[] No.41906229[source]
But only when operating on fields that are indexed.

If you pull the 5 highest product IDs, sorting by product ID, you're right that it only reads those 5 entries from disk.

But when you ORDER BY RANDOM(), it's forced to do a full-table scan, because RANDOM() can't be indexed.

replies(1): >>41906464 #
8. crazygringo ◴[] No.41906268[source]
In a production database where queries must be fast, indeed you'll have to rely on constraints such as knowing the ID range and that IDs are contiguous, and do exactly as you say -- randomly generate IDs in that range and pull them with an index.

And indeed, if your data doesn't naturally have contiguous IDs, you might create an AUTOINCREMENT primary key precisely to produce them. Then, if you have a reasonable but not overwhelming number of deletions, you can get away with retrying a new random number every time you don't get a hit, up to a maximum number of retries, after which you return an error and ask the user to try again.
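
A set-based sketch of that retry idea (PostgreSQL syntax; the table name, id range, and oversampling factor are assumptions) -- generate more candidate IDs than you need so that deleted IDs simply fail to join:

    -- 1300 candidate ids for a sample of 1000; candidates that point at
    -- deleted rows just don't match, and LIMIT trims the surplus.
    -- If too many candidates miss, rerun with a larger oversample
    -- (the application-level retry described above).
    SELECT p.*
    FROM (
        SELECT DISTINCT 1 + floor(random() * 1000000)::int AS id
        FROM generate_series(1, 1300)
    ) AS candidates
    JOIN Population AS p ON p.id = candidates.id
    LIMIT 1000;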

replies(1): >>41906607 #
9. tmoertel ◴[] No.41906464{3}[source]
> But only when operating on fields that are indexed.

No, even without indexes, you only need O(N) memory to find the top N records under any ordering, including on random sort keys. You do have to examine every row, but if you are using a column store or have indexed the weight column, the scan will require only a small fraction of the work that a read-everything table scan would require.

replies(1): >>41906628 #
10. ◴[] No.41906607{3}[source]
11. crazygringo ◴[] No.41906628{4}[source]
Oh, sorry, I read your comment quickly and hadn't realized you were talking solely about memory.

But that's because memory usage isn't the relevant bottleneck here -- it's disk I/O. The query still has to read the entire table from disk, or the entire column(s) if applicable.

That's not something you ever want to do in production. Not even if it's only a single column. That will destroy performance on any reasonably large table.

If a full-table scan of all columns takes 180 seconds, then a "small fraction" to scan a single column might still take 10 seconds, but queries running on a production database need to be measured in small numbers of milliseconds.

12. tmoertel ◴[] No.41906816[source]
As I mentioned in the "Tricks for faster samples" [1] section of the article, you can take a sample in two parts: (1) identify the rows in the sample and (2) read the data you want for the rows in the sample.

Only Part 1 requires running Algorithm A and its ORDER/LIMIT logic. And, when you run that logic, you only need to read two tiny columns from the table you want to sample: the weight and the primary key (any candidate key will work). So it looks like this:

    SELECT pk
    FROM Population
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 1000  -- Sample size.

This operation will be fast whenever your data is stored in a column-oriented format or when you have indexed the weight.
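
Part 2 is then just a key lookup; a sketch (SampledKeys is a hypothetical holder -- e.g., a temp table or CTE -- for the pk values returned above):

    -- Part 2 (sketch): fetch the full rows only for the sampled keys.
    SELECT p.*
    FROM SampledKeys AS s
    JOIN Population AS p ON p.pk = s.pk;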

I trust that you will accept as self-evident that column stores are very fast when reading two small columns, since they don't make you pay for the columns you don't need.

So I'll focus on the row-store case. If you have indexed your weight, you basically have a tiny table of (weight, primary key) pairs sorted by weight. If you then run a query that needs to read only the weight and primary key, the query planner can fulfill your query by scanning the index, not the full population table. (The fact that the index is sorted doesn't matter. The query planner isn't taking advantage of the index ordering, just the fact that the index has everything it needs and is way smaller than the full population table. Most modern row-oriented database stores support index-only scans, including PostgreSQL [2] and SQLite [3].)
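
For concreteness, a sketch of such an index in PostgreSQL (table and column names are assumptions; in SQLite, an index on weight alone already carries the rowid):

    -- A composite index whose entries hold everything the sampling query
    -- reads (weight and pk), so the planner can satisfy it with an
    -- index-only scan and never touch the much wider population table.
    CREATE INDEX population_weight_pk_idx ON Population (weight, pk);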

[1] https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

[2] https://www.postgresql.org/docs/current/indexes-index-only-s...

[3] https://www.sqlite.org/optoverview.html#covering_indexes

13. tmoertel ◴[] No.41907022[source]
As I wrote in an earlier comment [1], you can take a sample in two parts, and only the first part needs to do a scan. Even then, if you index your weights (or are using a column store like Parquet), the scan can be very lightweight, ignoring everything except the weights and primary keys.

[1] https://news.ycombinator.com/item?id=41906816