←back to thread

Sampling with SQL

(blog.moertel.com)
175 points thunderbong | 1 comments | | HN request time: 0s | source
Show context
Horffupolde ◴[] No.41903010[source]
ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.
replies(4): >>41904118 #>>41904953 #>>41906210 #>>41906816 #
1. tmoertel ◴[] No.41906816[source]
As I mentioned in the "Tricks for faster samples" [1] section of the article, you can take a sample in two parts: (1) identify the rows in the sample and (2) read the data you want for the rows in the sample.

Only Part 1 requires running Algorithm A and its ORDER/LIMIT logic. And, when you run that logic, you only need to read two tiny columns from the table you want to sample: the weight and the primary key (any candidate key will work). So it looks like this:

    SELECT pk
    FROM Population
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 1000  -- Sample size.
This operation will be fast whenever your data is stored in a column-oriented format or when you have indexed the weight.

I trust that you will accept as self evident that column stores are very fast when reading two small columns since they don't make you pay to read the columns you don't need.

So I'll focus on the row-store case. If you have indexed your weight, you basically have a tiny table of (weight, primary key) pairs sorted by weight. If you then run a query that needs to read only the weight and primary key, the query planner can fulfill your query by scanning the index, not the full population table. (The fact that the index is sorted doesn't matter. The query planner isn't taking advantage of the index ordering, just the fact that the index has everything it needs and is way smaller than the full population table. Most modern row-oriented database stores support index-only scans, including PostgreSQL [2] and SQLite [3].)

[1] https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

[2] https://www.postgresql.org/docs/current/indexes-index-only-s...

[3] https://www.sqlite.org/optoverview.html#covering_indexes