Sampling with SQL

(blog.moertel.com)
175 points by thunderbong
Horffupolde No.41903010
ORDERing by RANDOM() on a large table requires a full scan, even with a LIMIT. For tables that don't fit in memory this is impractical, and I/O shoots to 100%.
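For concreteness, this is the pattern being criticized; the `events` table and the sample size are hypothetical:

    -- Every row must be read and assigned a random key before the LIMIT
    -- can apply, so even a 100-row sample costs a full table scan.
    SELECT *
    FROM events
    ORDER BY random()
    LIMIT 100;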
replies(4): >>41904118 >>41904953 >>41906210 >>41906816
mbb70 No.41904953
How would you randomly sample a table without a full table scan? If each row must be equally likely to be sampled, you must touch each row.

Unless you know the id range and can pick N ids ahead of time, then use an index to pull those rows. But I think the article assumes you can't do that.
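A rough sketch of that workaround in Postgres-flavored SQL; the `events` table, its integer `id` column, and the sample/oversample sizes are all assumptions, and it only works well when ids are roughly dense over the known range:

    -- Pick candidate ids up front, then fetch them through the primary-key
    -- index instead of scanning the whole table.
    WITH bounds AS (
      SELECT max(id) AS max_id FROM events
    ),
    candidate_ids AS (
      SELECT DISTINCT (1 + floor(random() * max_id))::bigint AS id
      FROM bounds, generate_series(1, 200)  -- oversample to cover gaps and duplicates
    )
    SELECT e.*
    FROM events e
    JOIN candidate_ids USING (id)
    LIMIT 100;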

replies(2): >>41906268 >>41907022
tmoertel No.41907022
As I wrote in an earlier comment [1], you can take the sample in two parts, and only the first part needs a scan. Even then, if you index your weights (or are using a column store like Parquet), that scan can be very lightweight: it reads nothing but the weights and primary keys.
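Roughly, the two parts might look like this. The table and column names are assumed; the ordering key -ln(random())/weight is the weighted-sampling trick from the article, which reduces to a uniform sample when every weight is 1:

    -- Part 1: choose the sampled keys. This scan can read only (id, weight),
    -- which is cheap with a covering index or a columnar store.
    WITH sample_keys AS (
      SELECT id
      FROM events
      ORDER BY -ln(random()) / weight   -- smallest keys win; guard against
                                        -- random() = 0 if your engine allows it
      LIMIT 100
    )
    -- Part 2: fetch the full rows for just those keys via the primary-key index.
    SELECT e.*
    FROM events e
    JOIN sample_keys USING (id);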

[1] https://news.ycombinator.com/item?id=41906816