ORDER BY RANDOM() on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory, this is impractical: I/O utilization shoots to 100%.
replies(4):
To determine which rows are in the sample, you need to consider only two columns: the weight and a primary key.
If your population lives in a traditional row-oriented store and the weights aren't indexed, then you're going to have to read every row. But if you index your weights, you only need to scan the index, not the underlying population table, to identify the sample.
If your population lives in a column-oriented store, identifying the sample is fast, again because you only need to read two small columns to do it. Column stores are optimized for this case.
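To make the "two columns only" point concrete, here is a rough sketch using Python's built-in sqlite3 module. Everything in it is assumed for illustration: the `population` table, its `payload` column, the `idx_population_weight` index, and the helper `weighted_sample_ids` are all hypothetical names. It also assumes the sample is drawn by keeping the k rows with the smallest exponential key `-ln(1 - u) / weight`, which this thread doesn't spell out. The selection step reads only `id` and `weight`, so the planner may be able to answer it from the index alone; the wide rows are fetched by primary key afterwards.

```python
import heapq
import math
import random
import sqlite3


def weighted_sample_ids(conn, k, rng):
    """Pick k ids without replacement, touching only the id and weight columns."""
    rows = conn.execute("SELECT id, weight FROM population WHERE weight > 0")
    # Assumed method: give each row the key -ln(1 - u) / weight and keep the
    # k smallest keys. rng.random() is in [0, 1), so the log argument is > 0.
    keyed = ((-math.log(1.0 - rng.random()) / w, pk) for pk, w in rows)
    return [pk for _, pk in heapq.nsmallest(k, keyed)]


# Hypothetical population table: two narrow columns plus a wide payload.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population (id INTEGER PRIMARY KEY, weight REAL NOT NULL, payload TEXT)"
)
# In SQLite an index on weight also stores the rowid (here, id), so it
# contains both columns the selection step needs.
conn.execute("CREATE INDEX idx_population_weight ON population (weight)")

rng = random.Random(0)
conn.executemany(
    "INSERT INTO population (id, weight, payload) VALUES (?, ?, ?)",
    [(i, rng.uniform(0.1, 10.0), "x" * 200) for i in range(1, 1001)],
)

# Step 1: identify the sample from (id, weight) only.
sample_ids = weighted_sample_ids(conn, 10, rng)

# Step 2: fetch the full rows for just the sampled primary keys.
placeholders = ",".join("?" * len(sample_ids))
sampled_rows = conn.execute(
    f"SELECT id, weight, payload FROM population WHERE id IN ({placeholders})",
    sample_ids,
).fetchall()
```

Whether the first query really avoids the table depends on the engine and its planner, so check the query plan on your own system. The same split (choose keys from the narrow columns, then look up the wide rows) carries over to a column store, where the two columns are already stored separately.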