←back to thread

Sampling with SQL

(blog.moertel.com)
175 points thunderbong | 1 comments | | HN request time: 0s | source
Show context
emmelaich ◴[] No.41899108[source]
Is there something in the SQL standard that says functions are guaranteed to executed more than once?

I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.

I may have used it in the `select` part. Dialect was Oracle's, from memory.

related: https://xkcd.com/221/

replies(5): >>41899141 #>>41899381 #>>41900027 #>>41901762 #>>41903922 #
hobs ◴[] No.41899141[source]
It depends on the function and the SQL implementation, you can see in this simulator that where rand() > rand() evaluates row by row in MySQL but once in SQL Server, so its easy to get this stuff messed up even if the code is "equivalent" its really not.

https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24

replies(1): >>41899285 #
emmelaich ◴[] No.41899285[source]
Thanks, that's a bit upsetting :-)
replies(2): >>41899389 #>>41899741 #
tmoertel ◴[] No.41899389[source]
Indeed.

On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

    -- Returns a pseudorandom fp64 number in the range [0, 1). The number
    -- is determined by the given `key`, `seed` string, and integer `index`.
    CREATE MACRO pseudorandom_uniform(key, seed, index)
    AS (
      (HASH(key || seed || index) >> 11) * POW(2.0, -53)
    );
replies(2): >>41899564 #>>41910073 #
o11c ◴[] No.41899564{3}[source]
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
replies(1): >>41899938 #
1. tmoertel ◴[] No.41899938{4}[source]
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiency, which can make alternatives like the deterministic function I shared advantageous.