Sampling with SQL

(blog.moertel.com)

178 points thunderbong | 1 comments | 20 Oct 24 10:58 UTC

emmelaich ◴[20 Oct 24 22:46 UTC] No.41899108[source]▶

Is there something in the SQL standard that says functions are guaranteed to be executed more than once?

I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.

I may have used it in the `select` part. Dialect was Oracle's, from memory.

related: https://xkcd.com/221/

replies(5): >>41899141 #>>41899381 #>>41900027 #>>41901762 #>>41903922 #

hobs ◴[20 Oct 24 22:52 UTC] No.41899141[source]▶

>>41899108 #

It depends on the function and the SQL implementation. You can see in this simulator that `where rand() > rand()` evaluates row by row in MySQL but once in SQL Server, so it's easy to get this stuff messed up: even if the code looks "equivalent", it really isn't.

https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24
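
One rough way to check how a given engine behaves is to select the random function over a multi-row table and count the distinct values returned. The sketch below does this against SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in here (it is neither of the engines linked above), and the table name `t` is hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(100)])

# If random() were evaluated once per query, every row would share one value.
values = [row[0] for row in conn.execute("SELECT random() FROM t")]
distinct_values = len(set(values))
evaluated_per_row = distinct_values > 1

print(f"{distinct_values} distinct values across {len(values)} rows")
```

The same query shape can be run on another engine with its own random function to see whether it yields one value or many.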

replies(1): >>41899285 #

emmelaich ◴[20 Oct 24 23:15 UTC] No.41899285[source]▶

>>41899141 #

Thanks, that's a bit upsetting :-)

replies(2): >>41899389 #>>41899741 #

tmoertel ◴[20 Oct 24 23:29 UTC] No.41899389[source]▶

>>41899285 #

Indeed.

On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

    -- Returns a pseudorandom fp64 number in the range [0, 1). The number
    -- is determined by the given `key`, `seed` string, and integer `index`.
    CREATE MACRO pseudorandom_uniform(key, seed, index)
    AS (
      (HASH(key || seed || index) >> 11) * POW(2.0, -53)
    );
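
To make the arithmetic in the macro concrete, here is a minimal Python sketch of the same idea: hash the concatenated inputs to a 64-bit integer, keep the top 53 bits, and scale by 2^-53 to land in [0, 1). It assumes `hashlib.blake2b` as a stand-in for DuckDB's `HASH`, so the numbers it produces will not match DuckDB's.

```python
import hashlib


def pseudorandom_uniform(key, seed, index):
    """Map (key, seed, index) to a float in [0, 1).

    blake2b is an assumed stand-in for DuckDB's HASH; outputs differ from DuckDB's.
    """
    data = f"{key}{seed}{index}".encode("utf-8")
    # 64-bit unsigned hash of the concatenated inputs.
    h = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    # Drop the low 11 bits to keep 53 bits (the precision of an fp64 mantissa),
    # then scale by 2**-53 so the result is at most 1 - 2**-53.
    return (h >> 11) * 2.0 ** -53


# Same inputs always give the same value; changing the index gives a fresh one.
u0 = pseudorandom_uniform(42, "demo-seed", 0)
u1 = pseudorandom_uniform(42, "demo-seed", 1)
print(u0, u1)
```

Because the value depends only on the inputs, rerunning a query reproduces the same "random" number for each row, and varying `index` gives as many independent-looking values per row as needed.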

replies(2): >>41899564 #>>41910073 #

o11c ◴[21 Oct 24 00:07 UTC] No.41899564[source]▶

>>41899389 #

`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?

replies(1): >>41899938 #

tmoertel ◴[21 Oct 24 01:32 UTC] No.41899938[source]▶

>>41899564 #

Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down `RAND` efficiently, which can make alternatives like the deterministic function I shared advantageous.
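
As a rough illustration of filtering on a deterministic function, the sketch below registers a hash-based uniform as a SQL function and uses it in a `WHERE` clause to take an approximately 10% sample. Everything specific is an assumption for the demo: SQLite (via Python's `sqlite3`) stands in for the real engine, `blake2b` stands in for DuckDB's `HASH`, and the `events` table and `'demo-seed'` string are hypothetical. It shows the shape of the filter and its reproducibility, not any engine's pushdown behavior.

```python
import hashlib
import sqlite3


def pseudorandom_uniform(key, seed, index):
    # blake2b is an assumed stand-in for DuckDB's HASH.
    data = f"{key}{seed}{index}".encode("utf-8")
    h = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return (h >> 11) * 2.0 ** -53  # float in [0, 1)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the same name, taking 3 arguments.
conn.create_function("pseudorandom_uniform", 3, pseudorandom_uniform)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10_000)],
)

# The filter depends only on the row's key, so it is a plain per-row predicate.
SAMPLE_SQL = """
    SELECT id FROM events
    WHERE pseudorandom_uniform(id, 'demo-seed', 0) < 0.1
    ORDER BY id
"""
first = [row[0] for row in conn.execute(SAMPLE_SQL)]
second = [row[0] for row in conn.execute(SAMPLE_SQL)]

print(len(first), first == second)
```

Rerunning the query selects the same rows, and changing the seed string selects a different sample of about the same size.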
