←back to thread

Sampling with SQL

(blog.moertel.com)
175 points thunderbong | 6 comments | | HN request time: 0.001s | source | bottom
Show context
emmelaich ◴[] No.41899108[source]
Is there something in the SQL standard that says functions are guaranteed to executed more than once?

I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.

I may have used it in the `select` part. Dialect was Oracle's, from memory.

related: https://xkcd.com/221/

replies(5): >>41899141 #>>41899381 #>>41900027 #>>41901762 #>>41903922 #
hobs ◴[] No.41899141[source]
It depends on the function and the SQL implementation, you can see in this simulator that where rand() > rand() evaluates row by row in MySQL but once in SQL Server, so its easy to get this stuff messed up even if the code is "equivalent" its really not.

https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24

replies(1): >>41899285 #
1. emmelaich ◴[] No.41899285[source]
Thanks, that's a bit upsetting :-)
replies(2): >>41899389 #>>41899741 #
2. tmoertel ◴[] No.41899389[source]
Indeed.

On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

    -- Returns a pseudorandom fp64 number in the range [0, 1). The number
    -- is determined by the given `key`, `seed` string, and integer `index`.
    CREATE MACRO pseudorandom_uniform(key, seed, index)
    AS (
      (HASH(key || seed || index) >> 11) * POW(2.0, -53)
    );
replies(2): >>41899564 #>>41910073 #
3. o11c ◴[] No.41899564[source]
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
replies(1): >>41899938 #
4. hobs ◴[] No.41899741[source]
Had a good laugh, this is the normal response to difference in SQL implementations in my experience.
5. tmoertel ◴[] No.41899938{3}[source]
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiency, which can make alternatives like the deterministic function I shared advantageous.
6. emmelaich ◴[] No.41910073[source]
Thanks. I think my trick was to generate (externally) a table of random numbers, then `join` by rowid that randtable to the table of interest. It was only a few million numbers so didn't take long to generate. And it was pretty much a one-off job.