Most active commenters
  • emmelaich(3)

←back to thread

Sampling with SQL

(blog.moertel.com)
175 points thunderbong | 12 comments | | HN request time: 0.521s | source | bottom
1. emmelaich ◴[] No.41899108[source]
Is there something in the SQL standard that says functions are guaranteed to executed more than once?

I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.

I may have used it in the `select` part. Dialect was Oracle's, from memory.

related: https://xkcd.com/221/

replies(5): >>41899141 #>>41899381 #>>41900027 #>>41901762 #>>41903922 #
2. hobs ◴[] No.41899141[source]
It depends on the function and the SQL implementation, you can see in this simulator that where rand() > rand() evaluates row by row in MySQL but once in SQL Server, so its easy to get this stuff messed up even if the code is "equivalent" its really not.

https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24

replies(1): >>41899285 #
3. emmelaich ◴[] No.41899285[source]
Thanks, that's a bit upsetting :-)
replies(2): >>41899389 #>>41899741 #
4. yen223 ◴[] No.41899381[source]
Postgres makes a distinction between IMMUTABLE, STABLE, and VOLATILE functions, with volatile functions being functions that - with the same arguments - can produce different results even within the same statement. Therefore VOLATILE functions will always be executed once per call.

I'm not sure if this is part of the ANSI SQL standard.

5. tmoertel ◴[] No.41899389{3}[source]
Indeed.

On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:

https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:

    -- Returns a pseudorandom fp64 number in the range [0, 1). The number
    -- is determined by the given `key`, `seed` string, and integer `index`.
    CREATE MACRO pseudorandom_uniform(key, seed, index)
    AS (
      (HASH(key || seed || index) >> 11) * POW(2.0, -53)
    );
replies(2): >>41899564 #>>41910073 #
6. o11c ◴[] No.41899564{4}[source]
`HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
replies(1): >>41899938 #
7. hobs ◴[] No.41899741{3}[source]
Had a good laugh, this is the normal response to difference in SQL implementations in my experience.
8. tmoertel ◴[] No.41899938{5}[source]
Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiency, which can make alternatives like the deterministic function I shared advantageous.
9. paulddraper ◴[] No.41900027[source]
No.
10. magicalhippo ◴[] No.41901762[source]
Conversely, today() and friends might be evaluated per row and not per row, to the mild surprise of some of us.

Just reinforced my belief of not assuming you know what the DB server is doing, verify!

11. adornKey ◴[] No.41903922[source]
About a decade ago I tried to use random() and I found its behaviour to be very random... The changelogs of all databases I used later contained bug fixes on this subject. (Sybase, Microsoft)

I'm not sure if standards today are consistent and trustworthy on this subject. Back in the day it was very obvious that most likely nobody ever seriously had tried to use random().

12. emmelaich ◴[] No.41910073{4}[source]
Thanks. I think my trick was to generate (externally) a table of random numbers, then `join` by rowid that randtable to the table of interest. It was only a few million numbers so didn't take long to generate. And it was pretty much a one-off job.