Optimizers: The Low-Key MVP

1. kwillets ◴[17 Nov 24 12:15 UTC] No.42163786[source]▶

One possible hand optimization is to push the aggregation below the joins, which makes the latter a few hundred rows.

2. kwillets ◴[17 Nov 24 17:04 UTC] No.42165295[source]▶

To follow up, it's about 3x faster:

  SELECT 
      pickup.zone AS pickup_zone,
      dropoff.zone AS dropoff_zone,
      cnt AS num_trips
  FROM
      (select pickup_location_id, dropoff_location_id, count(*) as cnt from taxi_data_2019 group  by 1,2) data
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') pickup
      ON pickup.LocationID = data.pickup_location_id
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') dropoff
      ON dropoff.LocationID = data.dropoff_location_id
  
  ORDER BY num_trips desc
  LIMIT 5;
  ┌───────────────────────┬───────────────────────┬───────────┐
  │      pickup_zone      │     dropoff_zone      │ num_trips │
  │        varchar        │        varchar        │   int64   │
  ├───────────────────────┼───────────────────────┼───────────┤
  │ Upper East Side South │ Upper East Side North │    536621 │
  │ Upper East Side North │ Upper East Side South │    455954 │
  │ Upper East Side North │ Upper East Side North │    451805 │
  │ Upper East Side South │ Upper East Side South │    435054 │
  │ Upper West Side South │ Upper West Side North │    236737 │
  └───────────────────────┴───────────────────────┴───────────┘
  Run Time (s): real 0.304 user 1.791931 sys 0.132745

(unedited query is similar to theirs, about .9s)

And, doing the same to the optimized query drops it to about .265s.

replies(1): >>42166903 #

3. Sesse__ ◴[17 Nov 24 20:26 UTC] No.42166903[source]▶

>>42165295 #

AFAIK there are rather few query optimizers that can do this; it's not an easy problem, in part because your search space explodes. There's a 2014 paper that addresses this (https://madoc.bib.uni-mannheim.de/37228/1/main.pdf) and some other papers that combine grouping and join into a “groupjoin” operation; the main problem is that it only really seems to scale to planning small queries (less than eight joins or so, IIRC).

replies(1): >>42167569 #

4. kwillets ◴[17 Nov 24 21:42 UTC] No.42167569{3}[source]▶

>>42166903 #

I've never run across it, but I would describe the job of an analytics engineer as doing this over and over for analysts -- it's probably semantically clearer to do the join first and then aggregate, so I end up pushing it down for them.

Thanks for the reference; this question has been on my mind for some time.