←back to thread

119 points tosh | 3 comments | | HN request time: 0s | source
Show context
kwillets ◴[] No.42163786[source]
One possible hand optimization is to push the aggregation below the joins, which makes the latter a few hundred rows.
replies(1): >>42165295 #
1. kwillets ◴[] No.42165295[source]
To follow up, it's about 3x faster:

  SELECT 
      pickup.zone AS pickup_zone,
      dropoff.zone AS dropoff_zone,
      cnt AS num_trips
  FROM
      (select pickup_location_id, dropoff_location_id, count(*) as cnt from taxi_data_2019 group  by 1,2) data
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') pickup
      ON pickup.LocationID = data.pickup_location_id
  INNER JOIN
      (SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') dropoff
      ON dropoff.LocationID = data.dropoff_location_id
  
  ORDER BY num_trips desc
  LIMIT 5;
  ┌───────────────────────┬───────────────────────┬───────────┐
  │      pickup_zone      │     dropoff_zone      │ num_trips │
  │        varchar        │        varchar        │   int64   │
  ├───────────────────────┼───────────────────────┼───────────┤
  │ Upper East Side South │ Upper East Side North │    536621 │
  │ Upper East Side North │ Upper East Side South │    455954 │
  │ Upper East Side North │ Upper East Side North │    451805 │
  │ Upper East Side South │ Upper East Side South │    435054 │
  │ Upper West Side South │ Upper West Side North │    236737 │
  └───────────────────────┴───────────────────────┴───────────┘
  Run Time (s): real 0.304 user 1.791931 sys 0.132745
(unedited query is similar to theirs, about .9s)

And, doing the same to the optimized query drops it to about .265s.

replies(1): >>42166903 #
2. Sesse__ ◴[] No.42166903[source]
AFAIK there are rather few query optimizers that can do this; it's not an easy problem, in part because your search space explodes. There's a 2014 paper that addresses this (https://madoc.bib.uni-mannheim.de/37228/1/main.pdf) and some other papers that combine grouping and join into a “groupjoin” operation; the main problem is that it only really seems to scale to planning small queries (less than eight joins or so, IIRC).
replies(1): >>42167569 #
3. kwillets ◴[] No.42167569[source]
I've never run across it, but I would describe the job of an analytics engineer as doing this over and over for analysts -- it's probably semantically clearer to do the join first and then aggregate, so I end up pushing it down for them.

Thanks for the reference; this question has been on my mind for some time.