The Yellowbrick Query Optimizer (TPC-DS Q03)

TPC-DS And The Yellowbrick Query Optimizer (TPC-DS Q03)

May 20, 2021

Yellowbrick

TPC-DS Q02, which we covered in our previous post, presented us a new set of challenges that stressed the Yellowbrick optimizer. In this post, we’ll look at TPC-DS Query 3 (Q03), which seemingly gives us a break from the usual complexity:

    
     SELECT
    dt.d_year
    , item.i_brand_id         brand_id
    , item.i_brand            brand
    , SUM(ss_ext_sales_price) sum_agg
FROM date_dim dt
INNER JOIN store_sales ON dt.d_date_sk = store_sales.ss_sold_date_sk
INNER JOIN item ON store_sales.ss_item_sk = item.i_item_sk
WHERE item.i_manufact_id = 436
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100
;

At first glance, nothing seems difficult about this vanilla star-join query with two simple filters (on i_manufact_id and d_moy). But nothing is ever that easy with TPC-DS!

Before we dive in, let’s establish some conventions to help us be concise in our reasoning:

A join between two tables (say: foo and bar) is written (foo * bar).
The cardinality of a table or column is nominated ||and ||, respectively. For example, there are 73,049 rows in date_dim, so |date_dim|= 73049.

 We will use the function nd() to denominate the number of distinct/unique values in an expression. For example: There are 12 distinct months in date_dim, so nd(d_moy) = 12.
 Subscripting a column name nominates a filter on that table. For example: d_date_sk_{(d_year = 2000)}is the set of d_date_sk after filtering by d_year = 2000. We can now express things such as this truth: nd(d_date_sk_{(d_year = 2000)}) = 366. And we can speak about the ratio of distinct values to total cardinality, as in:  |d_month_{(d_year = 2000)}| = 366
 nd(d_month_{(d_year = 2000)}) = 12

 Estimating d_moy = 12 and the (date_dim * store_sales) join
 Consider this join:

        
     FROM date_dim dt INNER JOIN store_sales ON dt.d_date_sk = store_sales.ss_sold_date_sk AND dt.d_moy = 12
    
   
 
 
 
   Recall that in our installment about Q01, we learned that date_dim contains a hidden trap: The number of primary key values is higher than number of foreign keys in the fact table (by about 40x). Because of this, selectivity is best estimated by dividing the number of selected rows from date_dim by the number of distinct values in the fact table. In Q01, we were joining with store_returns. For Q03, we join to store_sales instead, but the principle remains the same.
Using the Q01 estimation method and practicing the new notation, we get
|store_sales date_dim| = | |d_date_sk_{(d_moy = 12)}| / nd(ss_sold_date_sk) |store_sales|
From simple statistics, it is easy to see that
|d_date_sk_{(d_moy = 12)}| = 6200
nd(ss_sold_date_sk) = 1800
Thus:
|store_sales date_dim| ≈ 6200 / 1800 ≈ 3.4 |store_sales|
Obviously, this estimate is completely off. It would indicate that the join increases the number or rows. The issue here is that we selected more rows from date_dim than we have distinct rows in ss_store_sales_sk on the matching join key.
So, we must refine our reasoning from Q01 as follows: If the number of selected distinct values from d_date_sk is higher than the number of distinct values of ss_sold_date_sk, then it must be the case that the filter is operating on the non-overlapping set of keys that join the two tables. Hence, we should assume that the join yields a filter selectivity that is more in line with the selectivity of date_dim.
Using this line of reasoning, we instead estimate
|store_sales date_dim| = |d_date_sk_{(d_moy = 12)}| / nd(d_date_sk) |store_sales|
This turns out to be very close to the real value.
JOIN order
Using the extended algorithm for date_dim estimation, we can now establish the correct join order:
Join store_sales to item, harvesting the filter on i_manufact_id = 436
Then, join to date_dim, taking the filter on d_moy = 12
Finally, aggregate the results.

      
 
 
     PrevPreviousThe Golden Analytics Opportunity In Insurance 
  
 
  NextThe Cloud Is Dead, Long Live The CloudNext

     Related Posts
   Cloud
    
February 22, 2024
Why Private Data Cloud?
Read More
 
        
 February 16, 2023
A Data Warehouse for Data Apps
 
 
       
 January 9, 2023
Yellowbrick, Accelerated by Intel® Xeon®
 
 
 
 
 
 
 
 
      
 
 
   Meet Our Authors
 
 
             
 
 
   Alice Russell
 
 
 
 
 
 
          
 
 
   Allen Holmes
 
 
 
 
 
 
          
 
 
   Bob Rumsby
 
 
 
 
 
 
          
 
 
   Bryson Boatwright
 
 
 
 
 
 
          
 
 
   Christopher Edgar
 
 
 
 
 
 
          
 
 
   Ryan Atkinson
 
 
 
 
 
 
          
 
 
   David Tran
 
 
 
 
 
 
          
 
 
   Drew Gillies
 
 
 
 
 
 
          
 
 
   Gary West
 
 
 
 
 
 
          
 
 
   Hyoun Park
 
 
 
 
 
 
          
 
 
   Jason Snodgress
 
 
 
 
 
 
          
 
 
   Jim Peterson
 
 
 
 
 
 
          
 
 
   Keerti Garg
 
 
 
 
 
 
          
 
 
   Kevin Petrie
 
 
 
 
 
 
          
 
 
   Mark Cusack
 
 
 
 
 
 
          
 
 
   Matt Aslett
 
 
 
 
 
 
          
 
 
   Maxence Menager
 
 
 
 
 
 
          
 
 
   Mike Ferguson
 
 
 
 
 
 
          
 
 
   Neil Carson
 
 
 
 
 
 
          
 
 
   Nick Cox
 
 
 
 
 
 
          
 
 
   Paritosh Kulkarni
 
 
 
 
 
 
          
 
 
   Henry Cate
 
 
 
 
 
 
      
 
 
   Ramnath Sai Sagar
 
 
 
 
 
 
          
 
 
   Richard Soundy
 
 
 
 
 
 
          
 
 
   Morgane Rouvroy
 
 
 
 
 
 
          
 
 
   Tim Young
 
 
 
 
 
 
          
 
 
   Torben Mathiasen
 
 
 
 
 
 
          
 
 
   Umair Waheed
 
 
 
 
 
 
          
 
 
   Kurt Westerfeld
 
 
 
 
 
 
          
 
 
   Yellowbrick
 
 
 
 
 
 
 
 
   Get the latest Yellowbrick News & Insights
 
 
     
 
 
 
      Cloud
   Mark CusackFebruary 22, 2024
 Why Private Data Cloud?
 This blog post sheds light on user experiences with Redshift,...
 
 
  
 
  Competitors
   Ramnath Sai SagarDecember 13, 2023
 Data Brew: Redshift Realities & Yellowbrick Capabilities –...
 This blog post sheds light on user experiences with Redshift,...
 
 
  
 
  Cloud
   Umair WaheedDecember 12, 2023
 DBAs Face Up To Kubernetes
 DBAs face new challenges with Kubernetes, adapting roles in database...

Data Warehouse on Kubernetes

Product

Competitive Advantage

Blog

Migration Guide

Solutions

Industries

Product

TPC-DS And The Yellowbrick Query Optimizer (TPC-DS Q03)

Yellowbrick

Estimating d_moy = 12 and the (date_dim * store_sales) join

JOIN order

Related Posts

Meet Our Authors

Get the latest Yellowbrick News & Insights

Product

Solutions

Resources

Book a Demo

Learn More About the Only Modern Data Warehouse for Hybrid Cloud

TPC-DS And The Yellowbrick Query Optimizer (TPC-DS Q03)

Yellowbrick

Estimating d_moy = 12 and the (date_dim * store_sales) join

JOIN order

Related Posts

Get the latest Yellowbrick News & Insights

Search Our Data

Book a Demo

Learn More About the Only Modern Data Warehouse for Hybrid Cloud