Yellowbrick Query Optimizer Team
In our previous post about using TPC-DS queries to illustrate how the Yellowbrick query optimizer works, we introduced TPC-DS Query 1 (Q01). This post tackles Query 2 (Q02), focusing on the subtle tricks required to handle the fact tables in TPC-DS.
TPC-DS splits its sales information into three different fact tables, an approach typical in the real/non-benchmark world, too. In TPC-DS, these tables are
catalog_sales—twice the size of
store_sales—twice the size of
catalog_sales(obviously, the schema was designed pre COVID-19)
Many queries in TPC-DS will
UNION these tables together in various ways. Q02 is the first example:
WITH wscs AS ( SELECT sold_date_sk, sales_price FROM ( SELECT ws_sold_date_sk sold_date_sk , ws_ext_sales_price sales_price FROM web_sales UNION ALL SELECT cs_sold_date_sk sold_date_sk , cs_ext_sales_price sales_price FROM catalog_sales ) AS x ) , wswscs AS ( SELECT d_week_seq , SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END) sun_sales , SUM(CASE WHEN (d_day_name = 'Monday') THEN sales_price ELSE NULL END) mon_sales , SUM(CASE WHEN (d_day_name = 'Tuesday') THEN sales_price ELSE NULL END) tue_sales , SUM(CASE WHEN (d_day_name = 'Wednesday') THEN sales_price ELSE NULL END) wed_sales , SUM(CASE WHEN (d_day_name = 'Thursday') THEN sales_price ELSE NULL END) thu_sales , SUM(CASE WHEN (d_day_name = 'Friday') THEN sales_price ELSE NULL END) fri_sales , SUM(CASE WHEN (d_day_name = 'Saturday') THEN sales_price ELSE NULL END) sat_sales FROM wscs INNER JOIN date_dim ON d_date_sk = sold_date_sk GROUP BY d_week_seq ) SELECT d_week_seq1 , ROUND(sun_sales1 / sun_sales2, 2) , ROUND(mon_sales1 / mon_sales2, 2) , ROUND(tue_sales1 / tue_sales2, 2) , ROUND(wed_sales1 / wed_sales2, 2) , ROUND(thu_sales1 / thu_sales2, 2) , ROUND(fri_sales1 / fri_sales2, 2) , ROUND(sat_sales1 / sat_sales2, 2) FROM ( SELECT wswscs.d_week_seq d_week_seq1 , sun_sales sun_sales1 , mon_sales mon_sales1 , tue_sales tue_sales1 , wed_sales wed_sales1 , thu_sales thu_sales1 , fri_sales fri_sales1 , sat_sales sat_sales1 FROM wswscs INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq WHERE d_year = 2001 ) y , ( SELECT wswscs.d_week_seq d_week_seq2 , sun_sales sun_sales2 , mon_sales mon_sales2 , tue_sales tue_sales2 , wed_sales wed_sales2 , thu_sales thu_sales2 , fri_sales fri_sales2 , sat_sales sat_sales2 FROM wswscs INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq WHERE d_year = 2001 + 1 ) z WHERE d_week_seq1 = d_week_seq2 - 53::INT ORDER BY d_week_seq1 LIMIT 100 ;
Consider this part of the query:
WITH wscs AS ( SELECT ws_sold_date_sk sold_date_sk , ws_ext_sales_price sales_price FROM web_sales UNION ALL SELECT cs_sold_date_sk sold_date_sk , cs_ext_sales_price sales_price FROM catalog_sales )
Here, we construct a Common Table Expression (CTE) we will later join to
date_dim. But how should we estimate the cardinality of this join?
FROM wscs INNER JOIN date_dim ON d_date_sk = sold_date_sk
With statistics, it’s easy to estimate the cardinality of the
UNION ALL. However, to estimate the cardinality of the join itself, we have a few choices.
We know the number of distinct
sold_date_sk in each of the union branches. But even with very advanced statistics information, it is difficult to say how many distinct values the
UNION ALL of them has and what the overlap with the
date_dim actually is.
We know there is a foreign key from
d_date_sk and another foreign key from
d_date_sk. Because of this, we can know that the join to
date_dim of the
SUMof the two inputs
MINUSthe null rows on either
The second choice obviously provides the superior estimation method. Using primary/foreign key relationships such as this one is a key (pardon the pun) optimization for query planners: It rewards database designers who build good schemas and properly declare keys with a better query plan. This is why proper database design is very important for getting the most out of relational databases.
From an estimation perspective, this join is of particular interest:
FROM wswscs INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq WHERE d_year = 2001
Here, we are joining on something that is not a key (
d_week_seq)—which in general means there is plenty of opportunity to get the estimate wrong.
How well we estimate this join depends a lot on what we can say about the following construct:
wswscs AS ( SELECT d_week_seq , SUM(CASE WHEN … etc.. FROM wscs JOIN date_dim ON d_date_sk = sold_date_sk GROUP BY d_week_seq )
Estimating aggregation is generally a difficult problem (because the cardinality of the output depends on the correlation of the input columns). However, for this particular join, we can be very confident that the cardinality is likely to be the same as the cardinality of
d_week_seq, because we are grouping on only a single column.
The second part of the join requires us to reason about the cardinality of
date_dim, given the filter of
d_year = 2001. Here, a two-dimensional histogram of (
d_week_seq, d_year) would be a great help. Unfortunately, the number of possible histograms you’d have maintain in this manner would grow so fast that you’d often run out of memory/disk to hold them. However, we can say that the number of distinct values can’t possibly exceed the number of rows estimated for
d_year = 2001—i.e., 365. That gives us a reasonable cap.
In Q01, we saw that although the same CTE is used multiple times in a query, it may not be a good idea to materialize the result. However, for Q02, we are looking at some very significant savings. Consider the following construct:
SELECT d_week_seq , SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END) , … more columns …\ FROM wscs INNER JOIN date_dim ON d_date_sk = sold_date_sk GROUP BY d_week_seq
The output size of something grouped by
d_week_seq is obviously a tiny fraction of the input size of
wcsc (which you may recall is the
web_sales). Even with a high-speed column store, there is a bit to be gained by materializing this intermediate result, particularly because expressions such as the following are very expensive to evaluate:
SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END)
Human beings really like strings; we find it easy to remember and work with them. However, machines pay a significant price when comparing strings. For a CPU, it’s orders of magnitude faster to compare integers than it is to compare strings. There are seven of these string compares for every row in
wcsc—which becomes billions of compares during query execution. So, it’s worthwhile to save the overhead of calculating these multiple times by materializing the intermediate result.
In this query, we don’t have a lot of choices with regard to join ordering. The dependencies between the joins force an order. The order is:
date_dim—materialize the result of
wswscs(for 2001 and 2002)