Yellowbrick Query Optimizer Team
Query planning and execution are critical to data warehouse performance, and at Yellowbrick, we leave no stone unturned where performance optimization is concerned. A suboptimal approach to query optimization will almost always lead to suboptimal performance, regardless of the hardware resources involved.
As you read in a previous post ("Life of a Yellowbrick Query"), Yellowbrick moves a query through several finite states from submission to completion. Understanding this process helps you develop an effective workload management (WLM) strategy, especially with respect to the creation of WLM rules.
Over the next few months, we’ll be taking a deep dive into queries from the TPC-DS benchmark profile – with contributions from Yellowbrick's entire Query Optimization Team – to demonstrate how Yellowbrick handles them. (To see all the installments in this series, use the "Yellowbrick and TPC-DS" tag.) Query optimization is one of the harder problems in computer science and an evergreen topic for research, so we hope these discussions will contribute something worthwhile to that ongoing conversation.
So, let’s begin with TPC-DS Query 1 (Q01), a deceptively simple query that nevertheless provides ample opportunity for the optimizer to flex its muscles.
```sql
EXPLAIN (ANALYZE)
WITH customer_total_return AS (
    SELECT sr_customer_sk AS ctr_customer_sk
         , sr_store_sk    AS ctr_store_sk
         , SUM(SR_FEE)    AS ctr_total_return
    FROM store_returns
    INNER JOIN date_dim ON sr_returned_date_sk = d_date_sk
    WHERE d_year = 2000
    GROUP BY sr_customer_sk, sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return ctr1
INNER JOIN store    ON s_store_sk = ctr1.ctr_store_sk
INNER JOIN customer ON ctr1.ctr_customer_sk = c_customer_sk
WHERE ctr1.ctr_total_return > (
        SELECT AVG(ctr_total_return) * 1.2
        FROM customer_total_return ctr2
        WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
      )
  AND s_state = 'TN'
ORDER BY c_customer_id
LIMIT 100;
```
At first glance, this join seems like a simple one to estimate:
```sql
FROM store_returns
JOIN date_dim ON sr_returned_date_sk = d_date_sk
WHERE d_year = 2000
```
Joins to `date_dim` are present in most of the TPC-DS queries, so we’d better get this right.
With histograms, it is easy to estimate that the filter `WHERE d_year = 2000` must result in 366 rows (2000 was a leap year). But what about the cardinality of the join itself? A technique common in query optimization is to reason the following way:

- There are 73,049 rows in `date_dim` (201 years of dates); we picked 366 of those.
- There are 287,999,764 rows in `store_returns`, and although they are not uniformly distributed, sampling 366 dates via the foreign key `sr_returned_date_sk` is likely to result in something reasonably uniform.
- The fraction of rows we take from `date_dim` is the same fraction of rows we take from `store_returns`. Hence, we should expect 366 / 73,049 * 287,999,764 rows to come out of the join (approximately 1.4 million rows).
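The uniformity-based arithmetic above can be written out directly. This is just a sketch of the reasoning, using the row counts quoted in this post:

```python
# Naive join cardinality estimate under the uniformity assumption.
# Row counts are the TPC-DS figures quoted in the text.
date_dim_rows = 73_049            # all dates in date_dim (~201 years)
filtered_dates = 366              # rows surviving WHERE d_year = 2000
store_returns_rows = 287_999_764  # rows in store_returns

selectivity = filtered_dates / date_dim_rows
naive_estimate = selectivity * store_returns_rows
print(f"{naive_estimate / 1e6:.1f} million rows")  # ≈ 1.4 million
```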
Unfortunately, we would be very wrong to reason in this manner. The actual value is 55 million, so our estimate is off by a factor of almost 40. Is there something skewed about the data? Why doesn’t this method work?
Well, a cleverly designed trap has been set by the TPC-DS council, testing our ability to properly reason about a join such as this one. We can uncover the trap by running this query:
```sql
SELECT COUNT(DISTINCT sr_returned_date_sk) FROM store_returns
```
We get the result 2,003. Of the 73,049 possible dates in `date_dim`, only 2,003 (roughly 5.5 years of data) are actually present in the join key `sr_returned_date_sk` of `store_returns`. With HyperLogLog statistics (which we use in Yellowbrick), it is easy to estimate that the number of distinct values of `sr_returned_date_sk` in `store_returns` is indeed 2,003, so the planner has all the information it needs.
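To give a feel for why distinct counts are cheap to maintain, here is a toy HyperLogLog sketch. This is an illustration only, not Yellowbrick's actual implementation; it estimates the number of distinct values in one pass over the data using a few kilobytes of register state:

```python
import hashlib
import math

def hll_estimate(values, p=12):
    """Toy HyperLogLog: estimate the number of distinct values."""
    m = 1 << p                    # number of registers (4096 for p=12)
    registers = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1    # position of leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:                   # small-range correction
        est = m * math.log(m / zeros)
    return est

# ~2,003 distinct date keys, each repeated many times, as in store_returns
dates = [2450000 + (i % 2003) for i in range(100_000)]
print(round(hll_estimate(dates)))  # close to 2003
```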
Our revised reasoning becomes:

- The 366 dates we picked from `date_dim` will probably be among the ones that are actually in `store_returns`. (That is, we are looking for data that is there, not for data that isn’t, because that’s what humans do.)
- There are 287,999,764 rows in `store_returns`, spread over only 2,003 distinct dates. Hence, we should expect 366 / 2,003 * 287,999,764 rows to come out of the join (approximately 52 million rows).
This leads to a much better estimate that gets us to 52 million as compared to the actual of 55 million. Not bad!
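The revised arithmetic is an instance of the classic distinct-value estimation formula, |R ⋈ S| ≈ |R| × |S| / max(ndv(R.key), ndv(S.key)). A quick sketch with the numbers from this post:

```python
# Revised estimate: scale by the distinct join keys actually present
# in store_returns (2,003), not by all 73,049 dates in date_dim.
store_returns_rows = 287_999_764
filtered_dates = 366              # rows (and distinct dates) after d_year = 2000
distinct_return_dates = 2_003     # COUNT(DISTINCT sr_returned_date_sk)

# |R join S| ~= |R| * |S| / max(ndv(R.key), ndv(S.key))
estimate = (filtered_dates * store_returns_rows
            / max(filtered_dates, distinct_return_dates))
print(f"{estimate / 1e6:.1f} million rows")  # ≈ 52.6 million
```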
We can refine the estimate further by observing that rows with a NULL `sr_returned_date_sk` in `store_returns` can’t match anything (because NULL doesn’t match anything in an `INNER JOIN`). But that is a minor refinement in this case.
Often, humans will prefer to express queries in a manner that is not optimal for execution. In Q01, we have one such example:
```sql
WHERE ctr1.ctr_total_return > (
    SELECT AVG(ctr_total_return) * 1.2
    FROM customer_total_return ctr2
    WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)
```
Notice that the subquery computing the `AVG` value is correlated with the outer query via this expression:
`WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk`
So, what is really going on here? The user is asking us to evaluate, for each `ctr_store_sk` in `ctr1`, whether its `ctr_total_return` exceeds 1.2 times the `AVG(ctr_total_return)` (taken over the `ctr_customer_sk` values in `ctr2`) of the matching store.
Can you see the rewrite? The SQL language is designed to do exactly this kind of thing—it’s called a join! What we want to emit is this:
```sql
FROM customer_total_return ctr1
INNER JOIN (
    SELECT ctr_store_sk
         , AVG(ctr_total_return) * 1.2 AS avg_customer
    FROM customer_total_return ctr2
    GROUP BY ctr_store_sk
) AS ctr2
  ON ctr1.ctr_store_sk = ctr2.ctr_store_sk
 AND ctr1.ctr_total_return > ctr2.avg_customer
```
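You can convince yourself the decorrelation is sound with a toy check. The sketch below uses SQLite (not Yellowbrick) and a made-up five-row table; it runs the correlated form and the join rewrite and confirms they return the same customers:

```python
import sqlite3

# Tiny stand-in for customer_total_return; values are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ctr (ctr_customer_sk INT, ctr_store_sk INT, "
            "ctr_total_return REAL)")
con.executemany("INSERT INTO ctr VALUES (?, ?, ?)",
                [(1, 10, 5.0), (2, 10, 50.0), (3, 10, 8.0),
                 (4, 20, 1.0), (5, 20, 100.0)])

correlated = con.execute("""
    SELECT ctr_customer_sk FROM ctr ctr1
    WHERE ctr1.ctr_total_return > (SELECT AVG(ctr_total_return) * 1.2
                                   FROM ctr ctr2
                                   WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
    ORDER BY ctr_customer_sk""").fetchall()

rewritten = con.execute("""
    SELECT ctr1.ctr_customer_sk
    FROM ctr ctr1
    JOIN (SELECT ctr_store_sk, AVG(ctr_total_return) * 1.2 AS avg_customer
          FROM ctr GROUP BY ctr_store_sk) AS ctr2
      ON ctr1.ctr_store_sk = ctr2.ctr_store_sk
     AND ctr1.ctr_total_return > ctr2.avg_customer
    ORDER BY ctr_customer_sk""").fetchall()

print(correlated == rewritten)  # True: same rows either way
```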
With the subquery nicely rewritten to a join, we can now pursue yet another optimization.
Our query filters on
`AND s_state = 'TN'`
Consulting the stats, we can see that only 41 rows out of 1,002 in `store` match. Just as with `date_dim`, the overlap between `store_returns` and `store` isn’t complete, so the actual selectivity is only around 41 / 501. Still a very decent filter.
But once again, the TPC-DS council has set a trap for us. Notice that `store` is not joined directly to `store_returns`; it is joined indirectly via `ctr1`:
```sql
FROM customer_total_return ctr1
INNER JOIN store ON s_store_sk = ctr1.ctr_store_sk
```
`ctr1` is an aggregate:
```sql
SELECT sr_customer_sk AS ctr_customer_sk
     , sr_store_sk    AS ctr_store_sk
     , SUM(SR_FEE)    AS ctr_total_return
FROM store_returns
INNER JOIN date_dim ON sr_returned_date_sk = d_date_sk
WHERE d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk
```
However, we can’t possibly be interested in stores that don’t match the `s_state` filter. In other words, we could decide to join to `store` before we group by `sr_store_sk`. And the filter on `store` is worth a 10x reduction in rows.
We must now make a choice. Is it better to:

A) Remove 10x of the rows before grouping, by moving the join to `store` under the `ctr1` aggregate and harvesting the filter on `s_state`?

B) Reduce the cardinality of the input rows to the join by first grouping on `sr_store_sk` and then joining to `store`, that is, execute the query the way it is written?
The right answer to this question depends on numerous factors, including:

- How fast can we scan the relevant columns of `store_returns` and group them? (Answer in Yellowbrick: very, very fast, because we use a column store and we need to visit only two columns.)
It turns out that A is the right answer for the Yellowbrick execution engine: It is slightly faster.
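Note that option A is only legal because `sr_store_sk` is a grouping key, so joining to `store` below the aggregate cannot merge or split groups. A toy check in SQLite (not Yellowbrick), with invented rows, confirms the two plans agree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE store_returns (sr_customer_sk INT, "
            "sr_store_sk INT, sr_fee REAL)")
con.execute("CREATE TABLE store (s_store_sk INT, s_state TEXT)")
con.executemany("INSERT INTO store_returns VALUES (?, ?, ?)",
                [(1, 10, 2.0), (1, 10, 3.0), (2, 10, 4.0), (1, 20, 9.0)])
con.executemany("INSERT INTO store VALUES (?, ?)", [(10, 'TN'), (20, 'CA')])

# Plan B: group first, then join and filter (the query as written)
plan_b = con.execute("""
    SELECT ctr_customer_sk, ctr_store_sk, ctr_total_return
    FROM (SELECT sr_customer_sk AS ctr_customer_sk,
                 sr_store_sk    AS ctr_store_sk,
                 SUM(sr_fee)    AS ctr_total_return
          FROM store_returns GROUP BY sr_customer_sk, sr_store_sk)
    JOIN store ON s_store_sk = ctr_store_sk
    WHERE s_state = 'TN' ORDER BY 1, 2""").fetchall()

# Plan A: push the join and s_state filter below the GROUP BY
plan_a = con.execute("""
    SELECT sr_customer_sk, sr_store_sk, SUM(sr_fee)
    FROM store_returns JOIN store ON s_store_sk = sr_store_sk
    WHERE s_state = 'TN'
    GROUP BY sr_customer_sk, sr_store_sk ORDER BY 1, 2""").fetchall()

print(plan_a == plan_b)  # True: the pushdown preserves the result
```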
Notice that `ctr1` and `ctr2` both refer to the same Common Table Expression (CTE; the `WITH` part of the query). You might be tempted to first evaluate this expression and store it either in memory or on disk:
```sql
SELECT sr_customer_sk AS ctr_customer_sk
     , sr_store_sk    AS ctr_store_sk
     , SUM(SR_FEE)    AS ctr_total_return
FROM store_returns
INNER JOIN date_dim ON sr_returned_date_sk = d_date_sk
INNER JOIN store    ON s_store_sk = sr_store_sk
WHERE d_year = 2000
  AND s_state = 'TN'
GROUP BY sr_customer_sk, sr_store_sk
```
There are about 2.6 million rows in this expression, and the additional cost of building the memory structure and using it twice (as compared to just running the very small join two times) currently isn’t worthwhile given Yellowbrick's fast scan engine.
In this query, join ordering is nearly trivial. We observe that `date_dim` and `store` both provide filters that can be applied to `store_returns`. The `customer` table (which is large) has no useful filters, so we don’t want to join it until the absolute last minute.
Using the right estimation techniques as described above, combined with the pushdown of filters, we can see that the join order should be:

1. Join `store_returns` to `store`, harvesting the filter on `s_state` for a 10x reduction.
2. Join to `date_dim`, harvesting the `d_year` filter for another 5x reduction.
3. Perform the `GROUP BY (sr_customer_sk, sr_store_sk)`.
4. Finally, join to `customer` to pick up `c_customer_id`, which must be returned.
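Putting the quoted reductions together gives a rough back-of-envelope for the rows flowing through the plan. This is only approximate arithmetic on the figures stated above:

```python
# Approximate row counts through the chosen join order,
# using the ~10x and ~5x reductions quoted in the text.
store_returns_rows = 287_999_764
after_store_join = store_returns_rows / 10   # s_state = 'TN': ~10x reduction
after_date_join = after_store_join / 5       # d_year = 2000: ~5x reduction
print(f"{after_date_join / 1e6:.1f} million rows into the GROUP BY")
```

The GROUP BY then collapses these roughly 5.8 million rows down to the ~2.6 million distinct (customer, store) groups mentioned earlier.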