Sorted and Clustered Tables

A Yellowbrick table may be sorted on one column or clustered on multiple columns (up to four). You cannot specify a sort column and cluster columns on the same table. After specifying one of these options, you cannot change the specification without dropping and re-creating the table.

A table that is sorted on a single column facilitates block skipping during table scans when the query restricts the sort column.

A table that is clustered on multiple columns facilitates block skipping during table scans when the query restricts two or more of the cluster columns.

You cannot sort or cluster on encrypted columns.

Note: Capacity expansion (adding blades to the cluster) results in table rows being rewritten and redistributed on the storage system. If any manual ordering was applied to the data before the expansion, there is no guarantee that the data will remain sorted in that order. In turn, any query performance advantage associated with that order (such as skipping blocks during scans) will be lost.

Sort Column

You can define a sort column for a Yellowbrick table. The SORT ON (column) clause names a single column to sort on before data is stored on the media. For example:
create table team
(teamid smallint, htid smallint, atid smallint, name varchar(30), nickname varchar(20), city varchar(20), stadium varchar(50), capacity int) 
distribute on (teamid)
sort on (city);
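For example, a query that restricts the sort column can take advantage of block skipping during the scan (the query values here are illustrative):

select name, stadium
from team
where city = 'Chicago';

Because the rows are stored in city order, blocks that cannot contain matching city values are skipped during the scan.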

Clustered Tables

You can define up to four cluster columns for a Yellowbrick table. The CLUSTER ON (column, column, ...) clause names multiple columns to cluster on before data is stored on the media. For example:
create table team
(teamid smallint, htid smallint, atid smallint, name varchar(30), nickname varchar(20), city varchar(20), stadium varchar(50), capacity int) 
distribute on (teamid)
cluster on (name, nickname, city);

A clustered table maintains an internal map that organizes data as a set of points across the dimensions that are defined as cluster columns. This map optimizes query performance because the optimizer can skip the blocks of data that do not qualify for multiple restrictions. One pass over the data is sufficient to find the intersection of rows that meet the WHERE clause criteria, and a large amount of data in the base table can be eliminated very early in the query plan.
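For example, a query that restricts two of the cluster columns defined above can benefit from this map (the query values are illustrative):

select teamid, stadium
from team
where name = 'Rovers'
and city = 'Leeds';

Because name and city are both cluster columns, one pass over the data can eliminate blocks that do not qualify for either restriction.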


During capacity expansion, the system orders the shards by the timestamps at which they were loaded and rewrites them one batch at a time (approximately 8 shards per batch, to maximize drive bandwidth), attempting to preserve the temporal locality of the data.

However, the rows within a batch of shards are written to the newly rewritten shards in random order, because the source shards are read in parallel. For example, the TM workload also requires ordering on the org_id column, which is maintained by re-sorting the data by org_id at the end of each day. That ordering is likely to be lost during the rewrite if a batch of shards contains multiple unique org_id values. As a result, rows for a given org_id are scattered across multiple shards after the rewrite, which reduces block-skipping efficiency.

This end-of-day re-sorting is a user-imposed ordering that is unknown to the Yellowbrick system. Yellowbrick supports user-defined ordering in table DDL through the SORT ON and PARTITION BY clauses.
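As a sketch of how such an ordering could be declared instead of imposed externally, assume a hypothetical table with an org_id column like the one described above (the table and column names are illustrative):

create table events
(org_id int, event_ts timestamp, payload varchar(200))
distribute on (org_id)
sort on (org_id);

With the ordering declared via SORT ON, it is known to the system rather than depending on an external end-of-day re-sort.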