Partitioning Tables

Partitioning is a table design technique that significantly reduces resource consumption when queries join or aggregate rows. A partitioned table is stored in a number of separate pieces called partitions. Queries read and operate on individual partitions of the table one at a time instead of having to read and process the whole table at once, which reduces memory requirements and the likelihood of spilling. The reduction in memory requirements in turn enables more concurrency and better performance.

You set up the partitioning scheme for a table by declaring partition columns in the CREATE TABLE statement. You can partition on specific ranges of values, such as dates or numbers, or you can use a hash function to partition the data for you.

Partitioning is secondary to data distribution and does not change the distribution scheme defined in the CREATE TABLE statement. Partitions divide data within the worker node where data is already distributed.

For complete syntax and examples, see Partitioning Options.

To see how data is physically partitioned for a specific table that you have created and loaded, query the sys.table_partition view.

General Restrictions on Partitioning Tables

  • Up to four columns in a single table can be partitioned.
  • The number of partitions produced by a single specification must be no greater than 32767.
  • The product of the number of partitions produced by all specifications for the same table must be no greater than 250000. For example, the following table cannot be created:
    premdb=# create table hashtable(a int, b int) 
    partition by (hash(a with 3000 partitions), hash(b with 1000 partitions));
    ERROR:  the product of the number of partitions produced by all specifications results in 3000000 partitions, 250000 allowed
  • Replicated tables cannot be partitioned.
  • Partitioning schemes cannot be altered after a table is created. To change partitioning, you have to re-create the table.

Best Practices for Partitioning

This section suggests some general guidelines for partitioning tables. Whether a table benefits from partitioning is a function of several factors, including how the table is queried, how it is partitioned, and how much data is actually stored in each partition. Read the following sections to understand how these factors affect your partitioning choices.

Keep in mind that partitioning is an optional feature. The extent to which you use it may depend on specific performance expectations.

Which columns should be the partition columns?
Choose partition columns only after evaluating the following criteria. Partitioning on a table column is useful in three specific contexts:
  • When the data in the column can be sorted, and is frequently accessed by range using comparison or equality operators such as <, >, >=, or =.
  • When the column is commonly used alone in a GROUP BY clause.
  • When the column is frequently used in equi-joins.
In all of these cases, partitioning decreases the total memory and runtime required by applicable queries.

For example:

  • If you have time-series data and your queries group or join on date-time values, partition by the appropriate date or timestamp column. You can create one partition of data per some interval, such as day, week, or month.
  • If you join fact tables with other fact tables (large "fact-to-fact joins" or "data merges"), partition both fact tables in the same way on the primary and foreign key columns.
  • If you have a common logical unit that you report on in your fact tables – for example, the offices or stores that transactions took place in, partition by that unit as well.
  • If you have a very large dimension table that is often joined to a fact table, partition on the dimension surrogate key.
  • If you have a column in the fact table that is often used for filtering, partition on that column (for example, currency codes or country codes).

In general, all fact tables should be partitioned. Small dimension tables do not need to be partitioned, and tables that are less than 10GB in size may not benefit from partitioning at all.

When to use RANGE partitioning
Consider the following points:
  • Is RANGE partitioning supported for the column, given its data type? See the table in Partitioning Options.
  • Do you know reasonable upper and lower limits for the values in the column? In other words, do you know the domain of the column?
  • Do you often select from the column by using comparison operators, such as < or >?

If RANGE partitioning is available for the column type, and you know the domain of the column, it is always preferable to select the RANGE partition type. For example, you can partition on very specific ranges of values, such as dates or numbers.

Using a RANGE instead of a HASH partition can reduce the amount of data read when rows from the table are selected using comparison operators in a WHERE clause on the partitioned column. Using a HASH partition on the same column prevents this optimization.

Note: If you know the range of most of the data, but are unsure of the exact limits, be sure to include the OUTSIDE RANGE option. If the column may contain NULL values, be sure to specify IS NULL, or you will be unable to load any rows that contain NULL. See Partitioning Options.
When to use HASH partitioning

In all cases where RANGE partitioning does not apply, use HASH partitioning. For example, you can hash-partition columns that contain character strings, phone numbers, and IP addresses. For HASH partitioning, you do not need to specify how the partitions are defined, just how many partitions to create.

Choosing an optimal number of partitions

You also need to think about how many partitions your table will contain. A partitioned table is like a cube with individual partition columns as its dimensions. When you select a partition column, you define, implicitly or explicitly, a number of resulting partitions for that column. In turn, that number of partitions is multiplied by the number of partitions for other columns on the same table. The product of this calculation (the total number of "sub-partitions") is an important constraint on effective partitioning.

Consider the following points:
  • Although you can use up to four columns in the same table as partition columns, you should use as few as possible to produce the optimal number of partitions.
  • Partitioning schemes are always bounded by two limits:
    • The number of partitions per column (about 32,000)
    • The product of all partitioned columns per table (about 250,000).
  • The amount of data you expect to store in the table is directly related to the number of optimal partitions. For optimal performance, the table should be big enough and the product of the number of partitions small enough that each sub-partition is either completely empty or contains at least 2-10GB of row data.
  • Expected increases in the range of values in a column (future dates, for example), or the amount of data in a table may change your requirements over time. Partitioning cannot be updated at runtime; therefore, you need to create sufficient partitions ahead of time for business needs for the next few years. For example, if you are partitioning on a range of dates, the end of the range should extend well into the future.
Notes on bulk loads

Where possible, try to bulk load data one partition at a time. This means that your source data needs to be prepared in files that contain one partition of data each. For example, if the data is partitioned by range on dates with an interval of one day per partition, each data file should contain a day's worth of data.

If you cannot set up your loads in this way, query performance may be affected.

Some Example Calculations

The following examples demonstrate how to calculate an effective number of partitions on a single table:

Example 1

Say you intend to use three partition columns (c1, c2, c3) on a table (t1) that contains 1TB of data:
  • If the three columns evaluate to 5, 5, and 4 partitions each, the product of this partitioning scheme is 100 sub-partitions (5*5*4).
  • 1TB of data/100 sub-partitions = 10GB of data per partition, which is a good number. (It is fine if some of these partitions are empty.)

Example 2

Say you intend to use two partition columns (c1, c2) on a table (t2) that contains 10TB of data:
  • If the two columns evaluate to 100 and 1000 partitions each, the product of this partitioning scheme is 100,000 sub-partitions (100*1000).
  • 10TB of data/100000 sub-partitions = 100MB of data per partition, which is much too small.
  • You would need to think about reducing the number of partitions by using only one column or a different range, for example.

Example 3

Say you intend to use four partition columns (c1, c2, c3, c4) on a table (t3) that contains 20TB of data:
  • If the four columns evaluate to 100, 5, 8, and 200 partitions each, the product of this partitioning scheme is 800000 sub-partitions (10*5*8*200).
  • This number of partitions exceeds the system limit, and you will not be able to create the table at all. The CREATE TABLE statement will return an error.
  • If you partition on the first three columns only, the product will be 4000 (100*5*8).
  • 20TB of data/4000 sub-partitions = 5GB of data per partition, which is a reasonable number but on the low side.
  • It might be optimal to partition this table on only two columns. Pick the two that best fit the criteria described earlier in this section for selecting partition columns.