Testing SQL Databases with a Random Query Generator

Monkeying With QA

June 10, 2022

Henry Cate

Quality Engineer

One of the biggest changes with software development in the last couple of decades is just how quickly new releases come out. New tools and processes speed up both feature development and bug fixing. Hand in hand with this is the need to find bugs faster.

At Yellowbrick, one of the tools we’ve developed is a random query generator. This tool has helped to rapidly flush out hundreds of bugs.

The Problem: Testing SQL Databases Is Hard

Testing SQL databases is hard for several reasons.

The syntax for a SELECT statement in SQL allows an infinite number of different queries. Here are some of the issues:

Each expression in the SELECT clause can be complex.
can be joined in a variety of ways.
WHERE clause filters allow for complex Boolean expressions.
SELECT statements have other optional clauses, each of which can be very complex.
Queries are combinations of all the above.

It’s not possible to exhaustively test every possible query.

The same data can be stored in the database in many different ways. For example, a table may be range-partitioned or hash-partitioned. Depending on the structure of a given table, a database will generate different plans.

A plan is the way a database will process the data. Many queries have a large number of possible plans. Sometimes there are thousands, or millions, or even billions of different possible plans. An important job for a database is finding a good plan.

The type of data can also greatly influence which plan a database picks. The way the data is distributed influences which plans will be better. Some factors are:

How many unique values there are (for example, a small number of 50 states versus a large number of people’s names).
The distribution of the data. Is the data uniform or is the data in a bell-shape distribution? Is the data close together or scattered across a large range of possible values?
Are the join keys distribution keys?

The amount of data can also limit which plans are considered. Small tables might fit completely in memory, while large tables with terabytes of data might force a database to spill intermediate results to disk.

Other elements of queries that make it hard to test SQL databases include:

Set operations, such as UNION, EXCEPT, and INTERSECT.
Aggregations, such as SUM, AVERAGE, and so on.
Window functions.
Ordering of results.
Limits on the result set.

All the above elements can be combined in infinite ways. This makes testing the full matrix of possibilities impossible.

Our Approach: Generate Random SQL Statements

Manually creating queries takes time. It may take minutes or sometimes even hours to build a query. Sometimes, when you are testing a specific feature of a SQL database, it may make sense to craft a query that will exercise certain parts of the code. However, manually creating queries is expensive and slow.

Many people have heard of the infinite monkey theory. It states that if you put an infinite number of monkeys in front of typewriters, one of them will eventually “bash out the script for Hamlet.”

We can apply this idea of trying random things to testing SQL databases. A large set of random queries can flush out bugs that manual testing might never find. This approach still doesn’t cover the full matrix, but it can give us a lot more coverage. Each query has the potential to poke the database in a unique way, thus exposing a code path a QA engineer might never think to try.

Random Recursive Substitution

Much of a query is a set of expressions. For example:

    
     Select col1, col2 + 5, col3 * col4 from table1;

We can build random expressions by starting with a template and defining the legal options. For example:

    
     {integer expression} can be one of the following:
{integer literal value}
{integer column}
({integer expression} {some math operator} { integer expression})
{a function producing an integer value}
{a case statement producing an integer value}
{a query producing an integer vale}

…and so on. We can then start with a basic template of one query. For example:

    
     select {integer expression}, {integer expression}
from table1
where {boolean expression};

Then, using random recursion, we could produce billions of random queries. Here is a sample of the queries that might be created:

    
     select 1, abs(int_col2 + year(date_col2l))
from table1
where str_col1 < str_col2;

select length(str_col2) + abs(int_col3), 5
from table1
where col_int1 in (3, 4, 5, 77);

select (select max(int_col3) from table2), col_int2 – abs(col_in1)
from table1
where (col_in1, col_date2) in (
    select col_int2 + col_int3, date_col2
    from table2);

…and so on.

We use random recursion primarily with queries, but we also generate random INSERT, INSERT INTOSELECT, UPDATE, DELETE, CREATE VIEW, and CREATE TABLE AS SELECT statements. We’ve found that most of the bugs exposed this way happen with SELECT queries, but there have been a few bugs that would only occur in some other type of SQL statement.

Sample Queries

Here are two sample queries produced by our random query generator:

    
     (select  decimal_9_0 col_0
  FROM date_timestamp
  WHERE (null::ipv6 In  ('38d3:a412:5ab0:93ce:fffe:5237:32cc:7ecc'::ipv6))
)
MINUS
(select  decimal_18_18 col_0
  FROM date_timestamp_ran
  WHERE (varchar_10_nn In  (varchar_30, varchar_10_nn, varchar_10_nn, varchar_31_nn, varchar_31_all_null, varchar_30_nn, varchar_1, 'z'::char(13), varchar_1))
  ORDER BY col_bigint_nn)
MINUS
(select  decimal_18_18 col_0
  FROM date_timestamp_empty_ran
  ORDER BY 1)
EXCEPT
(select DISTINCT  100::decimal(5,0) col_0)
INTERSECT
(select DISTINCT  decimal_18_1 col_0
  FROM date_timestamp) ;

    
     select  (greatest (null::macaddr8) is distinct from ('46a42460:a96ad4d3'::macaddr8 | '89:5b:b9:89:5b:b9:aa:00'::macaddr8)) col_0, length (CURRENT_DATABASE()) col_1, convert(date, convert(varchar(50), (col_date_all_null - 5::bigint::int), 13), 13) col_2, radians (inv_norm(0.13, decimal_9_1, null::decimal(15,5))) col_3, (ntile(10) over (PARTITION BY LTRIM(varchar_10, 'a'), (NOT  EXISTS (
    select  col_uuid_all_null col_0, '123.124.125.126'::ipv4 col_1, '08:b3:66:fd:89:7a:17:09'::macaddr8 col_2, col_bool col_3, '83d3:a412:5ab0:93ce:fffe:5237:32cc:7ecc'::ipv6 col_4, '11 hours'::interval col_5
      FROM date_timestamp_view_transform
      WHERE sys.is_valid_varchar (col_int)
      LIMIT 44)), to_number ('' || col_int, '99999999999999999999999999999999999999'), (col_tswtz - '13 second'::interval), convert(timestamp, convert(varchar(50), col_timestamp_1, 101), 101) order by col_bool, col_timestamp_nn NULLS LAST)) col_4, (
        DECODE ('19:10:16'::time,
             '23:59:00'::time, '23:59:59'::time,
             col_time_same, col_time,
             '19:10:16'::time, col_time_same,
             col_time_nn, col_time_same,
             col_time_all_null, col_time_nn,
             col_time_all_null
        ) + '10 minutes'::interval) col_5
  FROM date_timestamp_ran
  ORDER BY 2 ASC, 4 USING >, 6 NULLS FIRST, 5, 1 NULLS LAST;

We have a number of knobs that allow us to increase the complexity to the point that we generate queries that go on for pages and pages.

Using a Query Generator

Many of our bugs come from situations where a query doesn’t work. A query may fail because of:

An assertion in the code.
An error message that should not happen for a valid query.
A fatal database error, and even a failure of the whole cluster.

We can also run random queries on our database and another database then compare the answers. This comparison has helped us to ensure correctness.

Benefits of the Query Generator

Our query generator helps us in a couple of important ways.

The first is the speed at which we can test a new feature. Often, as development is adding a new feature to our database, we can update the query generator to include the new feature in the random queries. That way when the feature is ready for testing, we can hit it with hundreds of thousands and even millions of different queries, often exposing dozens of bugs within the first week.

The second is just how many different code paths in the database get explored by these random queries. Sometimes a developer looking at a bug found by the query generator might comment, “I never thought we could get into this situation.”

Conclusion

Our query generator has greatly helped to improve the quality of our database. About 15% of all the SQL bugs that have been reported were found by this query generator. As new customers buy our database and existing customers buy more licenses, we are faced with new and unexpected queries. The random query generator helps us make sure that these new queries will work. This kind of testing improves the product and keeps our customers happy.