The Apache Hadoop technology stack is designed to process massive data sets on commodity servers, using parallelism of disks, processors, and memory across a distributed file system. Apache Impala (available commercially as Cloudera Data Warehouse) is a SQL-on-Hadoop solution that claims performant SQL operations on large Hadoop data sets.
Hadoop achieves optimal rotational (HDD) disk performance by avoiding random access and processing large blocks of data in batches. This makes it a good solution for workloads, such as recurring reports, that run in batches and have few or no runtime updates. However, performance degrades rapidly when organizations add modern enterprise data warehouse workloads, such as high volumes of ad hoc, interactive queries combined with continuous data ingestion.
As customers have migrated more and more of their data to Hadoop over time, the need to support these workloads has grown. SQL-on-Hadoop technologies such as Apache Hive and Impala were developed to provide that support, but they are SQL abstraction layers: they let users execute familiar SQL statements, yet they do not deliver high performance.
For example, classic database optimizations for storage and data layout that are common in the MPP warehouse world have not been applied in the SQL-on-Hadoop world. Although Impala has optimizations to enhance performance and capabilities over Hive, it must read data from flat files on the Hadoop Distributed File System (HDFS), which limits its effectiveness.
While Impala is a SQL layer on top of HDFS, the Yellowbrick hybrid cloud data warehouse is an analytic MPP database designed from the ground up to support modern enterprise workloads in hybrid and multi-cloud environments. The architectural differences are fundamental, and they show up directly in customer results.
One Impala customer turned to a Yellowbrick solution when increasing performance demands presented the company with a costly and complex Hadoop upgrade. The customer was already paying $500,000 per year in licensing and maintenance fees, and its Hadoop vendor’s recommendation was to double the size of the cluster, which would lead to more than $1 million per year in licensing and maintenance costs.
The Yellowbrick hybrid cloud data warehouse delivered far greater performance at a far lower cost (see Figures 1 and 2).
600TB Usable Capacity | Impala | Yellowbrick | Improvement |
---|---|---|---|
Number of Nodes | 50 | 15 | 3.3X |
System Size | 1.5 racks (100 rack units) | 6 rack units | 16.6X |
Total Memory | 37,500GB | 1,920GB | 19.5X |
Total Cores | 2,000 | 540 | 3.7X |
Figure 1. Yellowbrick architecture proves far more efficient than Impala in customer testing
Response Time Percentile | Impala (ms) | Yellowbrick (ms) | Improvement |
---|---|---|---|
99 | 16,791 | 5,597 | 3X |
98 | 13,513 | 4,826 | 2.8X |
95 | 9,103 | 3,793 | 2.4X |
90 | 7,675 | 3,070 | 2.5X |
75 | 4,497 | 2,044 | 2.2X |
50 | 2,675 | 1,070 | 2.5X |
25 | 1,184 | 592 | 2X |
Avg. Response | 7,920 | 2,999 | 2.6X |
Figure 2. Response time percentile for 200,000 queries in customer testing.
As Figure 1 shows, to deliver the same 600TB of usable capacity, the Yellowbrick hybrid cloud data warehouse requires 16x less rack space, which reduces data center floor-space costs, and far less memory, which reduces power and cooling costs.
The Yellowbrick system also dramatically improves performance, enabling organizations to achieve faster, higher-quality insights. The example customer runs hundreds of thousands of unique queries per day, looking for anomalies. To compare Impala to Yellowbrick technology, the customer generated ~200,000 queries and measured performance as a response time percentile, in milliseconds.
Importantly, the superior price/performance will continue well into the future: with the Yellowbrick hybrid cloud data warehouse, the customer can increase performance another 2.5x simply by adding 15 nodes (for 30 total) in just 4U of additional rack space.
End users noticed performance improvements immediately when the customer replaced Impala with a Yellowbrick solution. Complex queries that used to take 30 seconds now execute in less than 2 seconds, and short queries that took several seconds now complete in milliseconds, all while the Yellowbrick system continuously ingests terabytes of new data.
Customers can migrate data from a Hadoop environment into a Yellowbrick system by using familiar Hadoop-based tools, including Apache Sqoop and Apache Spark; some customers also use message queues such as Apache Kafka. To these tools, the Yellowbrick system looks and acts just like a PostgreSQL database: it uses the standard PostgreSQL ODBC and JDBC drivers for all tooling connectivity.
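Because the system presents itself as PostgreSQL, connectivity can be verified with any standard PostgreSQL client library. The minimal sketch below uses Python's psycopg2 driver; the host name, database, and credentials are hypothetical placeholders.

```python
# Minimal connectivity sketch: a Yellowbrick system accepts standard
# PostgreSQL client connections. Host, database, and credentials are
# hypothetical placeholders, not real endpoints.
import psycopg2

conn = psycopg2.connect(
    host="yb-manager.example.com",  # hypothetical Yellowbrick host
    port=5432,                      # standard PostgreSQL port
    dbname="analytics",
    user="ybuser",
    password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```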
Customers seeking a more robust feature set and faster data delivery can use the Yellowbrick high-speed tools ybload and ybrelay. ybload is a Java-based tool that loads data in parallel directly to the Yellowbrick system nodes. ybrelay is a Java-based listener that spawns parallel ybload jobs for the fastest loads from Hadoop, and it performs optimally with Spark via Yellowbrick's Spark SDK. Through Spark, a Yellowbrick system can ingest JSON, Parquet, Avro, or any other file type from Hadoop; perform any data enrichment; and format the data into CSV for loading via ybrelay.
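The Spark-side preparation step can be sketched in a few lines of PySpark. In the example below, the HDFS paths, column names, and enrichment logic are hypothetical; the handoff to ybrelay itself is configured separately through Yellowbrick's Spark SDK and is not shown.

```python
# Sketch of the Spark-side preparation described above: read raw files
# from HDFS, apply enrichment, and emit CSV for a ybrelay-triggered
# ybload job to consume. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-to-yellowbrick").getOrCreate()

# Read any supported source format; Parquet shown here.
events = spark.read.parquet("hdfs:///data/raw/events")

# Example enrichment: derive a date column and drop malformed rows.
enriched = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .na.drop(subset=["event_id"])
)

# Write CSV output for loading via ybrelay/ybload. (The ybrelay handoff
# is vendor-specific and configured outside this job.)
enriched.write.mode("overwrite").csv("hdfs:///data/staged/events_csv",
                                     header=True)

spark.stop()
```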
Figure 3. ybrelay listens for Spark or Sqoop jobs and sends data to the Yellowbrick system via ybload.
ybrelay is installed and configured on “edge nodes,” where it listens for a Spark or Sqoop job (see Figure 3). When a job completes and is ready to write, it requests a ybload job from ybrelay. ybrelay then configures four data pipes that feed four parallel ybload jobs, and it handles job termination and commit signaling to ybload. With this parallel functionality, bulk data movement into Yellowbrick systems can happen at wire speed, in excess of 10TB/hour.
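To make the fan-out concrete, here is a purely illustrative Python sketch of the pattern: one producer distributing chunks across four pipes, each drained by a parallel consumer, with an explicit termination signal. This is not ybrelay's implementation, only an analogy for how parallel pipes sustain high throughput.

```python
# Conceptual illustration only: a four-pipe fan-out with parallel
# consumers and a termination signal, analogous to (but NOT the same
# as) ybrelay feeding four ybload jobs.
import multiprocessing as mp

NUM_PIPES = 4  # ybrelay configures four data pipes per job

def loader(pipe_id: int, queue: mp.Queue) -> None:
    """Stand-in for one parallel loader consuming a data pipe."""
    while True:
        chunk = queue.get()
        if chunk is None:  # termination/commit signal
            break
        # A real loader would stream this chunk to a database node.
        print(f"pipe {pipe_id}: loaded {len(chunk)} bytes")

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(NUM_PIPES)]
    workers = [mp.Process(target=loader, args=(i, q))
               for i, q in enumerate(queues)]
    for w in workers:
        w.start()
    # Round-robin chunks of the extract across the pipes.
    for i, chunk in enumerate(b"x" * 1024 for _ in range(8)):
        queues[i % NUM_PIPES].put(chunk)
    for q in queues:  # signal completion on every pipe
        q.put(None)
    for w in workers:
        w.join()
```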
Apache NiFi is a data-routing and dataflow tool that moves data across IT ecosystems. Data flows have numerous sources and targets, and Yellowbrick hybrid cloud data warehouses can easily be configured into NiFi pipelines. The Yellowbrick system can serve both as a platform for ingesting, cleaning, storing, and archiving data and as a powerful SQL-based analytics engine for reporting and data discovery.
NiFi can incorporate the Yellowbrick hybrid cloud data warehouse as either a source or a target. As a target, the Yellowbrick system receives data from NiFi via the PutSQL processor; it uses the same JDBC driver as PostgreSQL and is configured the same way as a PostgreSQL database. The system uses two storage engines: a row store and a column store. JDBC writes flow to the row store, which is ideal for trickle feeds of less than 1GB; bulk writes go to the column store via the Yellowbrick ybload tool. Customers who need to write bulk data volumes to their Yellowbrick cloud data warehouse should use ybload.
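A trickle feed through the standard PostgreSQL driver might look like the following sketch, again using the psycopg2 client; the connection details and the events table are hypothetical. Small inserts like these land in the row store, while bulk volumes should go through ybload instead.

```python
# Trickle-feed sketch: small writes through the standard PostgreSQL
# driver land in the Yellowbrick row store. Connection details and the
# events table are hypothetical.
import psycopg2

conn = psycopg2.connect(host="yb-manager.example.com", dbname="analytics",
                        user="ybuser", password="secret")
rows = [(1, "login"), (2, "click"), (3, "logout")]
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO events (event_id, event_type) VALUES (%s, %s)",
        rows,
    )
conn.close()
```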
NiFi can source data from a Yellowbrick system by using three common processors: ExecuteSQL, QueryDatabaseTable, and GenerateTableFetch. The ExecuteSQL processor returns the result set of a SQL SELECT statement executed on the Yellowbrick system. Combined with the ability to scan hundreds of trillions of rows in seconds, a Yellowbrick hybrid cloud data warehouse provides blazing access speeds to petabyte-scale data sets, opening the door to deep insights from more data and making Yellowbrick an exceptionally fast source in NiFi environments.
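The statement NiFi's ExecuteSQL processor runs is ordinary SQL, so it can be prototyped through the same PostgreSQL driver before being configured in a flow. The query below is a hypothetical anomaly-scan example; the table and columns are illustrative only.

```python
# Prototype of the kind of SELECT one might configure in NiFi's
# ExecuteSQL processor against a Yellowbrick system. Table, columns,
# and connection details are hypothetical.
import psycopg2

QUERY = """
    SELECT event_type, COUNT(*) AS n
    FROM events
    WHERE event_ts >= now() - interval '1 hour'
    GROUP BY event_type
    ORDER BY n DESC
"""

conn = psycopg2.connect(host="yb-manager.example.com", dbname="analytics",
                        user="ybuser", password="secret")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for event_type, n in cur.fetchall():
        print(event_type, n)
conn.close()
```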
The Yellowbrick hybrid cloud data warehouse offers a simple, scalable, high-performance, cost-effective MPP alternative to Impala and other SQL-on-Hadoop offerings. In addition, it automates many common tuning and maintenance tasks, allowing your staff to put more focus on delivering business value, and is easy to install and use. You can deploy today and have results tomorrow.