Technical Brief:

Supercharge Cybersecurity with Netflow and Yellowbrick Data

Download the PDF

Overview

Effective cybersecurity hinges on automating risk detection, using predictive analytics to measure the risk of real-time activity against aggregated historical patterns But data volumes are now so massive that historical data is regularly archived to create space for new data, which means that less historical data is being analyzed quickly, so end results are less accurate or less timely. Enterprises are struggling for ways to collect Netflow data in near real time and analyze it against billions or trillions of rows of historical data without making a science project out of it. Enter Yellowbrick Data.

Yellowbrick offers a simple solution to the massive scale and latency requirements associated with production level Netflow analysis. Utilizing the latest hardware and software approaches, Yellowbrick can ingest Netflow and other data sets at line rate while still serving queries and other application needs. Yellowbrick can make the data available immediately and there are no indexes to create and manage. 2PB data sets are easily ingested and managed with a single Yellowbrick cluster occupying only 10U of rack space. Following are some key benefits.

Query IP Data Quickly and Efficiently

The Yellowbrick database treats IP addresses as their own type, similar to a number, and supports order and functions. This enables organizations to query ranges with greater than, less than, or between operators without the need to convert data types. For example:

select count(*) from netflow_ip where destination_ip >200.200.200.1

Rapidly Ingest, Scan, and Access Data

Yellowbrick can perform table scans at hundreds of gigabytes per second. A million-row table scan is a sub-second activity for Yellowbrick. Some other ways Yellowbrick improves data analytics is by enabling users to sort data into columns. Even after the initial sort, sorting continues as new data is added or inserted. By sorting IP addresses in order, organizations can significantly accelerate IP address lookups within a particular range. Yellowbrick also uses zone mapping to tell the database which blocks of the NVMe flash contain which ranges of data. This significantly accelerates scan times by reducing the amount of data that must be scanned. Yellowbrick even displays read efficiency to users to help with data layout and query planning.

Scale Easily to Massive Data Sets

The figure below shows the results of a query response time test as data set size grew from 500 million records up to 14 billion records. Querying 28x more records took just double the time.

netflow query on yellowbrick

Figure 1. Querying a 28x a larger dataset size took just double the time (.0122 seconds).

Load Data at 10 Million Rows per Second

Enterprise cybersecurity also struggles with loading massive volumes of data into the system to begin with. With Yellowbrick, organizations can achieve load times of ~150,000 rows a second using standard ODBC, JDBC, and ADO.net connectors. Using the Yellowbrick high-speed tool, ybload, organizations can load data directly over a 10G network, achieving load speeds of 10 million rows per second (about 1 GB/s). In the example below, a 308GB dataset loaded and was available for querying in about 5 minutes and 15 seconds.

netflow query on yellowbrick

Figure 1. Querying a 28x a larger dataset size took just double the time (.0122 seconds).

Yellowbrick and Netflow
  • Query IP data quickly and efficiently
  • Rapidly ingest, scan, and access data
  • Scale easily to massive data sets
  • Load data at 10-million rows per second

Yellowbrick - The only modern data warehouse for hybrid cloud

Yellowbrick is the world’s only modern data warehouse for hybrid cloud. Enterprises rely on Yellowbrick to do the impossible in data analytics: get answers to the hardest business questions for improved profitability, better customer loyalty, and faster innovation in near real time, and at a fraction of the cost of alternatives.

Yellowbrick offers superior price/performance for thousands of concurrent users on petabytes of data, along with the unique ability to run analytic workloads on-premises, in a private cloud, and/or any public cloud and manage them in a simple, consistent way--all with predictable pricing via annual subscription.