The story of data and analytics keeps evolving. As we look toward the ‘20s, the recognition of data as a company’s single most important resource is nothing new, though pundits debate whether data is the new oil or the new water. Petabyte analytics have moved from the domain of a few cutting-edge Internet companies to even moderate-scale enterprises. Enabled by widespread, easy-to-use technologies for storing, accessing, and analyzing data, “citizen data scientists” mine for insights within most companies, using tools like Python and R, battle-hardened SQL, and next-generation BI interfaces.
For how many years have we heard the phrase “big data” ringing in our ears? 2020 is the year big data fades into the background and practical large-scale analytics becomes commonplace, with “data lakes” front and center. Raise your hand if you’ve spent countless days, weeks, or months analyzing which cloud solution would be best for your company, because “everyone’s moving to the cloud.” What if you don’t have to choose between cloud and on-premises solutions, but can have the best of both worlds? For further insight into these topics and more, read on:
Big Data is truly dead, but the data lake looms large: In data processing, it is now well understood that MapReduce, and Hive layered on top of it, was a poor way to process data. Analytics that required relational techniques, like joins, could not be performed efficiently. Large-scale, feature-rich data warehouses, both cloud and on-premises, have improved radically to provide multi-petabyte scale using MPP architectures. That scale is made practical by pushing compute closer to the data, while SQL’s expressive semantics and aggregations allow innovative database optimizations. These realities killed “big data” as we knew it. However, one element of big data lives on: the data lake. Storage companies like NetApp and EMC are being challenged by cloud storage, which is radically cheaper. Data is a company’s crown jewels, and the well-considered best practice is to keep all of it, non-destructively, as a foundation for long-term business success. Long live the data lake; goodbye, big data.
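To make the compute-close-to-data point concrete, here is a minimal sketch (my own toy illustration, using an in-memory SQLite database; the tables and numbers are invented): a join plus aggregation expressed as one declarative SQL statement runs entirely inside the engine, which can choose its own join strategy, and only the small aggregated result leaves the database. A MapReduce-style job would instead shuffle both raw tables across the network to application code.

```python
import sqlite3

# Toy illustration: push a join + aggregation down into the database
# rather than pulling raw rows out and joining them in application code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "east"), (2, "west"), (3, "east")])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)])

# One declarative statement: the optimizer picks the join strategy, and
# only the tiny aggregated result crosses the wire.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('east', 17.5), ('west', 7.5)]
```

The same principle is what lets MPP warehouses scale: each node aggregates its local shard and only partial results move between machines.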
Best-of-Breed cloud is coming, under the name of hybrid: Public cloud vendors charge extortionately high prices. The public cloud makes sense for small- and medium-sized businesses because those businesses lack the scale to amortize the engineering spend of building and learning more complex architectures. Public clouds often don’t make sense, however, for technology companies, and examples are easy to find. Bank of America has gone on record as saving two billion dollars per year by not using the public cloud. I talked recently to a messaging-as-a-service company that had killed its competition by coming in at a lower price point for its customers, powered by a bare-metal hosting company. I also know an AI company that runs on servers it built and racked itself; provisioning systems like Kubernetes have made this “easy,” and the company would not be viable if it had to pay today’s steep public cloud compute prices.
A best-of-breed architecture requires identifying the building blocks within the technical stack, then selecting not from a single cloud vendor but from a variety of service providers. The assumptions that a given cloud provider has the lowest or best prices, or that the cost of networking between clouds is prohibitive, will become less and less true. A technology company that buys data center space a few milliseconds from a cloud provider will be able to offer a service that is potentially cheaper and better than the large cloud providers’. A “best-of-breed” architect will prove adept at understanding bundled service providers, infrastructure providers, and application providers, and at choosing the right path to achieve flexibility, price, and agility. For example, a storage startup is challenging S3 and other public cloud stores by providing more interfaces, more cloud interconnects, and lower prices. Do you choose that option? Do you split your analytics and operational tiers, moving analytics to the public cloud and operational workloads to bare metal? Enter the world of “best-of-breed” cloud and “best-of-breed” architects.
Data exchanges are the exciting data trend, but they must evolve into data services: Imagine you’re doing research and need access to data sets. Or you’d like to monetize the data you’re collecting. Enter the data exchange, where buyers and sellers meet! The elasticity of the cloud answers the call, with the ability to provision a database in moments, and cloud warehouse vendors like Yellowbrick Data rejoice in the increased demand for analytics. The trend seems exciting and new: datasets that can be trivially loaded into a database at the click of a button. Friction will be radically reduced; anyone with an AWS account will be able to buy datasets.
What will stop this trend? High-value datasets won’t be available in these exchanges. The ability to buy and sell data has been with us for a decade, in the form of data services, which provide complex ownership rules, in-service analytics, and industry-specific but standardized data models. Financial services companies already sell tick data at a cost, weather companies provide real-time updates, and mapping providers serve routes through API access; these data services are already profitable and important.
As exciting as general data exchanges seem, simply buying a dataset for $175 to pour into your cloud database will remain a niche.
Database innovation will be linked to hardware improvements: The most exciting and innovative databases are leveraging hardware advances to reach the next levels of price and performance. Intel Optane has proved compelling for operational databases such as Oracle Exadata X8M and Aerospike 4.8. Specialized analytic hardware powers the Yellowbrick Cloud, and Amazon has teased its AQUA caching layer with FPGA acceleration. The cloud enables this innovation: cloud companies can roll forward their hardware plans without on-premises installations, meaning users will be able to trial innovative hardware easily and experience its power. “Best-of-breed” architectural thinking will allow a company to choose precisely the best element of the tech stack, enabling innovators to capture market share more quickly than before. Companies will be running their databases on more and more specialized hardware without even realizing it!
AI is becoming a standard technique: Between random forests, linear regression, and other computational patterns, AI has become a standard technique. AI, like standard numeric techniques, is best done with compute close to the data. This means the techniques of “big data” (separating compute and data) are a poor choice for AI, just as they were for the majority of analytics. Running AI as hand-written code, whether on a compute grid or within a database, does not allow the kinds of optimizations that an AI framework (or an AI-centric query system) can provide. Relying on a Python script that has to be scrutinized and hand-tuned for different processor and network speeds and densities is woefully inefficient. The time is now to wrap the top 10 AI operations in high-level statements that allow parallel operations, horizontally scalable compute, and other techniques. This trend will come to fruition over the longer term, but in five years we’ll wonder why custom code lasted so long in the AI space.
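As a toy sketch of what pushing an AI operation toward the data can look like (my own illustration, not a description of any vendor’s product; the table and numbers are invented): simple linear regression by ordinary least squares needs only five sums, so the heavy full-table scan can run as a single SQL aggregate inside the engine, and only those five numbers cross the wire to the client.

```python
import sqlite3

# Hypothetical illustration: fit y = intercept + slope*x from in-database
# aggregates. The scan stays inside the engine; only COUNT, SUM(x),
# SUM(y), SUM(x*x), and SUM(x*y) are returned to the client.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)])  # y = 1 + 2x

n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM points"
).fetchone()

# Closed-form ordinary least squares from the five aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)  # 2.0 1.0
```

An AI-centric query system could express the whole fit as one high-level statement and parallelize the aggregation across shards, which is exactly the optimization a bespoke script forgoes.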
2020 will be a year dominated by data-driven innovations, and I, for one, can’t wait to see how it all plays out.
Brian Bulkowski, CTO at Yellowbrick Data, has deep industry expertise in distributed systems, databases, Flash storage, and high-performance networking. At Yellowbrick, he is in charge of both long-term technical strategy and product direction. Previously, Brian co-founded Aerospike, the NoSQL database company, and served as its CTO, where he was responsible for growing the Aerospike customer base and building its technology roadmap.