With Impala now GA, Cloudera’s CEO sizes up the SQL-on-Hadoop market

There is no shortage of confidence in the Hadoop space, and market leader Cloudera bolstered its own on Tuesday with the general availability of its Impala SQL query engine for Hadoop. And if CEO Mike Olson’s comments are any indication, we’re in for a long ride of competitive jockeying and one-upmanship as Cloudera and its peers go all Microsoft(s msft) or Google(s goog) and create myriad new data-processing engines to turn their Hadoop distributions into bona fide platforms.

Launched as a private beta in May 2012 and made public in October, Impala is Cloudera’s attempt to address the growing demand for interactive SQL analytics on Hadoop data. It’s essentially a massively parallel database designed to share the same storage platform and metadata as Hadoop MapReduce, but it runs as its own separate processing engine.

How Impala fits in

Impala actually uses the same “nearly ANSI” version of SQL as does current standard bearer Hive, but that technology (created by Facebook(s fb) in 2009 as a data warehouse layer for Hadoop) doesn’t run nearly fast enough to sate many users’ desire for interactive analytics. This is because Hive transforms SQL queries into MapReduce jobs, meaning every one is processed against the entire corpus of data in the Hadoop Distributed File System.

Sizing up the competition

Cloudera isn’t the first to have the idea, though, nor is it alone in trying to sell interactive SQL on Hadoop. The idea was first commercialized by Boston-based startup Hadapt in 2011, and is now being pushed by numerous startups and larger Hadoop players. Among them: Pivotal (formerly EMC(s emc)) Greenplum, MapR (with Drill), Hortonworks (with Stinger), Drawn to Scale, Splice Machine, Jethro Data and Citus Data.

Hadapt's architecture

But Cloudera is arguably the biggest name pushing SQL on Hadoop, and CEO Mike Olson thinks Impala stands out for several reasons, not the least of which is that it exists as a product. “Nobody else is shipping production-grade SQL query support on Hadoop,” he told me during a recent call. “At least not in open source.” He seems content to let the startups do their things, instead focusing his attention on Cloudera’s big three Hadoop-distribution competitors in Pivotal, MapR and Hortonworks. Greenplum and Pivotal SVP Scott Yara was full of confidence (and R&D budget) when the company announced the Pivotal HD distribution and HAWQ technology in February, but Olson claims the approach requires a siloed DBMS within HDFS and is a “rearguard defensive strategy” to protect the company’s sunk costs in its database technology.

The Pivotal HD and Hawq architecture

As for Hortonworks, Olson questions the wisdom of its Stinger initiative to boost Hive’s speed, noting that “Hive never got good while it was running standalone on MapReduce.” Hortonworks also partners with vendors such as Teradata to let their platforms access Hadoop data in its native format, but those approaches still require sending data over the network. “It’s not the way you would build it if you woke up in the 2000s and were building this anew,” Olson said.

The Stinger roadmap

Olson acknowledged that the MapR-led Apache Drill project is cut from the same cloth as Impala (that is, being a Google Dremel clone designed specifically for Hadoop), but “the difference is we’re shipping code.” Being generally available and ready for production workloads means Cloudera can lock down users and market share before many even have a chance to experiment with Drill. He all but dismissed questions over the readiness of Impala, spurred by rumblings in the Hadoop space that Cloudera rushed it into public beta in order to get on the scoreboard against more fully baked offerings.

“I don’t feel we’re under the gun competitively to pull it out of beta because no one else has product in the market,” Olson said. “I have no problems … calling this GA quality.” He did, however, acknowledge that Impala is shipping with a “minimum viable feature set” that the company has plans to build on in the near future. Impala Senior Product Manager Justin Erickson noted a few issues of concern, including around the number of concurrent users Impala can support, but said they have been addressed during the beta period.

One piece of a larger platform

Really, though, the whole point of Impala and its competitors is to turn Hadoop from a tool for batch analytics and mass storage into a platform that can handle nearly all of companies’ data-processing needs. In that regard, it appears we’re just getting started. Cloudera, MapR, Pivotal Greenplum and Hortonworks are already pushing their own products and projects, and Olson said “it’s absolutely our intent” to enhance Cloudera’s platform with even more open-source products, perhaps even more database technologies a la HBase, that will let users do more with more types of data. Over time, this strategy could result in Hadoop displacing the current breed of databases and data warehouses and becoming the single data store atop which users run whatever applications they desire. For now, though, especially when it comes to Impala and the data warehouse incumbents, Olson is taking a measured approach. “The likelihood that we’re going to knock them off in the near term,” he said, “… it would be a tough fight to win.”