Cloudera makes SQL a first-class citizen in Hadoop

Not content to watch its competitors leave it in the dust, veteran big data startup Cloudera is fundamentally changing the face of its flagship Hadoop distribution into something much more appealing. The company has developed a real-time SQL query engine called Impala that will sit alongside MapReduce as a native processing option within Cloudera’s version of Hadoop. Cloudera is the biggest and best-known Hadoop vendor around, so opening its platform up to the wide world of SQL-trained data analysts is a really big deal, even if Cloudera is a bit late to the SQL party.

From batch processing to data interaction

The business world regularly laments the circumstances that spurred Impala’s creation. I summed them up last week and again yesterday when reporting on similar products from startups Hadapt and Platfora, but the gist is that although Hadoop is more scalable and more flexible than traditional data warehouses or analytic databases, it’s also slower, harder to learn and designed for batch processing an entire data set rather than interactively querying it. Until now, the common methods for querying Hadoop were to use a custom-built language such as Hive, or to transport data from Hadoop to a data warehouse and then analyze it using traditional business intelligence software.

However, Cloudera VP of Products Charles Zedlewski was quick to point out during a recent conversation that Impala isn’t a replacement for other BI tools, just a new data source to which they can connect. If anything, it’s a replacement for Hive, which Facebook built to bring data warehouse capabilities to Hadoop, but which wasn’t really developed for public consumption as a software product. For the sake of uniformity, Impala actually uses the same SQL set as Hive, but is on average 10 times faster thanks to its purpose-built query engine that forgoes reliance on MapReduce. Small queries, Zedlewski said, can run in less than a second.

Impala has been in the making for almost two years, and Cloudera “took a lot of pains to stitch this really well in with the rest of the Hadoop stack,” Zedlewski said. Users still store data in the Hadoop Distributed File System or the HBase database, and they can still store whatever types of structured, semi-structured or unstructured data they please. Impala uses the same metadata and the same drivers as the other Hadoop components and, like almost everything else in the Hadoop world, is open source under the Apache License.

Unlike some other Hadoop startups, though, Cloudera isn’t interested in selling BI or other analytic applications. Impala (which is called Real-Time Query for customers who pay for support) is the execution engine, but it still relies on software from Cloudera partners such as Tableau, QlikTech and MicroStrategy in order to ask questions and visualize the results. “We’re sticking to our knitting as a platform vendor,” said Zedlewski, echoing a sentiment on which his boss, Cloudera CEO Mike Olson, has been bullish for years.

Different strokes move the world

I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.

Cloudera, now with more than 300 employees and annual revenue rumored to be in the hundreds of millions, is the 800-pound gorilla in the Hadoop market, and its implementation of Impala has to make it look even better to prospective customers. But Cloudera doesn’t have this space to itself. Assuming your goal is to use Hadoop as the platform for running SQL queries (as opposed to, for example, using it for ETL before putting the data in an in-memory system), there are plenty of choices on the table. And everyone’s approach is different.

For starters, bitter distribution-level rival MapR announced in August that it’s leading an open source project called Drill that provides essentially the same functionality as Impala. MapR is getting a lot of love from Hadoop users right now, and a future integration of Drill into its product lineup would add even more legitimacy. One has to suspect that Yahoo spinoff Hortonworks, not wanting to cede the innovation edge to Cloudera or MapR, will also get into the query engine game at some point. (We’ll leave the debate over whether the myriad different flavors of Hadoop constitute the beginning of a community fracture for another day.)

Like Cloudera, however, if MapR and Hortonworks decide to integrate query engines in their products, they’ll likely rely on application providers to deliver the user experience on top. For better or worse, that presently means reliance on legacy vendors until startups can get familiar with the source code and start building BI products designed to take advantage of the new capabilities. When asked about Impala as a technology for disrupting the traditional data warehouse market, Cloudera’s Zedlewski noted that existing products are often very good at what they do.

“I think it’s highly unlikely that something like Impala would really be considered an alternative of that,” he said. Those vendors don’t seem to think so either, as companies like Teradata and EMC Greenplum are telling always-improving stories about integrating their existing product lines with Hadoop.

(Image: Running a sentiment analysis in Tableau with Hadapt)

On the other end of the spectrum are startups such as Hadapt, Platfora and Birst, which have built Hadoop-based query engines on their own, independent of loyalty to any particular Hadoop distribution. These companies have a lot of smart people on board, and their technologies are for real. Platfora CEO Ben Werther, in particular, makes no bones about his goal of unseating the BI incumbents with analytics applications built from the ground up to analyze big data stored in Hadoop.

Similar, although not necessarily competitive, technologies include Spire (from Drawn to Scale) and Splice Machine. Both support some level of SQL querying and/or BI integration, although their real value comes in leveraging HBase to provide transactional capabilities that analytic databases aren’t designed to provide.

Even though all these choices and approaches might add to the confusion over how to use Hadoop and which products to choose, the result is a net gain for Hadoop as the de facto platform for big data environments, even in the face of some alternative approaches. Hadoop has changed from a batch system to an interactive query engine pretty much overnight, so although he wouldn’t comment on the competition, Zedlewski wasn’t just blowing vendor smoke when he told me, “I would argue Impala is a proof point that Hadoop as a platform has an ability to grow that no other data management platform has.”