Cloudera: Impala’s it for interactive SQL on Hadoop; everything else will move to Spark

Despite some speculation over the past few days about what it means that Cloudera wants to port the Hive SQL-on-Hadoop engine onto the Spark processing framework, Cloudera Co-founder and Chief Strategy Officer Mike Olson (pictured above) says nothing much has changed. Well, nothing has changed with regard to Cloudera’s Impala product, that is. There’s actually quite a bit happening elsewhere in the Hadoop and Spark ecosystems.

Simply put, Olson said Impala is the future of interactive SQL queries on top of Hadoop as far as Cloudera is concerned. “Impala is flat-out faster than the fastest thing Hortonworks or anyone else has ever done with Hive,” he said.

Cloudera, along with IBM, MapR and Spark startup Databricks, is working to port Hive onto Spark as an acknowledgement that Hive workloads are still very important to the company’s customer base and that “running on MapReduce, Hive really, really sucks.” But, Olson added, Hive was built to be a batch-processing engine atop MapReduce, and even though it will run faster on Spark or the Hortonworks-driven Apache Tez framework, it will still be a batch job.

(Actually, he added, Cloudera et al are committed to moving pretty much every existing MapReduce workload onto Spark, including stuff such as Sqoop and Pig. Spark is “light years better,” he noted, and “we think it will succeed MapReduce in most instances.”)

The Spark stack. Source: Databricks

Some might be asking where Shark — the Spark subproject whose name is a mashup of Spark and Hive — fits into this. Olson confirmed (actually, he pointed to a Spark Summit keynote by Databricks’ Patrick Wendell) that Databricks will sunset Shark after the next Spark release, opting instead to focus its efforts on a project called Spark SQL that the company announced in April.

Around that time, Databricks CEO Ion Stoica told database industry analyst Curt Monash the same, although he also mentioned plans to continue developing an interactive engine called BlinkDB. “[I]f I were to redraw [the Spark stack diagram], SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB,” Stoica told Monash.

Olson didn’t mention BlinkDB (although, admittedly, I didn’t ask) but he said he’s not thrilled with the idea of Spark SQL. He acknowledged that Databricks is a smart company and will likely do a competent job with Spark SQL, but added that moving Hive onto Spark is a fast process while Spark SQL is still a work in progress.

“I would rather see those guys put all their efforts into other things,” he said. “… I think Hive on Spark is going to be pretty good.”