Hortonworks lays out a future for Hive that includes transactions, Spark and sub-second queries

Conventional wisdom says that Hive isn’t fast enough for true interactive queries, but Hortonworks is promising a next-generation Hive that can handle read-write transactions, support the full set of SQL semantics rather than a subset of them and deliver query results in less than a second. It’s also getting on board with the push to integrate Hive and Apache Spark so the former can handle users’ machine-learning jobs. Hortonworks is calling this strategy, and the resulting set of new capabilities, “enterprise SQL at Hadoop scale.” The initiative is the second large effort Hortonworks has spearheaded to improve the performance of Hive. The company recently completed its goals for the initial Stinger project, which it began in 2012 and which it claims made Hive 100 times faster while also improving its functionality.

The company explains the details of how it, along with the Hadoop community, plans to pull this off in a blog post on Wednesday. It also gives a rough timeline of when the three-phase plan will be complete: ACID transactions by the end of this year; sub-second queries and Spark integration in the first half of next year; and full SQL queries along with geographically distributed queries by the end of 2015. There are a few other deliverables tied to each of those timeframes, as well.

Source: Hortonworks

A successful initiative could be a major annoyance, to say the least, for the slew of companies that have already committed untold man-hours and financial resources to building out their own SQL-on-Hadoop engines on the premise that Hive, even running on Spark, would never be fast enough. Commercially available products include Cloudera Impala, IBM Big SQL, Pivotal Greenplum and startup Splice Machine’s eponymous database technology. Open source projects and others still under development include the Facebook-built Presto, Apache Phoenix and the MapR-led Apache Drill.

The Apache Spark community is also working on its own interactive SQL engines called Spark SQL and BlinkDB.

Explaining his company’s decision to build Impala and bet the future on it with regard to interactive SQL, Cloudera Co-founder and Chief Strategy Officer Mike Olson recently told me, “Impala is flat-out faster than the fastest thing Hortonworks or anyone else has ever done with Hive.”

That might continue to be true, and it might be true that every technology mentioned will continue to be better than a new-and-improved Hive in one way or another. But it’s also true that a more-capable Hive is going to look very appealing to a lot of users who have been using Hive for years and don’t want to incorporate an entirely new technology (although the new Hive does involve some major architectural changes), or who prefer their Hadoop components to be as open as possible.

I’m not sure running SQL jobs on Hadoop is the ultimate use case for what many claim will be a revolutionary data platform, but given the size of the database market, it’s potentially a lucrative one and it’s helping highlight some big differences between each vendor’s approach to selling Hadoop.