It’s now established that major Hadoop conferences drive a news cycle, and last week’s Hadoop Summit in San Jose, Calif., fit the pattern well. Although I couldn’t be at the event itself, a few vendors briefed me on their announcements for the show. In fact, each vendor had multiple announcements for the show and, taken together, they further the 2014 trend of Hadoop becoming more ubiquitous and mature. These announcements also strengthen a less-obvious trend this year: the diversification of the Hadoop platform.
MapR has announced a new Hadoop App Gallery, something that could take a lot of the friction out of working with the Hadoop platform. A plethora of solutions for the platform has existed for a while, but discoverability for individual analyses has been a lot higher for those “in the know” than for someone, or some company, new to the ecosystem. If Hadoop is to become an enterprise standard, this kind of thing is required to get it there. MapR also announced a partnership with Syncsort, which will enable users of its Hadoop distro to offload ETL workloads to mainframe resources, another sensible, enterprise-oriented move.
As nice as it is in theory to deploy Hadoop on a bunch of server boxes that a company might source and configure on its own, many enterprise customers will prefer something more plug-and-play. So while the phrase “Hadoop appliance” may sound like an oxymoron, it is nonetheless something customers will want. Teradata’s announcement of an enhanced Teradata Appliance for Hadoop will therefore likely resonate with a number of companies, especially those that are already Teradata customers. And if those customers want to ease their deployment with professional services in addition to the power-and-ping appliance, Teradata is beefing up those offerings, too.
So what about that diversification? To begin with, Teradata is introducing its own Hadoop distribution, which it calls the Teradata Open Distribution for Hadoop (TDH). TDH, which is included with the Appliance for Hadoop, is in fact based on Hortonworks Data Platform (HDP), but with Teradata extras thrown in. Microsoft’s HDInsight distribution of Hadoop is similarly a superset of HDP, although on the Windows platform.
It’s not just for MapReduce anymore
Hadoop Summit also brought announcements of new alternatives to MapReduce on the Hadoop platform. MapR is announcing that, later this month, it will begin supporting Apache Drill (a project with which it is heavily involved) for use with its Hadoop distribution. At first blush Drill looks like another SQL-on-Hadoop solution, perhaps because, among other things, it is. But rather than requiring data in Hive format with a declared schema in HCatalog, Drill can query virtually any file in HDFS. Drill is also designed to work really well with HBase, and has full support for querying nested data inside HBase column families. Finally, Drill supports full ANSI SQL, rather than Hive’s dialect of the query language, known as HiveQL.
I’ve written recently about the growing presence of and interest in Apache Spark, which implements a distributed in-memory data engine over Hadoop. Hadoop Summit provided a great pretext for the Apache Software Foundation to announce that Spark has officially reached v1.0. Anointing the latest release of the Spark code with that 1.0 version means that Spark’s APIs will now stabilize, with breaking changes going from few to almost none. This version also brings improvements to MLlib, Spark’s machine learning library; GraphX, its graph processing component; and Spark Streaming, which now integrates with Apache Flume.
Version 1.0 of Spark also brings a new component called Spark SQL. Now, in addition to the Spark API itself, Java, Python, and Scala developers can use a Hive-compatible SQL dialect to query data in Spark. This adds support for schema-based data to the platform, and it is compatible with both Hive/HCatalog files and Parquet files stored in HDFS. The Shark project, which implements its own Hive-compatible SQL layer on Spark, will likely be refactored to use Spark SQL under the hood.
Bridge to the Enterprise
Adding to Spark’s coming of age: foundational BI player MicroStrategy’s platform is now certified on it. MicroStrategy didn’t actually brief me on this news, but its significance is clear: Spark is pervasive not just in the Hadoop world, but in the Enterprise BI sphere as well.
Hadoop is mainstreaming and rapidly becoming an in-memory platform that is Enterprise BI-friendly, with SQL capabilities that extend well beyond Hive-formatted data. And yet, for the purists, its classic MapReduce operations are still powerful, robust and well situated for continued support.