EMC Makes a Big Bet on Hadoop

EMC is throwing its weight behind Hadoop. Today, at EMC World, the storage giant announced a slew of Hadoop-centric products, including a specialized appliance for Hadoop-based big data analytics and two separate Hadoop distributions. EMC’s entry is most definitely going to shake up the Hadoop and database markets. EMC is now the largest company actively pushing its own Hadoop distribution, and it has an appliance that will put EMC (s emc) out in front of analytics vendors such as Oracle (s orcl) and Teradata (s tdc) when it comes to handling unstructured data.

EMC’s flagship Hadoop distribution is called Greenplum HD Enterprise Edition. EMC describes it as “a 100 percent interface-compatible implementation of the Apache Hadoop stack” that also includes enterprise-grade features such as snapshots and wide-area replication, a native network file system, and integrated storage and cluster management capabilities. The company also claims performance improvements of two to five times over the standard Apache Hadoop distribution.

MapR Magic

It’s noteworthy that many of these capabilities are also available in startup MapR’s HDFS alternative, and that MapR CEO John Schroeder took the stage at a morning EMC World press conference announcing the news. EMC Greenplum’s Luke Lonergan wouldn’t confirm to me that EMC’s Enterprise Edition will use MapR as the primary storage engine, but it’s not too difficult to connect the dots.

However, while the Enterprise Edition is proprietary in part, the Greenplum HD Community Edition is fully open source and still makes big improvements over what’s currently available with the Apache version. In fact, Lonergan told me, Community Edition is based on Facebook’s optimized version of Hadoop. Like Cloudera’s distribution for Hadoop, Community Edition pre-integrates Hadoop MapReduce, Hadoop Distributed File System, HBase, Zookeeper and Hive, but it also includes fault tolerance for the NameNode in HDFS and the JobTracker node in Hadoop MapReduce. These improvements are underway within Apache thanks to Yahoo (s yhoo), but they’re not included in any official release yet.

Too Much Hadoop?

I asked a couple of weeks ago whether the Hadoop-distribution market could handle all the players it now hosts, and now that question is even more pressing. As Luke Lonergan put it during the press conference, EMC is an “8,000-pound elephant” in the Hadoop space, and that should make Cloudera, IBM (s ibm), DataStax and (possibly) Yahoo seek higher ground.

For Cloudera, EMC is a major threat because it competes directly against Cloudera’s open-source and proprietary products. It even has partnerships with a large number of business intelligence and other up-the-stack vendors, some of which already are Cloudera partners. These include Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend, and VMware.

Oh, and Cloudera and Greenplum have an existing integration partnership. As Lonergan noted, “This definitely marks a change [in that relationship].” The two are now competitors, after all.

EMC vs Big Blue

IBM is still the largest company involved in selling Hadoop products, but it has yet to announce its official Hadoop distribution, whereas EMC’s Hadoop distributions will be available later this quarter. I noted recently how EMC is following IBM’s lead in acquiring capabilities across the big data stack (from Hadoop to predictive analytics), and today’s news further proves how competitive the two storage heavyweights might become in the analytics space, too.

IBM isn’t the only big-name vendor that should be worried about EMC’s new Hadoop-heavy plans, though. The EMC Greenplum HD Data Computing Appliance should make appliance makers Oracle and Teradata, as well as analytic database vendors such as HP (s hpq), ParAccel and others, quite nervous. The appliance is like the existing EMC Greenplum Data Computing Appliance, only it lets customers process Hadoop data within the same system as their Greenplum analytic database. Presently, most analytic databases and appliances integrate with Hadoop, but still suffer from the latency of having to send data over the network from Hadoop clusters to the database and back.

IBM already has integrated Hadoop with its other big data tools, including InfoSphere BigInsights, Watson and Cognos Consumer Insight, and I have to believe a version of its Netezza analytics appliance with Hadoop co-processing will be on the way shortly, possibly in conjunction with its official Hadoop distribution release.

Lonergan also noted that EMC is working closely with VMware, of which EMC is the majority stockholder, on integrating EMC’s Hadoop products with VMware’s virtualization and cloud products, as well as its GemStone distributed database software.

There still will be opportunities for community collaboration among all the open source Hadoop distributions — Cloudera, DataStax Brisk and EMC Greenplum HD Community Edition — but we’ll see how willing they are to work together now that the competition has really heated up. All of a sudden, EMC looks like the strongest Hadoop company going, and everyone else needs to figure out in a hurry how they’ll counter today’s landscape-altering news.