San Jose, Calif.-based storage startup MapR, which provides a high-performance alternative to the Hadoop Distributed File System (HDFS), will supply the storage component for EMC’s (s emc) forthcoming Greenplum HD Enterprise Edition Hadoop distribution. This alliance helps differentiate EMC from other Hadoop vendors, and gives MapR’s technology immediate credibility along with a strong distribution channel.
Today’s announcement of the licensing agreement between the two companies confirms what I suspected when EMC unveiled its Hadoop plans earlier this month, after MapR CEO John Schroeder took the stage at EMC World and EMC itself described Enterprise Edition features that closely resemble what MapR provides.
Hadoop is an Apache Software Foundation project that consists of a set of tools for storing and processing large amounts of unstructured data. The two core components are the Hadoop Distributed File System for storing data and Hadoop MapReduce for writing parallel-processing jobs.
EMC’s Hadoop strategy is distinctive, and its decision to embrace MapR is strong evidence of this. Coming into the Hadoop world with knowledge of the shortcomings of the current version of HDFS, EMC wanted a storage layer that would improve upon HDFS in terms of performance, availability and ease of use. It could have attempted to bolt on its Isilon clustered file system or used its considerable engineering talent to improve upon HDFS, but EMC spotted a quality product in MapR and jumped on it.
Another unique element of EMC’s Hadoop distribution is that rather than being based on the official Apache version of the code, it’s based on Facebook’s Hadoop code (sub req’d) that has been optimized for scalability and multi-site deployment.
Not to be outdone, commercial Hadoop pioneer Cloudera announced an HDFS partnership of its own yesterday. Cloudera Distribution of Hadoop users can now use RainStor’s data retention system to improve upon HDFS with serious compression, deduplication and compliance features. RainStor claims it can reduce the footprint of HDFS volumes by 97 percent while providing “built-in security, audit trails and granular retention and expiry policies for managing the lifecycle of stored data.” Additionally, customers can access data stored within RainStor via standard avenues such as SQL.
The two companies are taking different approaches to improving the HDFS experience. By not tethering itself to the Apache Hadoop project, EMC is able to address enterprise needs by leveraging MapR’s innate high availability, high performance, and advanced features such as mirroring and replication. Cloudera, on the other hand, is a major contributor to Apache Hadoop and will incorporate changes to the HDFS architecture and features as Apache officially adopts them. In the meantime, Cloudera can rely on partnerships like the one with RainStor to improve the HDFS experience without diverting its attention from improving the open source Apache Hadoop code.
Arguably, the primary benefit of the Cloudera approach is that it’s open source, which means customers willing to wait for HDFS improvements won’t have to pay for them. EMC’s Greenplum HD Enterprise Edition, which includes the MapR technology, will cost customers money.
As interest in Hadoop gains momentum among mainstream companies, the competition to provide the most-complete Hadoop experience is getting intense. Whether they rely almost solely on the Apache Hadoop code, as Cloudera does, or not, as with EMC, vendors need to show potential customers that they can address real-world needs. There isn’t a lot of money being spent on Hadoop products right now, but all signs point to that changing very soon, and then we’ll see whose approach carries the day.