As Big Data Takes Off, the Hadoop Wars Begin

It turns out “big data” isn’t just a buzzword, but a legitimate concern for companies across the board. Their interest in the tools to take advantage of the opportunity for analysis of all this data has sparked a land grab among established vendors and startups alike. The action is centered around Hadoop, the flagship technology for storing and processing large amounts of unstructured data.

Since Yahoo (s yhoo) open-sourced Hadoop a few years ago, the primary options for organizations wanting to take advantage of the product have been the open-source Apache Hadoop distribution, the Cloudera distribution of Hadoop, and Amazon (s amzn) Web Services’ Elastic MapReduce service. That will change soon, as everyone from EMC (s emc) and IBM (s ibm) to database startups like Hadapt and DataStax gets into the business of selling Hadoop-based technologies and services.

So far, Cloudera, which provides commercial support for its open-source distribution, as well as its own proprietary Hadoop-cluster management software, has been the only company to truly capitalize on Hadoop financially. Arguably, its success is to blame for the stiff competition it’s about to face for companies’ Hadoop attention and dollars.

Too Many Distributions

Cloudera, a private company, hasn’t released any financial details, but Wednesday at Structure: Big Data, VP of Engineering Amr Awadallah mentioned during a panel that Cloudera has more than 80 customers running Hadoop in production. The company also has technology partnerships across the data world, including with leading data warehouse, BI and database vendors, and it has raised $36 million from investors since launching in 2009. It appears other software companies have noticed all the activity around Cloudera and want in on some of the action.

IBM already has a Hadoop business that includes its own distribution, which it says is better suited for commercial users than the open-source Apache Hadoop distribution, though both IBM’s and Cloudera’s distributions are based on the Apache distribution. IBM’s offering also provides an application called InfoSphere BigSheets, which hides the complexities of Hadoop underneath a variety of advanced analytics, BI and visualization tools. Based on a few sources I spoke with at Structure: Big Data, and after reading into an advertisement in the program for the conference, it looks like EMC is getting into the game. The ad hints that EMC will announce a Hadoop product involving its new Greenplum database on May 9. It reads, “05.09.11: EMC Greenplum. Apache Hadoop.” Also at the event, two independent sources suggested members of Yahoo’s Hadoop team will be spinning off their own separate business, and there is speculation this move is somehow tied into EMC’s Hadoop plans.

IBM isn’t to be taken lightly, nor is EMC on its own, but a Yahoo spinout would be a potentially market-changing situation given the Hadoop know-how within Yahoo, which has contributed the majority of the code now included in Apache Hadoop. During a panel at Structure: Big Data, Yahoo’s VP of Cloud Architecture, Todd Papaioannou, quipped to Cloudera’s Awadallah that Yahoo will keep innovating on Hadoop and everyone could keep reselling it. Papaioannou declined to comment on the rumors of a Hadoop spinout, but did tell me via email, “I think Apache Hadoop will remain the go-to place to get access to new improvements and innovation in the core Hadoop platform. That’s exactly why we announced our ‘double down’ strategy and the work we are doing on the next generation of both Map Reduce and HDFS.”

Death by a Thousand Startups

It’s not only large vendors that Cloudera will have to fight off; its real threat is death by a thousand startups and ISVs. At Structure: Big Data, NoSQL startup DataStax announced its own open-source Hadoop distribution based on the NoSQL database Cassandra, which provides a replacement for the Hadoop Distributed File System (HDFS). DataStax says this gives users the ability to process data and feed it back to applications at extremely low latencies, which Cloudera can’t offer because Apache Hadoop environments currently reside on separate infrastructure from application servers and databases. Om wrote earlier about MapR, a startup focused on improving the performance and reliability of HDFS. Appistry is already addressing this with its own wholly distributed HDFS alternative.

Launches weren’t over yet. Another database startup called Hadapt officially launched Wednesday with a product that melds the HDFS-based HBase database with traditional RDBMS capabilities. HBase is an Apache Hadoop subproject heavily used by Facebook, and included as part of Cloudera’s Hadoop distribution. And next Tuesday, high-performance computing pioneer Platform Computing — which has a presence in many large financial data centers and 10 of the top 20 Fortune companies — will be announcing an analytics offering that applies its current cluster- and grid-management capabilities to MapReduce workloads. As noted above, management tools are where Cloudera actually makes money selling software as opposed to services.

There are several commercial alternatives to Apache Hadoop MapReduce, as well. Pervasive Software’s DataRush software is designed for writing big data workflows that take full advantage of multi-core processors. Cascading, an open-source data-processing API, sits atop MapReduce, and a startup called Concurrent offers commercial support and services for it. Amazon Web Services offers a cloud-based Hadoop service called Elastic MapReduce, which spares users the cost of buying their own gear on which to run Hadoop workloads.

Confused? Here’s a round-up of currently available Hadoop distributions:

Full-on distributions

  • Apache Hadoop
  • Cloudera’s Distribution including Apache Hadoop (that’s the official name)
  • IBM Distribution of Apache Hadoop
  • DataStax Brisk
  • Amazon Elastic MapReduce

HDFS alternatives

  • MapR
  • Appistry CloudIQ Storage Hadoop Edition
  • IBM General Parallel File System (GPFS)
  • CloudStore

Hadoop MapReduce alternatives

  • Pervasive DataRush
  • Cascading
  • Hive (an Apache subproject, included in Cloudera’s distribution)
  • Pig (a Yahoo-developed language, included in Cloudera’s distribution)

Cloudera Isn’t Flinching — Yet

Even with all this competition, however, it’s unclear whether Cloudera actually feels its iron grip on the commercial Hadoop world slipping away. CEO Mike Olson thinks a rich ecosystem of Hadoop companies is necessary if Hadoop is to grow into the multi-billion-dollar business he thinks it can be, but he sees most of that activity taking place up the stack from the foundational distribution layer where Cloudera operates. He said via email, “I believe there’s an enormous opportunity for smart companies, and even open-source projects, to build a new generation of data analysis tools on top of that platform.”

His colleague Awadallah was slightly less politic in his response when asked specifically about the DataStax distribution, stating in a video interview with my colleague Stacey Higginbotham Wednesday that he thinks DataStax’s distribution is a “big mistake,” and he doesn’t think the company can yet back up its claims of Hadoop support. He added that a better alternative to trying to reinvent the wheel in terms of Hadoop support and stability would have been for DataStax to keep its focus on Cassandra and partner with Cloudera on the Hadoop integration.


Cloudera has plenty of reason to be confident, actually. Among its ranks are Hadoop creator Doug Cutting and former Yahoo colleague Awadallah, as well as Chief Scientist Jeff Hammerbacher, who previously led Facebook’s massive data efforts, and Vertica veteran Omer Trajman. Olson himself is the former CEO of Sleepycat Software, which distributed the open-source Berkeley DB database before Oracle (s orcl) bought the company in 2006. Or, as Awadallah put it, “[W]e have the muscle to be able to back up our words with execution.” Further, as long as Facebook and Yahoo continue contributing their webscale-driven (and proven) enhancements back to Apache Hadoop, Cloudera has plenty of fuel to feed its evolution. Facebook, for example, is responsible for the popular Hive query language that gives Hadoop users a SQL-like experience many prefer to MapReduce, and, as noted above, Yahoo is currently pushing for a next-generation architecture that addresses some known performance bottlenecks with Apache Hadoop.

But the threat is real. Cloudera has partnerships with many analytics vendors, but none of the companies mentioned here are operating up the stack from Cloudera. They’re all addressing the foundational HDFS, Hadoop MapReduce and cluster-management areas where Cloudera presently does business (although IBM and EMC are operating up the stack with analytics software, too). With so many options available — and with Apache Hadoop code open to anyone who wants to use it — every vendor with aspirations of making big money in Hadoop is going to have to work extra hard to convince users they’re adding value worth paying for.
