Hadoop’s civil war: Does it matter who contributes most?

If you were going to buy a service contract for your open-source software, would you prefer your service provider actually be the certifiable authority on that very software? If “yes,” then you understand why Cloudera and Hortonworks have been playing a game of one-upmanship over the past few weeks in an attempt to prove whose contributions to the Apache Hadoop project matter most. However, while reputation matters to both companies, it might not matter as much as fending off encroachments to their common turf.

A few weeks ago, Hortonworks, the Hadoop startup that spun out of Yahoo (s yhoo) in June, published a blog post highlighting Yahoo’s — and, by proxy, Hortonworks’ — impressive contributions to the Hadoop code. Early this week, Cloudera CEO Mike Olson countered with gusto, laying out a strong case for why Cloudera’s contributions are just as meaningful, maybe more so. Yesterday, it was Hortonworks CEO Eric Baldeschwieler firing back with even more evidence showing that, nope, Yahoo/Hortonworks is actually the best contributor. The heated textual exchange is just the latest salvo in the always somewhat-acrimonious relationship between Yahoo and Cloudera, but now that Team Yahoo is in Hadoop to make money, he who claims the most expertise might also claim the most revenue.

From Olson's post.
From Baldeschwieler's post.

Hortonworks is betting its entire existence on it. With the company likely not offering its own distribution, Hortonworks will rely almost exclusively on its ability to support the Apache Hadoop code (and perhaps some forthcoming management software) for bringing in customers. This is a risky move.

To make a Linux analogy, Hortonworks is playing the role of a company focused on supporting the official Linux kernel, while Cloudera is left playing the role of Red Hat(s rht), selling and supporting its own open-source, but enterprise-grade, distribution. Maybe Hortonworks should try to be Hadoop’s version of Novell. Whatever you think about the companies’ respective business models, though, it’s clear why reputation matters.

However, I’ve been told by a couple of people deeply involved in the big data world that perhaps Hortonworks and Cloudera would be better served if they spent their energies worrying about a common enemy by the name of MapR. MapR is the Hadoop startup that has replaced the Hadoop Distributed File System with its own file system that it claims far outperforms HDFS and is much more reliable, and that already has a major OEM partner in EMC (s emc).

Ryan Rawson, director of engineering at Drawn to Scale and chief an architect for working on HBase, told me he’s very impressed with MapR and that it could prove very disruptive in a Hadoop space that has thus far been dominated by Cloudera and core Apache. “The MapR guys definitely have a better architecture [than HDFS],” he said, with significant performance increases to match.

Rawson’s rationale for finding such promise in MapR is hard to argue with. As he noted, “garage hobbyists” aren’t building out large Hadoop clusters, but rather, real companies doing real business. If MapR’s file system outperforms HDFS by 3x, that might mean one-third the hardware investment and fewer management hassles. These things matter, he said, and everyone knows that there’s no such thing as a free lunch: Even if they give away the software, Cloudera and Hortonworks still sell products in the form of services.

It’s not just MapR that’s trying to get a piece of Apache Hadoop’s big data market share, either. As I explained earlier this week, there are and will continue to be alternative big data platforms that might start looking more appealing to customers if Hadoop fails to meet their expectations.

The Apache Hadoop community, led for the most part by Hortonworks and Cloudera, has some major improvements in the works that will help it address many of its criticisms, but they’re not here yet. Does it matter which company drives the code and patches for those improvements? Yes, it does. But maybe not as much as burying the hatchet and making sure the Apache Hadoop they both rely on remains worth using.

Feature image courtesy of Flickr user aj82.