The biggest opportunity in Hadoop is capitalizing on the community

If you go to any big data conference or get in a room full of businesspeople concerned with Hadoop, the question will inevitably arise of which Hadoop vendor is going to win (whatever that means). Will it be Cloudera? Hortonworks? MapR? Intel?! As far as I’m concerned, the answer is that they all have their own strengths and weaknesses, but they’re also all struggling with a big, hairy and hugely important question: How do we innovate without cannibalizing customer support and without offending Hadoop’s open source sensibilities?

The problem is that the business of Hadoop is something like bacteria in a petri dish: It’s an experiment in building an entirely new market out of open source software, and no one is quite sure how it will evolve or what effects certain decisions will have.

Out in the real world, though — at places such as Facebook and Twitter — there are other strains of Hadoop developing. Strains that might make those lab versions a lot stronger.

Who can afford to innovate in an open source world?

To put a finer point on it, consider the case of a web company infrastructure exec with whom I was chatting recently. He swore up and down that he doesn’t always want to build his own big data software, but that there’s no place to get what he needs anywhere else. He wants what Cloudera Impala, Hortonworks/Apache Stinger and MapR/Apache Drill are promising. He just wants it better and, well, he wanted it yesterday.

But he also appreciates the challenge these vendors face. Their businesses require significant investments in sales, services/support and general community education, and trying to build something like a new database is really hard. Of the budget that does go toward product development, a good chunk probably goes toward improving the core products to address what existing customers need.

Even if they have the budget, companies still must find a way to recoup the development costs. Hadoop is an open source technology — an Apache project — at its core, and companies pushing proprietary or even open core software aren’t always greeted with open arms. Keeping everything open source maintains the status quo, but community-driven development can be slow and it can be hard to monetize.

The web as one big R&D department

Back to that web executive, the truth is that as much as he bemoans a lack of vendor innovation, his company would probably have built its own software anyhow — because that’s what big web companies do. Their real value to Hadoop vendors isn’t as customers but as R&D departments. They’re the ones doing the really interesting work around Hadoop right now, but they have limited interest in seeing any of it become commercial software of any sort.

Twitter engineers webscale big data event hosted by Facebook.

Facebook, Twitter, LinkedIn, Netflix, Yahoo and even Airbnb are all building some significant technologies — interactive SQL engines, graph engines, stream-processing engines, schedulers, cloud-based tools. Even some startup big data vendors such as Continuuity, WibiData and Mesosphere, whose founders cut their teeth in large web shops, are releasing open source software.

Occasionally these technologies become Apache projects, but often the code is just dumped into GitHub or some other online repository. It’s scattered around the web, all related but often disconnected, like a rock star’s kids. If these technologies advance, it’s within the echo chamber of these same companies as engineers mingle at events throughout Silicon Valley.

I think commercializing these projects presents a huge opportunity for someone brave enough to try. The code is out there and at least some version of it is already running in production in a cutting-edge environment (that’s what made Yahoo such a valuable contributor to Apache Hadoop during its formative years). I’ve heard of big mainstream companies asking these web companies to send their engineers in and train their IT staff on these technologies. So it seems like there’s demand.

Already, the only things most Hadoop distributions have in common are the core Apache components such as MapReduce, HDFS, YARN, HBase, ZooKeeper and so on. So why wouldn’t a vendor try to capitalize on the work of the Hadoop user community by grabbing the best stuff, forking it and turning it into revenue? It probably won’t be technically easy, but it has to be easier than starting from scratch and is certainly better than doing nothing.