Just over a month after discontinuing its Hadoop distribution to focus on the flagship Apache Hadoop project, Yahoo is already proposing changes to the Hadoop MapReduce component that could significantly improve processing performance. The proposal is fairly complex, and is laid out in detail on the Yahoo Developer Network blog, but the gist is that Yahoo wants to replace the JobTracker node, a known bottleneck in Hadoop MapReduce, with a two-pronged ResourceManager that would match applications to cluster resources more efficiently. On a higher level, though, the proposal illustrates just how beneficial Yahoo’s renewed focus on Apache Hadoop could be in the long run.
Few would argue that Hadoop isn’t a great tool for parallel processing of big data workloads, but it does have its limitations, the JobTracker and NameNode bottlenecks among them. Apache Hadoop is the most popular Hadoop distribution around (it’s even the foundation of Cloudera’s distribution), and it needs all the help it can get to continue leading the way in Hadoop development. Although user organizations like Facebook contribute some very valuable code based on their Hadoop deployments, Yahoo created Hadoop and famously relies on it to some degree for every click across its vast web presence. (Check out this video for a look at how pervasive Hadoop is within Yahoo.) I’ve said in the past that Facebook might become the next great champion of Hadoop development (subscription required), but the crown probably still rests with Yahoo. Either way, Apache Hadoop benefits greatly when organizations that rely heavily on Hadoop to run their businesses propose improvements that have been proven in real-world deployments.
Without this kind of involvement from the user community, certain aspects of Hadoop development might be left to private software companies. That isn’t particularly beneficial to organizations that want to leverage Hadoop’s FOSS nature, or to Cloudera, which must then undertake major development on its own to make Hadoop better suited for commercial users. Already, for example, Appistry offers an entirely distributed alternative to the Hadoop Distributed File System, and Pervasive Software offers an alternative to Hadoop MapReduce designed to improve performance by better leveraging multicore processors. As Yahoo’s new proposal shows, Hadoop users such as Yahoo and Facebook are always thinking about ways to improve the Hadoop experience, and funneling all of their efforts back into Apache Hadoop can only speed the pace of innovation and help Hadoop keep up with ever-advancing needs.