Facebook open sources Corona — a better way to do webscale Hadoop

Facebook (s fb) is at it again, building more software to make Hadoop a better way to do big data at web scale. Its latest creation, which the company has also open sourced, is called Corona and aims to make Hadoop more efficient, more scalable and more available by re-inventing how jobs are scheduled.

As with most of its changes to Hadoop over the years — including the recently unveiled AvatarNode — Corona came to be because Hadoop simply wasn’t designed to handle Facebook’s scale or its broad usage of the platform. What kind of scale are we talking about? According to Facebook engineers Avery Ching, Ravi Murthy, Dmytro Molkov,? Ramkumar Vadali, and Paul Yang in a blog post detailing Corona on Thursday, the company’s largest cluster is more than 100 petabytes; it runs more than 60,000 Hive queries a day; and its data warehouse has grown 2,500x in four years.

Further, Ching and company note — echoing something Facebook VP of Infrastructure Engineering Jay Parikh told me in September when discussing the future of big data startups — Hadoop is responsible for a lot of how Facebook runs both its platform and its business:

Almost every team at Facebook depends on our custom-built data infrastructure for warehousing and analytics, with roughly 1,000 people across the company — including both technical and non-technical personnel — using these technologies every day. Over half a petabyte of new data arrives in the warehouse every 24 hours, and ad-hoc queries, data pipelines, and custom MapReduce jobs process this raw data around the clock to generate more meaningful features and aggregations.

So, what is Corona?

In a nutshell, Corona represents a new system for scheduling Hadoop jobs that makes better use of a cluster’s resources and also makes it more amenable to multitenant environments like the one Facebook operates. Ching et al explain the problems and the solution in some detail, but the short explanation is that Hadoop’s JobTracker node is responsible for both cluster management and job-scheduling, but has a hard time keeping up with both tasks as clusters grow and the number of jobs sent to them increase.

Further, job-scheduling in Hadoop involves an inherent delay, which is problematic for small jobs that need fast results. And a fixed configuration of “map” and “reduce” slots means Hadoop clusters run inefficiently when jobs don’t fit into the remaining slots or when they’re not MapReduce jobs at all.

Corona resolves some of these problems by creating individual job trackers for each job and a cluster manager focused solely on tracking nodes and the amount of available resources. Thanks to this simplified architecture and a few other changes, the latency to get a job started is reduced and the cluster manager can make fast scheduling decisions because it’s not also responsible for tracking the progress of running jobs. Corona also incorporates a feature that divvies a cluster into resource pools to ensure every group within the company gets its fair share of resources.

The results have lived up to expectations since Corona went into full production in mid-2012: the average time to refill idle resources improved by 17 percent; resource utilization over regular MapReduce improved to 95 percent from 70 percent (in a simulation cluster); resource unfairness dropped to 3.6 percent with Corona versus 14.3 percent with traditional MapReduce; and latency on a test job Facebook runs every four minutes has been

Despite the hard work put into building and deploying Corona, though, the project still was a way to go. One of the biggest improvements currently being developed is to enable resource management based on CPU, memory and other job requirements rather than just the number of “map” and “reduce” slots needed. This will open Corona up to running non-MapReduce jobs, therefore making a Hadoop cluster more of a general-purpose parallel computing cluster.

Facebook is also trying to incorporate online upgrades, which would mean a cluster doesn’t have to come down every time part of the management layer undergoes an update.

Why Facebook sometimes must re-invent the wheel

Anyone deeply familiar with the Hadoop space might be thinking that a lot of what Facebook has done with Corona sounds familiar — and that’s because it kind of is. The Apache YARN project that has been integrated into the latest version of Apache Hadoop similarly splits the JobTracker into separate cluster-management and job-tracking components, and already allows for non-MapReduce workloads. Further, there is a whole class of commercial and open source cluster-management tools that have their own solutions to the problems Corona tries to solve, including Apache Mesos, which is Twitter’s tool of choice.
However, anyone who’s familiar with Facebook knows the company isn’t likely to buy software from anyone. It also has reached a point of customization with its Hadoop environment where even open-source projects from Apache won’t be easy to adapt to Facebook’s unique architecture. From the blog post:

It’s worth noting that we considered Apache YARN as a possible alternative to Corona. However, after investigating the use of YARN on top of our version of HDFS (a strong requirement due to our many petabytes of archived data) we found numerous incompatibilities that would be time-prohibitive and risky to fix. Also, it is unknown when YARN would be ready to work at Facebook-scale workloads.

So, Facebook plods forward, a Hadoop user without equal (save for maybe Yahoo (s yhoo)) left building its own tools in isolation. What will be interesting to watch as Hadoop adoption picks up and more companies beging building applications atop it is how many actually utilize the types of tools that companies like Facebook, Twitter and Quantcast have created and open sourced. They might not have commercial backers behind them, but they’re certainly built to work well at scale.

Feature image courtesy of Shutterstock user Johan Swanepoel.