Microsoft to open source a big data framework called REEF

Microsoft has developed a big data framework called REEF (a graciously simple acronym for Retainable Evaluator Execution Framework) that the company intends to open source in about a month. REEF is designed to run on top of YARN, the next-generation resource manager for Hadoop, and is particularly well suited for building machine learning jobs.

Microsoft Technical Fellow and CTO of Information Services Raghu Ramakrishnan explained REEF and Microsoft’s plans to open source it during a Monday morning keynote at the ACM Knowledge Discovery and Data Mining conference, taking place in Chicago.

YARN is a resource manager developed as part of the Apache Hadoop project that lets users run and manage multiple types of jobs (e.g., batch MapReduce, stream processing with Storm and/or a graph-processing package) atop the same cluster of physical machines. This makes it possible not only to reduce the number of systems that an organization has to manage, but also to run different types of analysis on the same data from the same place. In some cases, the entire data workflow can be carried out on just one cluster of machines.


However, Ramakrishnan explained, some types of jobs, such as machine learning, aren’t ideal for frameworks such as YARN because they have specific requirements around data movement, task monitoring and the ability to iterate on a previous set of results rather than starting anew every time. Ramakrishnan said REEF, a set of libraries that runs on top of YARN, will solve some of these problems, although he didn’t go into much detail on exactly how it works.

One thing he did explain, however, is that REEF is broken into two main parts: Evaluators, which are YARN containers that host REEF services, and Activities, which are the user code that runs inside an Evaluator. He showed a sample workflow in which YARN would spin up an Evaluator and the Activity code would run inside it and complete. The same Evaluator could then be spun up again with its original state intact, so other Activities could run against its data. Presumably, those Activities could be anything from a SQL query to another machine learning algorithm.
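To make the Evaluator and Activity split concrete, here is a minimal toy sketch in Python. It is not REEF's API, which had not been released at the time of the keynote; the `Evaluator` class, its `submit` method and the `load_data` and `compute_mean` functions are all hypothetical names invented for illustration. It only models the idea described in the talk: a long-lived container keeps its state, so a second Activity can reuse data that the first one loaded.

```python
from typing import Any, Callable, Dict, List

# Hypothetical type: an Activity is user code that receives the
# Evaluator's retained state. This is an assumption, not REEF's API.
Activity = Callable[[Dict[str, Any]], Any]


class Evaluator:
    """Toy stand-in for a retained container (illustrative only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, Any] = {}  # survives across Activities
        self.history: List[str] = []     # names of Activities that have run

    def submit(self, activity: Activity) -> Any:
        """Run one Activity to completion against the retained state."""
        result = activity(self.state)
        self.history.append(activity.__name__)
        return result


def load_data(state: Dict[str, Any]) -> int:
    # First Activity: load a dataset once and leave it in the Evaluator.
    state["points"] = [1.0, 2.0, 3.0, 4.0]
    return len(state["points"])


def compute_mean(state: Dict[str, Any]) -> float:
    # Second Activity: reuse the retained data instead of reloading it.
    points = state["points"]
    return sum(points) / len(points)


evaluator = Evaluator("evaluator-0")
loaded = evaluator.submit(load_data)
mean = evaluator.submit(compute_mean)
```

In this sketch the second Activity finds `points` already in place, which is the property that would matter for iterative machine learning jobs: each pass can build on the previous one rather than starting anew.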

A presentation slide on REEF's components and capabilities.

In theory, REEF is an interesting technology in that it attempts to solve some remaining problems companies face as they try to do ever more analysis of their data. I expect we’ll hear a lot more about how REEF works whenever the company releases it. However, REEF is also noteworthy because of how strongly Microsoft has embraced Hadoop, of which YARN is very much a part, and the open source community in general. A couple of years ago, Microsoft was working on a proprietary alternative to Hadoop. Today, it’s taking the Hadoop community’s work and trying to enhance it out in the open.

Update: This post was updated at 8:10 a.m. on Aug. 13 to correct the name of the conference.