Cloudera is rebuilding machine learning for Hadoop with Oryx

Hadoop software vendor Cloudera didn’t make a lot of waves when it bought a London-based startup called Myrrix last year, and it hasn’t made a lot of noise about the startup’s machine learning technology since then. But that technology and Myrrix’s founder, Sean Owen, could turn out to be very valuable assets.

Owen, whose official title is director of data science, now spends his time working on an open source machine learning project called Oryx. (An oryx is a type of antelope; Cloudera also sells a product called Impala.) Oryx is intended to help Hadoop users build machine learning models and then deploy them so they can be queried and serve results in real time, say as part of a spam filter or a recommendation engine. Ideally, Oryx will also support models that can update themselves as data streams in.

Owen calls it the difference between Hadoop’s traditional sweet spot of exploratory analytics (playing with data and looking for interesting patterns) and operational analytics.

Sean Owen. Source: Lanyrd

“Once I’ve figured out how to model fraud on my website, I probably want to do something with it,” he explained. “… We should have a way in Hadoop to build models at scale, but also to implement models at scale.”

Apache Mahout, the traditional avenue for building machine learning models in Hadoop, “has reached the end of its road,” Owen said. It’s stuck in a batch-only, first-generation MapReduce era, and it requires a lot of work on users’ part to get a working system in place. “Myrrix [which is a rewrite of Mahout] is what I always wanted Mahout to be,” he said, adding that if Mahout were really working well, Cloudera probably wouldn’t have acquired Myrrix. Oryx is about 90 percent Myrrix code, with some post-acquisition code added.

Open, easy recommendation engines, anyone?

Rather than trying to make Oryx a broad library of machine learning algorithms, Owen is focused on four big ones: regression, classification, clustering and collaborative filtering (aka recommendations). Owen said the last one is the most popular right now, and he’s working with some Cloudera customers on using Oryx to implement recommendation systems. In fact, about 80 percent of Oryx users are trying to build recommendation engines.

Making Oryx a standard tool for building recommendation systems would be a big boon for the project’s popularity. Recommendations, of course, are standard fare on Netflix, Amazon and just about every other popular website, but there’s a surprising dearth of standard, open source tools for building them.

It’s not quite a race, but others are trying to standardize recommendations as well. For example, cloud startup Mortar Data is currently seeking 15 companies to work (for free) with prominent data scientists on building custom recommendation engines. It’s a project the company started last year and hopes will surface best practices that can improve its open source recommendation framework. Other companies, such as Expect Labs, aren’t going open source, but are trying to automate recommendations via artificial intelligence APIs.

The Oryx architecture

Still a project, not a product

Owen thinks all of Cloudera’s customers (and probably most Hadoop users) will want to do operational machine learning eventually — and for a lot more than recommendations — and Oryx could be the tool that helps them do it. However, he’s quick to note, “In a way, it’s still a bit of a labs project.”

Right now, for example, Owen is spending a lot of time contributing to the Apache Spark project because he plans to rewrite Oryx to make Spark the primary processing framework instead of MapReduce. “There’s actually a lot of reasons to be interested in Spark from a machine learning point of view,” he said. “… I’d much rather put my energies there.”

He’s not alone. As we have explained, Spark is becoming a popular choice for next-generation big data applications and companies such as Cloudera and Hortonworks are embracing it as a big part of Hadoop’s future. Cloudera CEO Tom Reilly will join a bevy of other big data CEOs, data scientists and CIOs at our Structure Data conference in March to talk about the future of the Hadoop platform (including Spark’s role), and actual applications of machine learning to transform businesses and society.

For all its promise, though, Owen doesn’t think we should expect to see Oryx make its way into Cloudera’s Hadoop distribution or product lineup anytime soon. “Customers want advice, service and training, and that will morph into software,” he said. But right now: “[I]t’s not anywhere near that.”

“It’s still early for the majority of the Hadoop-consuming market to embrace data science,” he said, “let alone operational real-time machine learning.”

Feature image courtesy of Wikimedia Commons user Thepedestrian.