Cloudera adds search to Hadoop distro and says it’s just getting started

According to Cloudera CEO Mike Olson, his company has “decades” in front of it in which to enhance its Hadoop platform to become the go-to place for data storage and analysis. At a Tuesday event in San Francisco, Cloudera announced the latest feature meant to further that strategy — full-text search. It comes just weeks after the company’s Impala interactive SQL query engine became publicly available.

The general idea behind adding search (something competitor MapR actually did in May) is to let people without deep technical skills find the information they need within a Hadoop cluster in a way that’s familiar to them. “You don’t even have to understand what SQL is. You can just type words into a box,” Olson said during a recent phone call, comparing Cloudera’s search to the process of finding information online or within your Gmail history.

Cloudera CEO Mike Olson at Structure: Data 2012
(c) 2012 Pinar Ozger

“Think about it,” he added. “You got a petabyte of data, you can’t use folders anymore.”

Even though search is easier than SQL, it seems pretty obvious that hourly workers and front-desk staff probably won’t be rooting around in Hadoop searching for data (although that’s possible in theory if the right application were in place).

However, a couple of examples from the search feature’s private beta users (it’s now available in public beta and will be generally available in the third quarter) help illustrate what Olson is talking about and how it might apply in corporate settings. Agribusiness giant Monsanto is using search to index, and later retrieve information from, its collections of images that track plant characteristics through their lifecycle, a process that used to require lots of manual work within a database not designed to handle images and metadata. Health care customer Explorys is using Cloudera’s search tool to consolidate and index its server logs so it can track down IT issues more easily and maintain SLAs for its applications.

Discussing MapR’s new search feature in April, VP of Marketing Jack Norris suggested a use case wherein users might use MapReduce to cluster a group of customers and then use search to drill down further into their behavior.

Cloudera’s search is powered by the Apache Solr project, which happens to be based on the Apache Lucene project that Cloudera Chief Architect Doug Cutting founded before he founded Hadoop. Exact features of Cloudera Search, as well as a quote from private beta user Dell, are available in the product’s press release. MapR’s search is powered by LucidWorks, a commercial search platform based on the Solr and Lucene projects.


Olson said Cloudera is dedicated to continually improving the capabilities of Solr now that it’s officially part of the company’s Hadoop distribution. When asked about predictive and semantic search of the kind consumers now experience with Google (S GOOG) and Microsoft (S MSFT) Bing, he pointed to a feature called Navigator (which keeps track of who touched pieces of data, what systems it passed through, what types of queries people run on it and various other attributes) as the possible foundation for such features in an enterprise environment. He’s not sure exactly what that might look like in practice, but, Olson added, “I think there’s lots of opportunity for advancement there.”

One platform to rule them all?

The bigger picture here, though, is the encroachment of the open source Hadoop technology, whether sold by Cloudera, Hortonworks, MapR or whomever, into the lucrative data management and analytics space once (and still) dominated by vendors selling expensive software and big-iron systems. For now, Olson said, technologies like Cloudera Impala and the new search feature will be less functional than their legacy counterparts (Teradata (S TDC) for data warehousing and Autonomy (S HPQ) for enterprise search, for example), but that could change over time.

“We have decades of life in front of this company in order to enhance [our platform],” Olson said.

Further, inertia can kick in as companies place more data into Hadoop, making it less appealing to move that data onto another system if the work can just as easily be done within Hadoop. When it comes to how many workloads and how much money Hadoop could ultimately steal from legacy vendors, Olson — like his peers at other Hadoop vendors — is hedging for the time being. “I don’t want to make audacious and unsupportable claims,” he said. “… We can all make up numbers.”

However, he did take some credit for Teradata’s recent lackluster quarter(s), stating that even though Cloudera customers aren’t ripping out their legacy systems, they’re also not really investing more money into them. “It is true to say folks are looking at what they’re running on Teradata and rationalizing those decisions,” Olson said. “… [They’re trying to] concentrate first-class spend on a first-class workload.”

For more on how the SQL-on-Hadoop community, specifically, intends to take on the legacy vendors, check out this panel from our Structure: Data conference in March.