For the future of big data startups, look to Facebook

For any startups trying to divine where the big data space is headed and where to focus their energies, there are worse places to look than Facebook. The company collects a lot of data, and in order to handle that data it has created, among other things, the Cassandra NoSQL data store and the Hive query language for Hadoop. Its Hadoop cluster currently stores more than 100 petabytes of user data. If there’s a good idea for an application to make big data technologies even more useful, chances are Facebook is already working on it.

Opportunity 1: Democratize Hadoop

The opportunity, of course, lies in taking those ideas and turning them into products that the business world will eat up. Or, as Accel Partners’ Ping Li put it during a recent phone call, there has been a lot of innovation at the infrastructure layer with Hadoop and NoSQL, “but I’ve had this constant search for ‘now what?’” Some of these applications are popping up (they’re the types of things Accel is looking to invest in via its Big Data Fund), and Li thinks what’s going on inside Facebook could serve as the inspiration for even more.

“Pretty much everyone at Facebook is talking to data that’s coming out of a Hadoop cluster,” he said, and they’re not all writing MapReduce jobs.

Fortunately, Facebook VP of Infrastructure Engineering Jay Parikh was on the same call, and he shared a glimpse into how Facebook is boiling big data down to bite-size pieces. Essentially, he said, Facebook uses Hadoop for just about everything, from friend recommendations to ad targeting to analyzing the efficiency of its data centers. But serving all of these uses means making sure the users in each department can actually interact with Hadoop in a meaningful manner.

Thanks to a collection of custom-built tools, user interfaces and visualization layers, he said, “we have a lot of [non-technical] users at Facebook that are able to run reports and view analytics [powered by Hadoop].” Already, a couple of former Facebook employees who helped invent Hive have launched Qubole, a cloud-based version of Hive that provides on-demand access to Hadoop with Hive’s signature SQL interface, but there’s a lot more that can be done.

And having easy access to Hadoop isn’t just about adding another tool to someone’s belt; ideally, it’s also about getting rid of a couple of others. Doing big data right means doing it efficiently, Parikh said, so Facebook puts a lot of effort into designing tools that let users do things that previously might have required multiple products. The tools might be different from what users are used to, but hopefully they let users innovate even faster.

Opportunity 2: Look beyond Hadoop

Once you get outside the realm of established infrastructure tools such as Hadoop and NoSQL stores, though, things start to open up. “We have many things in the oven,” Parikh said, noting Facebook’s heavy use of MySQL, a graph database it has built and the new types of backends it had to build for Timeline and Newsfeed. “A lot of this boils down to the different needs of the project.”

Li agreed, noting the number of startups he sees that want to use Hadoop because it’s free and open source, but that end up having to do a lot of their own work to make Hadoop do what they want. Some, like Precog, which I recently profiled, opt to build their own products from scratch rather than try to fit them into Hadoop’s mold. “There are plenty of big data problems that have nothing to do with Hadoop, in my opinion,” Li said.

Still, there’s a balancing act that startups must perform when it comes to picking a platform on which to build applications. “The one thing I do caution entrepreneurs about is that good-enough is the enemy of great,” Li said, before adding that you won’t always be able to out-innovate community support for what’s already out there.

Opportunity 3: Go big. Like, data center big.

Anyone feeling particularly ambitious, however, might look at Facebook’s new deep-storage data center strategy and try to figure out a way to take that mainstream. The strategy, which emerged in August — and which I assume will come up when Parikh discusses Facebook’s infrastructure strategy at our Structure: Europe conference next month — involves designing data centers from the ground up to handle longer-term data storage for rarely accessed information instead of a steady stream of web transactions. The hardware, network and data center designs all need to be rethought for what Parikh calls the “changing temperature of data.”

“It’s not incremental [change],” he said. “I actually think it’s very different.” Energy-dense data centers that try to squeeze out every last bit of power for computing will give way to ones that need far less power for processing, but still need to deliver data to users and analytic engines when it’s needed. Parikh calls it a huge challenge that could become very important to all businesses as more of them decide to keep data around for regulatory purposes, to serve users or just for rainy-day analytics projects.

The good news for startups: Facebook will open source some of its design work via the Open Compute Project, and some of the data-management work will manifest itself in the Apache Hadoop project. They just need to do the rest.
