How Twitter is doing its part to democratize big data

Updated: Twitter has been on a tear lately when it comes to open sourcing big-data tools. The latest two are Cassie — a Scala client for managing Twitter’s 1,000-plus-node Cassandra cluster — and Scalding — a MapReduce framework for simplifying the creation of Hadoop jobs. If you think big data will be black magic forever, think again.

Twitter has been fairly active on the open source front for the past few years, and because it works with so much data, it has released a lot of tools for doing just that. Among its various open source contributions are Gizzard, a middleware framework for distributed databases; FlockDB, a graph database of sorts for managing the Twitter social graph; and Storm, a stream-processing engine to handle data in real time.

Of the two latest releases, Scalding is probably the more interesting because of the general fervor over Hadoop across the IT world. In a recent Twitter Engineering blog post, Twitter data scientist Edwin Chen described Scalding this way:

Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like [Apache] Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala — which means all the power of Scala and the JVM is already built-in. No more UDFs, folks! …

In 140: Instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code …

Chen also illustrates some simple use cases for Scalding, such as measuring the similarity between people’s movie interests or their Foursquare check-ins. In the movie example, Chen shows the code necessary to collect and parse the data, as well as a simple command to actually run the job in Hadoop.
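To give a rough sense of the kind of computation involved, here is a minimal sketch in plain Python (not Scalding, and not Chen's code). It assumes a tiny, made-up list of (user, movie) pairs and scores each pair of movies by the Jaccard overlap of their audiences, which is just one simple stand-in for "similarity." The data, the function name and the choice of metric are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: which user watched which movie.
ratings = [
    ("alice", "Alien"), ("alice", "Blade Runner"), ("alice", "Casablanca"),
    ("bob", "Alien"), ("bob", "Blade Runner"),
    ("carol", "Blade Runner"), ("carol", "Casablanca"),
]

def jaccard_similarities(pairs):
    """Return {(movie_a, movie_b): Jaccard similarity of their audiences}."""
    # "Map"-like step: group users by the movie they watched.
    fans = defaultdict(set)
    for user, movie in pairs:
        fans[movie].add(user)

    # "Reduce"-like step: compare every pair of movies.
    sims = {}
    for a, b in combinations(sorted(fans), 2):
        shared = len(fans[a] & fans[b])
        if shared:  # skip pairs with no viewers in common
            sims[(a, b)] = shared / len(fans[a] | fans[b])
    return sims

similarities = jaccard_similarities(ratings)
for (a, b), score in sorted(similarities.items()):
    print(f"{a} / {b}: {score:.2f}")
```

On a laptop-sized data set this is trivial. The appeal of a framework like Scalding is that similarly compact, readable code can be run as Hadoop jobs over data far too large for one machine.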

Update: On Thursday afternoon, Twitter added to its library of open source contributions with Cassovary. Twitter’s Pankaj Gupta describes it as “a big graph-processing library for … large-scale graph mining and analysis.” At Twitter, he wrote, “Cassovary forms the bottom layer of a stack that we use to power many of our graph-based features, including ‘Who to Follow’ and ‘Similar to.’ We also use it for relevance in Twitter Search and the algorithms that determine which Promoted Products users will see.”

The moral of this story, of course, isn’t so much what Twitter is doing as it is the democratization of big data technologies. From startups to large software vendors to web companies like Twitter, tools are emerging that should make analytics on large data sets doable by individuals who don’t bear the job title “data scientist.”

When we plan conferences such as Structure:Data, which takes place later this month in New York, we’re always looking toward the future. The big data space is advancing so fast, it’s difficult to tell where the cutting edge will be a few years from now. What’s next when skills such as building recommendation engines and ad-targeting systems become commonplace or, better yet, services, and when managing distributed systems becomes child’s play?
