Welcome to Berkeley: Where Hadoop isn’t nearly fast enough

Tucked within the computer science department at the University of California, Berkeley, there’s an institution called AMPLab that’s making a name for itself by, among other things, essentially rebuilding the Hadoop platform, only faster.

Results for linear regression test

AMPLab’s best-known product in the big data space, called Spark, is an in-memory parallel processing framework that’s comparable to Hadoop MapReduce except, its creators claim, it is up to 100 times faster. Because it runs in-memory, Spark might also be compared with something like Druid or SAP’s HANA system. Spark is the processing engine that powers ClearStory’s next-generation analytics and visualization service.

Like Hive as a data warehouse for Hadoop? Then you’ll love Shark, which is short for “Hive on Spark.”

Even as Hadoop gets more flexible thanks to new features such as YARN, which would technically allow it to run an alternative framework like Spark, AMPLab has its own cluster-management project called Mesos. Twitter is a big fan of Mesos, which is also an Apache Incubator project.

AMPLab’s latest target is the Hadoop Distributed File System, or HDFS. HDFS has long been criticized for availability and speed, so AMPLab created an alternative called Tachyon (hat tip to High Scalability for calling my attention to it). According to the Tachyon homepage, “it offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed.”
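To make the lineage idea a bit more concrete, here is a toy sketch in Python. It is not Tachyon's code or API: the `LineageStore` class, its method names and the `recomputations` counter are all hypothetical, and the "durable" dictionary merely stands in for whatever persistent storage holds the original input. The point is only to show the general technique: keep working data in memory, remember how each derived dataset was produced, and rebuild a lost copy by replaying that recipe rather than reading a replica.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Toy in-memory store that rebuilds lost datasets from lineage.

    Hypothetical illustration only; not Tachyon's actual design or API.
    """

    def __init__(self) -> None:
        self._memory: Dict[str, list] = {}   # fast, but contents can be lost
        self._durable: Dict[str, list] = {}  # stand-in for persistent input storage
        # For each derived dataset: the function that built it and its input names.
        self._lineage: Dict[str, Tuple[Callable[..., list], List[str]]] = {}
        self.recomputations = 0

    def put_source(self, name: str, data: list) -> None:
        """Register an original input; it is kept durably and cached in memory."""
        self._durable[name] = list(data)
        self._memory[name] = list(data)

    def derive(self, name: str, fn: Callable[..., list], inputs: List[str]) -> None:
        """Compute a dataset from other datasets and record how it was made."""
        self._lineage[name] = (fn, list(inputs))
        self._memory[name] = fn(*[self.get(i) for i in inputs])

    def evict(self, name: str) -> None:
        """Simulate losing the in-memory copy (eviction or node failure)."""
        self._memory.pop(name, None)

    def get(self, name: str) -> list:
        """Return a dataset, rebuilding it from lineage if memory lost it."""
        if name in self._memory:
            return self._memory[name]
        if name in self._durable:
            data = list(self._durable[name])
        else:
            fn, inputs = self._lineage[name]
            data = fn(*[self.get(i) for i in inputs])
            self.recomputations += 1
        self._memory[name] = data
        return data


store = LineageStore()
store.put_source("logs", [3, 1, 2])
store.derive("sorted", sorted, ["logs"])
store.derive("doubled", lambda xs: [2 * x for x in xs], ["sorted"])

# Lose both derived datasets, then ask for the last one again.
store.evict("sorted")
store.evict("doubled")
result = store.get("doubled")
```

After the two evictions, asking for `doubled` forces the store to replay both steps from the original `logs` input, so `result` comes back as `[2, 4, 6]` and `recomputations` ends at 2. A real system has to decide when recomputing is cheaper than keeping extra copies; this sketch ignores that trade-off entirely.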

AMPLab isn’t the first to question the cult of HDFS, though. There are numerous commercial options available, and Quantcast built its own open source file system that it claims is faster and more efficient when running at massive scale.

But it’s probably unfair to call AMPLab’s projects competitors to Hadoop. They’re certainly alternatives, but they’re also complementary, as Twitter’s heavy use of Hadoop and Mesos demonstrates. And Spark, Shark, Mesos and Tachyon are all compatible with their peer projects from the Apache Hadoop project.

Really, AMPLab is doing what any research institution does by pushing the limits of the current commercially available software. If it happens to disrupt the status quo, then so be it. For users, though, it’s just providing some new options to play around with as they try to find the right tool for their particular jobs. Its partners and sponsors, including Google, Facebook, Microsoft and Amazon Web Services, certainly have an interest in finding the best possible technology, or creating it if necessary.

The MLBase architecture.

Other related AMPLab projects include PIQL, a SQL-like query language that sits atop a key-value store; MLBase, a system for doing machine learning on distributed systems; Akaros, an operating system for manycore and large SMP systems; and Sparrow, a cluster-scheduling system designed for low-latency computing.