MapReduce was fun and pretty useful while it lasted, but it looks like Spark is set to take the reins as the primary processing framework for new Hadoop workloads. The technology took a meaningful, if not huge, step toward that end on Thursday when the Apache Software Foundation announced that Spark is now a top-level project.
Spark has already garnered a large and vocal community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program. This means it’s well suited for next-generation big data applications that might require lower-latency queries, real-time processing or iterative computations on the same data (e.g., machine learning). Spark’s creators from the University of California, Berkeley, have created a company called Databricks to commercialize the technology.
Spark is technically a standalone project, but it was always designed to work with the Hadoop Distributed File System. It can run directly on HDFS or inside MapReduce and, thanks to YARN, can now run alongside MapReduce jobs on the same cluster. In fact, Hadoop pioneer Cloudera is now providing enterprise support for customers who want to use Spark.
However, MapReduce isn’t yesterday’s news quite yet. Although many new workloads and projects (such as Hortonworks’ Stinger) use alternative processing frameworks, there’s still a lot of tooling for MapReduce that Spark doesn’t have yet (e.g., Pig and Cascading), and MapReduce is still quite good for certain batch jobs. Plus, as Cloudera co-founder and Chief Strategy Officer Mike Olson explained in a recent Structure Show podcast (embedded below), there are a lot of legacy MapReduce workloads that aren’t going anywhere anytime soon even as Spark takes off.
If you want to hear more about Spark and its role in the future of Hadoop, come to our Structure Data conference March 19-20 in New York. Databricks co-founder and CEO Ion Stoica will be speaking as part of our Structure Data Awards presentation, and we’ll have the CEOs of Cloudera, Hortonworks, and Pivotal talking about the future of big data platforms and how they plan to capitalize on them.
[soundcloud url="https://api.soundcloud.com/tracks/134475418?secret_token=s-1R5EO" params="color=ff5500&auto_play=false&hide_related=false&show_artwork=true" width="100%" height="166" iframe="true" /]
Featured image from Thinkstock/Loops7