Parallel Programming in the Age of Big Data

We’re now entering what I call the “Industrial Revolution of Data,” where the majority of data will be stamped out by machines: software logs, cameras, microphones, RFID readers, wireless sensor networks and so on. These machines generate data a lot faster than people can, and their production rates will grow exponentially with Moore’s Law. Storing this data is cheap, and it can be mined for valuable information.

In this context, there is some good news for parallel programming. Data analysis software parallelizes fairly naturally. In fact, software written in SQL has been running in parallel for more than 20 years. But with “Big Data” now becoming a reality, more programmers are interested in building programs on the parallel model — and they often find SQL an unfamiliar and restrictive way to wrangle data and write code. The biggest game-changer to come along is MapReduce, the parallel programming framework that has gained prominence thanks to its use at web search companies.

parallel-dataflowTo understand where we’re headed with parallel software, let’s look at what the computer industry has already accomplished. The branch of parallel research that has had the most success in the field is parallel databases. Rather than requiring the programmer to unravel an algorithm into separate threads to be run on separate cores, parallel databases let them chop up the input data tables into pieces, and pump each piece through the same single-machine program on each processor. This “parallel dataflow” model makes programming a parallel machine as easy as programming a single machine. And it works on “shared-nothing” clusters of computers in a data center: The machines involved can communicate via simple streams of data messages, without a need for an expensive shared RAM or disk infrastructure.

The MapReduce programming model has turned a new page in the parallelism story. In the late 1990s, pioneering web search companies built new parallel software infrastructure to manage web crawls and indexes. As part of this effort, they were forced to reinvent parallel databases –- in large part because the commercial database products at the time did not handle their workload well. Like SQL, the MapReduce framework is a parallel dataflow system that works by partitioning data across machines, each of which runs the same single-node logic.

SQL provides a higher-level language that is more flexible and optimizable, but less familiar to many programmers. MapReduce largely asks programmers to write traditional code, in languages like C, Java, Python and Perl. In addition to its familiar syntax, MapReduce allows programs to be written to and read from traditional files in a filesystem, rather than requiring database schema definitions. MapReduce is such a compelling entryway into parallel programming that it is being used to nurture a new generation of parallel programmers. Every Berkeley computer science undergraduate now learns MapReduce, and other schools have undertaken similar programs. Industry is eagerly supporting these efforts.

Technically speaking, SQL has some advantages over MapReduce, including natural combinations of multiple data sets, and the opportunity for deep code analysis and just-in-time query optimizations. In that context, one of the most exciting developments on the scene is the emergence of platforms that provide both SQL and MapReduce interfaces within a single runtime environment. These are especially useful when they support parallel access to both database tables and filesystem files from either language. Examples of these frameworks include the commercial Greenplum system (which provides all of the above), the commercial Aster Data system (which provides SQL and MapReduce over database tables), and the open-source Hive framework from Facebook (which provides a SQL-like language over files, layered on the open-source Hadoop MapReduce engine.)

MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data. This is critical: a revolution in parallel software development can only be achieved by a broad base of enthusiastic, productive programmers. The new combined platforms for data parallelism expand the options for these programmers and should foster synergies between the SQL and MapReduce communities. Longer term, these Big Data approaches to parallelism might provide the key to keeping other sectors of the software industry on track with Moore’s Law.

Joe Hellerstein is a professor of Computer Science at the University of California Berkeley and has written a white paper with more detail on this topic.

Slide courtesy of Green Plum