The lab that created Spark wants to speed up everything, including cures for cancer

Unknown to most of the world, the University of California, Berkeley's AMPLab has already left an indelible mark on the world of information technology, and even the web. But we haven't yet experienced the full impact of the group, which launched in 2011 under a six-year grant. Not even close.

AMPLab (the AMP is short for algorithms, machines and people) is best known for pushing projects such as Mesos and Spark (which were created in 2009 as part of a predecessor group called RAD Lab) into the mainstream. The former powers much of the resource management and automation for sites such as Twitter and Airbnb, and looks like it will play a significant role in the Google-ization of data centers as container-based application architectures take off.

The latter, Spark, has taken the data-processing world by storm as a faster, easier and more flexible framework for a wide variety of tasks. Spark's creators and backers promise it can tackle everything from batch processing to stream processing, and from SQL queries to machine learning jobs, without the performance overhead and complexity of tools such as MapReduce and Storm. Even Hadoop vendors, which for business reasons are trying not to oversell Spark's promise, are falling over themselves to support it.

Ion Stoica
Ion Stoica explains the AMPLab philosophy. Photo: Derrick Harris / Gigaom

In a recent visit to the AMPLab offices, Ion Stoica, one of AMPLab's directors and the founder and CEO of Spark startup Databricks, explained what's on the horizon for Spark. He's particularly excited about Tachyon, an in-memory distributed file system that isn't yet part of the official Apache Spark distribution but will become an Apache project soon. From a commercial perspective, SQL engines such as Spark SQL and BlinkDB show promise as potential — albeit still-under-development — alternatives to many of the current SQL-on-Hadoop offerings from companies such as Cloudera, Hortonworks and even Oracle.

However, Stoica said, with the foundational computing, storage and analytic pieces pretty much in place, the future of AMPLab will probably focus more on applications. Some of those projects will make it easier to build and serve advanced applications atop that data-processing platform. Others will be applications themselves, such as tools AMPLab has created in partnership with the medical community that are already saving lives.

Making big data really small

Because Spark relies on storing data in memory in order to deliver such fast computations, the AMPLab team is understandably focused on ways to store more of it in a smaller footprint. Memory provides fast data access, but it’s also expensive.

One of the lab’s most-promising bets is on a project called Succinct, a new type of in-memory database (built atop Tachyon) that’s able to run relatively complex queries on compressed data without first decompressing the data. This results in performance similar to a NoSQL key-value store, but capabilities more commonly found in more-robust systems. Succinct does more than “put,” “get” and “delete,” AMPLab researcher Rachit Agarwal said — it also allows for counts, substring queries, range queries and more.

Normally, this type of functionality could be achieved using a secondary index, but those are much larger than the original datasets and so don’t fit easily into memory. Succinct compresses the original file, “but in a way that is also an index,” Agarwal explained.

In tests of a prototype system, Succinct blew away the popular NoSQL stores MongoDB, Cassandra and HyperDex in terms of storage efficiency. Storing a 100-gigabyte dataset in memory with those systems required spreading the load across 16 machines with 64 gigabytes of RAM apiece, because their secondary indices were 10 times the size of the original data.

Succinct fit 123 gigabytes of raw data onto a single 64-gigabyte machine, Agarwal said.

Rachit Agarwal
Rachit Agarwal explains how Succinct works. Photo: Derrick Harris / Gigaom

So far, he said, the team hasn’t pushed the limits of Succinct’s scalability too much, stopping at about a terabyte of data across 16 nodes. But the evidence suggests it can scale reasonably well, even if it doesn’t actually need to.

“What you could do previously with 1,000 machines, Succinct allows you to do in 100 machines,” Agarwal noted.

But he's trying not to put the cart before the horse by pushing Succinct as a viable alternative to anything just yet. For one, there's still a lot of work to be done designing the system around the compression and indexing techniques. And before Succinct can run as an end-to-end system on its own, it first needs to be ported to Spark, which is already a functioning distributed system.

What's more, Succinct doesn't provide much of a performance boost right now over existing key-value stores. Agarwal said it's slightly faster on some queries and slightly slower on others. If money is no object, or if someone's data volumes are small enough to begin with, running a more-mature database like MongoDB in memory probably makes more sense.

However, Agarwal said Succinct isn’t stopping at the handful of query types it already supports. In fact, the team is working on a SQL interface that he expects will be ready within a year. “You actually [could] execute SQL queries directly on Succinct,” he explained. “… For the user, everything looks opaque. For him, it doesn’t matter whether there’s compression or not.”

The family of projects currently under development within AMPLab. Source: AMPLab

Machine learning for the masses

Another big area of focus for AMPLab is machine learning — specifically, simplifying the process of actually building usable models. One of the better-known projects in this space is GraphX, an attempt to build a graph-processing engine for Spark that would let it natively run graph applications and pipelines. It's similar in scope to what GraphLab is trying to do with its new Create offering, except that GraphX is open source and designed to run on Spark clusters.

Joseph Gonzalez, a researcher on the GraphX team who previously co-founded GraphLab while at the University of Washington, thinks GraphX will really hit its stride thanks to a related project called Velox. That project aims to move machine learning models out of big data servers and onto web servers, where the models can learn iteratively as new data flows in. This way, for example, recommendations would update as shoppers perused a site, rather than daily or weekly after the data team retrains the model.
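The serving-plus-learning pattern Velox is after can be sketched in a few lines. This is a hedged illustration of online model updating in general, not Velox's API: a simple logistic model lives in the serving process and takes one stochastic-gradient step per observed interaction, so its recommendations shift within a single browsing session.

```python
import math

class OnlineRecommender:
    """Illustrative sketch (hypothetical, not Velox): a model that is
    served and updated in the same process. Each observed interaction
    nudges the weights immediately via one SGD step on logistic loss."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def score(self, item_features):
        # Predicted probability that the user clicks this item.
        z = sum(w * x for w, x in zip(self.w, item_features))
        return 1.0 / (1.0 + math.exp(-z))

    def observe(self, item_features, clicked: int):
        # One stochastic-gradient step: move toward the observed label.
        err = clicked - self.score(item_features)
        self.w = [w + self.lr * err * x
                  for w, x in zip(self.w, item_features)]

model = OnlineRecommender(n_features=2)
# Simulated session: the user keeps clicking items with feature [1, 0]
# and ignoring items with feature [0, 1].
for _ in range(50):
    model.observe([1.0, 0.0], clicked=1)
    model.observe([0.0, 1.0], clicked=0)
print(model.score([1.0, 0.0]) > model.score([0.0, 1.0]))  # True
```

The batch-trained alternative would recompute `model` overnight from logged data; the online version trades some modeling rigor for immediacy, which is exactly the serving-side gap Gonzalez describes.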

Data scientists and researchers are great at building models, he said, but “getting the model to serving is a less-studied problem. … It’s hard to sell a model.”

Ben Recht
Ben Recht wants to simplify the process of machine learning. Photo: Derrick Harris / Gigaom

Higher up the machine learning food chain, AMPLab faculty member Ben Recht wants to make it easier to build and deploy models, too, especially for complex tasks across lots of data. Now that we have such powerful computers and so much training data, there’s an opportunity to test out ideas conceived when we had 2,000 data points and make them work across 2 billion points.

“I think most of the problems in machine learning right now are just issues in scale,” he said. “… It’s not a simple problem to scale these algorithms.”

Recht, who helped build the University of Wisconsin project HOGWILD! that underpins Microsoft’s recent deep learning advances under the Project Adam moniker, thinks the answer is in making models modular. In areas like computer vision, he explained, the steps are understood well enough but each one — acquiring data, normalizing it, selecting features, et cetera — has its own litany of options, decisions and tuning that needs to be done. Anyone looking for guidance in research papers might find they’ve stumbled across an approach tailor-made for a specific dataset that doesn’t apply anywhere else.

“It becomes a bit of a nightmare,” he said. “… What we’d like to make it easy to do is to declare what you want to do and then deploy it on 120 machines.”

Breaking down just one step in the object recognition pipeline, which might require tuning a huge number of parameters. Source: Ben Recht

A project like this one, dubbed ML Pipelines, will never take the expertise requirement out of advanced machine learning and artificial intelligence, but Recht hopes it can open the door for more people to do more things. He knows that making that happen will require getting the mathematicians and the distributed-systems experts to speak the same language, at least to some degree.
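The modular-pipeline idea Recht describes can be sketched minimally. The stage names and composition function below are hypothetical, not the ML Pipelines API: the point is that each step (normalize, featurize, and so on) is a small, swappable unit, so tuning one stage means replacing one function rather than rewriting the whole script.

```python
# A minimal sketch of composable pipeline stages (hypothetical API,
# not the actual ML Pipelines project).

def normalize(xs):
    # Rescale values into [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def featurize(xs):
    # A toy feature map: each value paired with its square.
    return [(x, x * x) for x in xs]

def make_pipeline(*stages):
    # Declare the steps once; running them is just function composition.
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

pipeline = make_pipeline(normalize, featurize)
print(pipeline([2.0, 4.0, 6.0]))  # [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
```

In a real system each stage would also carry its own tunable parameters, and the "deploy it on 120 machines" part would fall to the execution engine underneath, not to the person declaring the pipeline.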

“I want to be able to do the things I do in numerical Python on bigger data,” he said, which means the people running those systems probably want to be able to experiment with machine learning, too.

RISC, RAID and now cancer

And then there's David Patterson, the UC Berkeley professor and AMPLab faculty member who led the creation of the RISC processor architecture (the foundation of the Sun/Oracle SPARC processor) and the RAID storage architecture in the 1980s, and who for the past few years has focused on bringing big data systems to bear on difficult diseases. He and Spark co-creator Matei Zaharia built a DNA sequence aligner, called SNAP, that Patterson boasts "is the world's fastest aligner today."

SNAP is built on Spark, and it's already saving lives. Patterson spoke about a recent case in which a boy in Wisconsin was suffering from a mysterious illness that left him trapped in a coma for weeks with brain swelling. He was sent to the University of California, San Francisco, where doctors worked with Patterson and AMPLab to process a sample of his DNA using SNAP. In about 90 minutes, the computer had isolated all the human genetic material, leaving just the 0.02 percent that wasn't human. It belonged to a rare bacterium, and the infection was treated immediately.

“How do you find the needle in the haystack?” Patterson asked. “Get rid of all the hay.”
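The "get rid of all the hay" step can be sketched as a subtractive filter. This is an illustration of the idea, not SNAP itself: reads that match a known human reference are discarded, and whatever survives is candidate pathogen material. Real aligners use indexed, approximate matching against a genome billions of bases long; exact substring lookup stands in for that here.

```python
def filter_nonhuman(reads, human_reference: str):
    """Keep only sequencing reads that do not appear in the reference.
    (Toy stand-in for alignment-based subtraction as done by tools
    like SNAP, which tolerate mismatches and work at genome scale.)"""
    return [r for r in reads if r not in human_reference]

human_ref = "ACGTACGTGGCCTTAA"          # stand-in for the human genome
reads = [
    "ACGTAC",   # matches the reference -> human, discard
    "GGCCTT",   # matches the reference -> human, discard
    "TTTTGA",   # no match -> potential pathogen, keep
]
print(filter_nonhuman(reads, human_ref))  # ['TTTTGA']
```

The surviving reads are the "needle": a tiny fraction of the sample that can then be matched against databases of known pathogens.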

Patterson thinks there’s a natural symbiosis between necessarily conservative doctors and necessarily change-embracing computer scientists that hasn’t always been exploited to its full potential, but the time is right for change. With many fundamental computer science problems already solved, computer scientists who want to make a more direct impact on society are looking at where they can apply the technologies and creative cultures they’ve already developed.

“Suppose we came up with something that saves 100 lives a year,” Patterson said, paraphrasing a colleague who laid out the opportunity in medicine. “That could justify your whole career.”

David Patterson
David Patterson has accomplished a lot, but his latest quest is his biggest yet. Photo: Derrick Harris / Gigaom

He thinks doctors working in the field with real patients are particularly willing to work with folks like him, because they understand better than anyone that time matters when lives are on the line. Patterson said a standard DNA sequencer can take up to 24 hours to run. “It’s surprising for me to see programs where the unit of time is hours,” he said.

But speed isn't the only thing that matters. Patterson's latest target is a type of cancer called acute myeloid leukemia. He and his peers are trying to help doctors fight the disease by analyzing whether, and which, multi-drug treatments could target patients' unique sets of mutations better than traditional treatments, which often target only a single mutation. The team has read some books, developed a basic understanding of the biology and the human toll of the disease, and is ready to unleash some serious computing on it.

“It’s not clear we can do this,” Patterson said, “but it’s such a terrible disease that we’ve got to try.”

Despite some early successes, though, there's much more that could be done if Patterson's powerful software could only get enough data to crunch. AMPLab is working with others, including University of California, Santa Cruz computational biologist David Haussler, to try to open access to cancer genetic data in a way that won't violate privacy expectations.

“It’s one of the cases where, absolutely, if we collect the data together, there’s going to be tremendous progress on these terrible diseases,” Patterson said.

He noted that a new faculty member has ideas about computing on encrypted data that could help bridge the current gap between what researchers want and what regulations allow them to get. Perhaps work another AMPLab member is doing to develop a new genetic data storage format could help.

“The obstacle isn’t the cost,” Patterson said, nor is it the technology. “It’s the policy issues.”