As genomics data approaches exascale, cloud could save the day

Life is about to get a lot easier for medical researchers, but a lot more difficult for companies trying to make a buck selling them tools to store and analyze genomic data. When the Human Genome Project successfully concluded in 2003, it had taken 13 years to reach its goal of fully sequencing the human genome. Earlier this month, two firms, Life Technologies and Illumina, announced instruments that can do the same thing in a day, one of them for only $1,000. That’s likely going to mean a lot of data.

1TB times 1 million equals …

How much data is anybody’s guess, but the exponential increases in productivity suggest it will be in the exabyte range within a few years. A fully sequenced human genome results in about 100GB of raw data, although DNAnexus Founder and CEO Andreas Sundquist told me that volume increases to about 1TB by the time the genome has been analyzed. He also says we’re on pace to have 1 million genomes sequenced within the next two years. If that holds true, there will be approximately 1 million terabytes (or 1,000 petabytes, or 1 exabyte) of genome data floating around by 2014.
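The back-of-the-envelope math above is easy to check. Here is a minimal sketch using decimal (SI) units and the per-genome figures Sundquist cited; the function and variable names are illustrative, not anything DNAnexus uses.

```python
# Decimal (SI) storage units, in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18


def total_bytes(genomes: int, bytes_per_genome: int) -> int:
    """Total storage for a number of genomes of a given per-genome size."""
    return genomes * bytes_per_genome


genomes = 1_000_000                       # Sundquist's projected genome count
raw = total_bytes(genomes, 100 * GB)      # raw sequencer output only
analyzed = total_bytes(genomes, 1 * TB)   # volume after analysis

print(f"raw:      {raw / PB:,.0f} PB")
print(f"analyzed: {analyzed / PB:,.0f} PB = {analyzed / EB:,.0f} EB")
```

Even the raw data alone would come to about 100 petabytes under these figures; the exabyte number applies once every genome has been analyzed.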

A few years ago, Complete Genomics publicly announced its plan to sequence a million genomes by 2014, but it has fallen woefully behind schedule. It was hoping to do 50,000 genomes in 2011, but finished the year at only 3,000.

Life's Benchtop Ion Proton Sequencer

However, sequencing instruments are evolving in a manner similar to mainstream computers, which is to say they’re always getting faster and more affordable. Whereas sequencers used to cost more than half a million dollars and take up a room, Life’s genome-in-a-day instrument, the one that claims a $1,000-per-genome price point, sits on a desk and will cost only $149,000 when it’s available later this year. Upgrading to Illumina’s new instrument from the previous model costs only $50,000.

The fast rate of improvement comes from genomics’ own version of Moore’s Law, Sundquist said: data throughput and cost both improve tenfold every 18 months. When Life rival Illumina set a world record for sequencing a genome in February 2008, it took “less than four weeks at a cost of about $100,000.” At this rate, we’ll have $100 genome sequencing by 2014.
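For anyone who wants to play with that curve, here is a rough sketch of the tenfold-every-18-months rule. It assumes a smooth exponential decline, which real pricing won't follow exactly, and the function name and defaults are made up for illustration.

```python
def projected_cost(start_cost: float, months_elapsed: float,
                   factor: float = 10.0, period_months: float = 18.0) -> float:
    """Cost after `months_elapsed` if it falls by `factor` every `period_months`."""
    return start_cost / factor ** (months_elapsed / period_months)


# Start from the roughly $100,000 genome of February 2008.
for months in (0, 18, 36, 54):
    cost = projected_cost(100_000, months)
    print(f"{months:2d} months later: about ${cost:,.0f}")
```

Taken literally, the curve reaches $1,000 about three years after February 2008, a little ahead of the instruments announced this month. One more 18-month step from a $1,000 genome gets to $100, which is where the by-2014 estimate comes from.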

Sundquist added that medical systems have tens of thousands of patients queued up for sequencing, many of whom they might start sequencing now that it can be done so quickly and at such a low cost.

Hidden costs: ‘The quest for the $1,000 genome interpretation’

Where things get hairy for IT vendors is figuring out how to make it affordable to store, process and analyze all that data, something Sundquist calls the quest for the $1,000 genome interpretation. It’s still not an inexpensive proposition to buy and maintain a system capable of storing and processing potentially petabytes of data. And if doctors or researchers want to collaborate with colleagues, their facilities’ bandwidth likely won’t cut it for sending even the raw data for a single genome. That’s why many research institutions are connecting to high-speed research networks designed solely to move massive scientific data sets.

As Forbes’ Matthew Herper opined early last year, even though research genomes will soon cost only $1,000 to sequence, it costs a lot more to employ people and pay for software capable of analyzing them. Because research genomes aren’t accurate enough for medical use, they often must be sequenced multiple times. Herper’s ultimate analysis:

I’d think if we’re talking about actual medical use, $10,000 is a more accurate number. Certainly, it is not going to drop below the $2,000 level for a magnetic resonance imaging scan. And once the technology is in use, I think it is possible that the costs will go back up.

So, even if genome sequencing itself becomes less expensive, hospitals and patients will both be paying well over $1,000 for the procedure. Presently, $10,000 is about the going rate from Complete Genomics to sequence, analyze and deliver research results to an individual, although the costs certainly are subject to change if hospitals start performing sequencing workloads themselves.

Cloud computing to the rescue?

Sundquist thinks cloud computing is the answer. His company, DNAnexus, provides a cloud-based platform for storing and analyzing genomics data, something we’ve covered before. “A 100-megabit connection could more than keep up with about a dozen of these machines,” he said, and once the data is in DNAnexus’s cloud platform, institutions no longer have to worry about keeping up with exploding data volumes, sending terabytes of data across the Internet or paying for software licenses. Access is centralized and everything takes place on DNAnexus’s virtual infrastructure.
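To get a feel for the bandwidth claim, here is a quick sketch of how much data a link can move in a day. It assumes an ideal, fully utilized connection and one 100GB raw genome per machine per day; both are assumptions for illustration, since Sundquist didn’t say how much data each instrument actually uploads.

```python
def daily_capacity_gb(link_mbps: float, utilization: float = 1.0) -> float:
    """Gigabytes (decimal) a link can move in 24 hours at a given utilization."""
    bits_per_day = link_mbps * 1e6 * 86_400 * utilization
    return bits_per_day / 8 / 1e9


def machines_supported(link_mbps: float, gb_per_machine_per_day: float) -> float:
    """How many sequencers a link could serve at a given daily upload volume."""
    return daily_capacity_gb(link_mbps) / gb_per_machine_per_day


capacity = daily_capacity_gb(100)        # a 100-megabit connection
machines = machines_supported(100, 100)  # assumed 100 GB per machine per day

print(f"{capacity:,.0f} GB per day")
print(f"about {machines:.1f} machines at 100 GB per machine per day")
```

Under these assumptions the link moves a bit more than a terabyte a day, which is in the same ballpark as the dozen machines Sundquist mentioned. The real number depends on how much each instrument uploads and how busy the connection is.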

Additionally, cloud computing is ideal for spiky workloads, which genome sequencing generally is. A general rule of “cloudonomics” is that the cloud costs more on a per-unit basis, but generally will cost less over time unless it’s being used for a steady workload better suited to an on-premises system.
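That rule of thumb boils down to a simple break-even calculation. The sketch below uses entirely hypothetical hourly rates (a cloud unit priced at 2.5 times an owned one) just to show the shape of the trade-off; none of these numbers come from DNAnexus or any provider.

```python
def cloud_cost(hours_used: float, cloud_rate: float) -> float:
    """Cloud capacity is billed only for the hours actually used."""
    return hours_used * cloud_rate


def on_prem_cost(hours_in_period: float, on_prem_rate: float) -> float:
    """Owned hardware costs money every hour, busy or idle."""
    return hours_in_period * on_prem_rate


def break_even_utilization(on_prem_rate: float, cloud_rate: float) -> float:
    """Fraction of the period above which owning becomes cheaper than renting."""
    return on_prem_rate / cloud_rate


HOURS_PER_MONTH = 720   # 30 days
ON_PREM_RATE = 1.0      # hypothetical cost per machine-hour, owned
CLOUD_RATE = 2.5        # hypothetical cost per machine-hour, rented

spiky = cloud_cost(72, CLOUD_RATE)     # busy 10 percent of the month
steady = cloud_cost(648, CLOUD_RATE)   # busy 90 percent of the month
owned = on_prem_cost(HOURS_PER_MONTH, ON_PREM_RATE)

print(f"spiky cloud: {spiky:.0f}, steady cloud: {steady:.0f}, owned: {owned:.0f}")
print(f"break-even utilization: {break_even_utilization(ON_PREM_RATE, CLOUD_RATE):.0%}")
```

With these made-up rates, the cloud wins as long as the hardware would sit idle more than 60 percent of the time, and loses once the workload turns steady.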

Whether it’s DNAnexus or some other cloud service, Sundquist’s reasoning is sound. As prices for gene sequencing continue to fall, doctors should be increasingly likely to order it, but they’ll be limited by the infrastructure in place to support them. Unless the costs of doing this on-premises come down significantly, the cloud might be the only place where storing and analyzing potentially petabytes per hospital isn’t such a daunting undertaking.

Feature image courtesy of Flickr user Robert Gaal.