Major investments show promise of big data in biotech

Cloud-based DNA-sequencing specialist DNAnexus has closed a $15 million second round led by Google Ventures and TPG Biotech. Elsewhere, we learned Wednesday that agribusiness giant Monsanto has deployed Cloudant’s NoSQL database as the underpinning of the company’s genomics system. Big data technologies, it seems, can now count biotech, and genomics specifically, among their many killer apps.

Innovation blowing past Moore’s Law

In a recent interview, DNAnexus Co-Founder and CEO Andreas Sundquist explained the opportunity he sees for his company’s services, which include cloud-based storage and processing of DNA-sequencing data. Both the problem and the opportunity facing genomics researchers stem from innovations in the field that are “outpacing Moore’s Law,” he said, which has driven the cost of a DNA profile down to roughly that of any other standard medical test. Soon, everyone will have DNA profiles as part of their medical records.

This “will change the way medicine is done” and could grow into a hundred-billion-dollar market, he explained, but it also will result in lots of data generation: hundreds of gigabytes per person. Whereas the high-end research facilities might have access to high-performance computing and storage necessary to perform DNA sequencing, the hospitals that will now be doing those analyses on a regular basis certainly will not.

And that’s where Sundquist thinks a service such as DNAnexus becomes indispensable. Leveraging the storage capacity and processing power of Amazon Web Services and Google Cloud Storage, his company provides sequencing facilities and researchers with the infrastructure and the software to run the analyses and display the results. Because of its investment relationship with Google and the sheer scale of its operations on AWS, Sundquist said DNAnexus actually works very closely with both companies.

However, unlike some areas where analytics are primarily focused on algorithms because access to storage is just a matter of buying more commodity gear, capacity is still a big issue in genomics. By using the cloud, Sundquist said DNAnexus lets customers share and collaborate on data without actually transferring hundreds to thousands of gigabytes.
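To put the transfer problem in rough numbers, here is a back-of-the-envelope sketch in Python. The 100 Mbit/s link speed is purely an assumption for illustration, not a figure from Sundquist or DNAnexus, and the helper name transfer_hours is hypothetical.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Estimate hours needed to move size_gb gigabytes over a link_mbps link.

    Uses decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits) and ignores
    protocol overhead, so real transfers would take longer.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed link speed; substitute the bandwidth actually available.
    for size_gb in (200, 1000, 5000):
        hours = transfer_hours(size_gb, link_mbps=100)
        print(f"{size_gb} GB at 100 Mbit/s: about {hours:.1f} hours")
```

Under that assumed bandwidth, a single 1,000 GB dataset would tie up the link for most of a day, which is the kind of cost that keeping the data in one shared place avoids.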

It also helps with DNAnexus’ latest undertaking: hosting the Sequence Read Archive (formerly the Short Read Archive). The comprehensive set of sequencing data was hosted by the National Center for Biotechnology Information but was slated to be shut down because of budget cuts.

That database is currently at about 400TB, but Sundquist says that’s just the tip of the iceberg. He said DNAnexus has actually “blown past” analyzing data at the Hadoop/MapReduce scale and is now focusing on parallelizing computation across 100,000 nodes and scaling its storage infrastructure into the exabyte range.

Genomics isn’t just for humans

Say what you will about Monsanto’s business in genetically engineering food, but it does involve high science on par with DNA sequencing in humans and presents many of the same data problems. That’s why Monsanto deployed Cloudant’s BigCouch database as the focal point of its massively distributed genomics system.

According to Mike Miller, co-founder and chief scientist of Boston-based Cloudant, NoSQL offerings such as his company’s CouchDB-based product are actually ideal for the genomics space because they allow for cheap, horizontal scalability and high throughput. Cloudant is particularly well suited, he explained, because of the incremental MapReduce engine built into BigCouch.

Miller compares it to Google Percolator, the data-processing framework Google recently deployed to replace its legacy MapReduce system. Whereas traditional MapReduce implementations such as that found in Hadoop are designed for batch processing, Percolator and Cloudant’s MapReduce implementation enable near-real-time analysis because they let users process data as it enters the system and update the dataset accordingly.
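The difference between batch and incremental processing can be illustrated with a small Python sketch. This is a simplified illustration of the general idea only, not how BigCouch or Percolator is actually implemented; the document fields "organism" and "reads" and the class name IncrementalView are hypothetical, while "_id" follows the CouchDB convention for document identifiers.

```python
from collections import defaultdict


def map_doc(doc):
    """Hypothetical map function: emit (organism, read count) for one document."""
    yield doc["organism"], doc["reads"]


def batch_totals(docs):
    """Batch style: rescan every document each time an answer is needed."""
    totals = defaultdict(int)
    for doc in docs:
        for key, value in map_doc(doc):
            totals[key] += value
    return dict(totals)


class IncrementalView:
    """Incremental style: keep running totals and touch only what changed."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)  # number of emitted pairs per key
        self.emitted = {}               # doc id -> pairs emitted by its last version

    def upsert(self, doc):
        # Undo the contribution of the previous version of this document, if any.
        for key, value in self.emitted.get(doc["_id"], []):
            self.totals[key] -= value
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.totals[key]
                del self.counts[key]
        # Apply the contribution of the new version.
        pairs = list(map_doc(doc))
        for key, value in pairs:
            self.totals[key] += value
            self.counts[key] += 1
        self.emitted[doc["_id"]] = pairs

    def query(self):
        # Answering a query needs no rescan; the totals are already current.
        return dict(self.totals)
```

In this sketch, each call to upsert does work proportional to the one document that changed, whereas batch_totals must reread the whole dataset to reflect the same change. That is the property that lets results stay close to real time as data arrives.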

This is important for Monsanto because BigCouch isn’t just an analytics system, but an operational database serving a wide variety of users. Some of those users are not data scientists but consumers of the data; they need up-to-date information and must rely on the system to provide it.

Ultimately, Miller paints a picture of DNA sequencing very similar to the one painted by DNAnexus’ Sundquist. Innovation is rampant, but data growth is outpacing the ability to analyze it, making faster, cheaper and more scalable data systems integral to leveling the playing field. If such systems can help bridge the gap between the data and the algorithms to analyze it, Miller says, “We’re going to see things in the space beyond our wildest dreams.”

Feature image courtesy of Flickr user micahb37.