Supercomputers, Hadoop, MapReduce and the Return to a Few Big Computers

Yahoo announced yesterday it would collaborate with CRL to make supercomputing resources available to researchers in India. The announcement comes on the heels of Yahoo’s Feb. 19 claim to have the world’s largest Hadoop-based application now that it’s moved the search webmap to the Hadoop framework.

There are a number of Big Computing problems today. In addition to Internet search, cryptography, genomics, meteorology and financial modeling all require huge computing resources. In contrast to purpose-built supercomputers like IBM’s Blue Gene, many of today’s biggest computers layer a framework atop commodity machines.

Google has MapReduce and the Google File System. Yahoo now uses Apache Hadoop. The SETI@home screensaver was a sort of supercomputer. And hacker botnets, such as Storm, may have millions of innocent nodes ready to work on large tasks. Big Computing is still big; it’s just built from lots of cheap pieces.

But supercomputing is heating up, driven by two related trends: On-demand computing makes it easy to build a supercomputer, if only for a short while; and software as a service means fewer instances of applications serving millions of users from a few machines. What happens next is simple economics.

Frameworks like Hadoop scale extremely well. But they still need computers. With services like Amazon’s EC2 and S3, however, those computers can be rented by the hour for large tasks. Derek Gottfrid of the New York Times used Hadoop and Amazon to create 11 million PDF documents. Combine on-demand computing with a framework to scale applications and you get true utility computing. With Sun, IBM, Savvis and others introducing on-demand offerings, we’ll soon see everyone from enterprises to startups to individual hackers buying computing instead of computers.
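
For readers who haven’t seen the programming model behind these frameworks, here is a rough, single-machine sketch of the map, shuffle and reduce steps, using a word count as the task. It is not Hadoop’s or Google’s API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a thread pool stands in for the cluster of commodity machines a real framework would spread the work across.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents, workers=4):
    # Hypothetical stand-in for a cluster: each worker thread maps one split.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big computers", "cheap computers"]))
```

The point of the model is that the map and reduce steps never share state, so a framework is free to run them on as many rented machines as the job needs.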

At the same time, Software-as-a-Service models are thriving. Companies like RightNow and Taleo have replaced enterprise applications with web-based alternatives and taken away deployment and management headaches in the process. To stay alive, traditional software companies (think Oracle and Microsoft) need to change their licensing models from per-processor to per-seat or per-task. Once they do this, simple economies of scale dictate that they’ll run these applications in the cloud, on behalf of their clients. And when you’ve got that many users for an application, it’s time to run it as a supercomputing cluster.

Maybe we’ll only need a few big computers, after all. And, of course, billions of portable devices to connect to them.
