How Yahoo, Facebook, Amazon & Google Think About Big Data

Collectively, Yahoo (s yhoo), Facebook, Amazon (s amzn) and Google (s goog) are rewriting the handbook for big data. Startups intending to reach these proportions must also change their thinking about data, and enterprises need this model for internal deployments as a way to retain an economic edge. The four leading web giants have designed systems from scratch, evidence that workloads have shifted, business models are different, and the economics have changed, all demanding a new approach.

Yahoo revealed a few weeks ago how it approaches unstructured data on an Internet scale with MObStor, the technology that “grew out of Yahoo Photos” but now serves the unstructured storage needs across the company. Earlier this year, Facebook unveiled Haystack, its solution to managing its growing photo collection (which could reach 100 billion photos in 2009 if it continues with current growth rates). In 2007, Amazon outlined Dynamo, an “incrementally scalable, highly available key-value storage system.” All of these were predated by The Google File System, presented as a research paper in October 2003.

While no two of these systems are exactly alike, together they represent a complete change from traditional file systems and data stores. The GFS authors note that their design “reflects a marked departure from some earlier file system assumptions,” causing them to “re-examine traditional choices and explore radically different design points.” These are not the systems we once knew.

Since MObStor is the new kid on the block, at least judging by when details were released, let’s take a look at some of its standout characteristics:

  • Designed for petabyte-scale content that is site-generated, partner-generated, or user-generated
  • Handles tens of thousands of page views every second
  • Stores unstructured objects, mostly images, videos, CSS, and JavaScript libraries
  • Reads dominate writes (most data is WORM: write-once read-many)
  • Requires only a low level of consistency
  • Designed to scale quickly and efficiently
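
To make the write-once, read-many pattern concrete, here is a minimal sketch in Python. It is not MObStor's interface, which Yahoo has not published in this form. The `WormStore` class, its `put` and `get` methods, and the in-memory dictionary are all hypothetical, chosen only to illustrate an object store where each key is written a single time and then read repeatedly.

```python
class WormStore:
    """Toy in-memory write-once, read-many (WORM) object store.

    Hypothetical illustration only; not the API of MObStor or any real system.
    """

    def __init__(self):
        self._objects = {}  # key -> immutable bytes
        self.reads = 0      # simple counters to show the read-heavy pattern
        self.writes = 0

    def put(self, key: str, data: bytes) -> None:
        """Store an object once; refuse to overwrite an existing key."""
        if key in self._objects:
            raise ValueError(f"object {key!r} was already written")
        self._objects[key] = bytes(data)
        self.writes += 1

    def get(self, key: str) -> bytes:
        """Return the stored bytes for a key (raises KeyError if missing)."""
        self.reads += 1
        return self._objects[key]
```

Because an object never changes after it is written, any copy of it is as good as any other, which is one reason a store like this can get by with a low level of consistency and lean on caching and replication for reads.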

These capabilities ensure that Yahoo can maintain its ability to store and monetize content effectively, and they are a far cry from solutions developed just 5-10 years ago. The scale, load, file types, read/write pattern, and consistency requirements represent another world compared with conventional enterprise solutions.

Perhaps as part of a migration effort, Yahoo’s MObStor incorporates existing storage systems, such as NAS filers. This makes sense for Yahoo, which over the years has been one of NetApp’s largest customers. Facebook has jettisoned any attachment to storage devices other than commodity servers with internal drives, at least according to Facebook’s description of Haystack and its engineering blog post. And Amazon and Google appear to have made this all-commodity move long ago.

The telling shift is the overwhelming focus on smart software on inexpensive servers. This is not how storage industry giants like EMC (s emc), IBM (s ibm), HDS and NetApp (s ntap) were born. But if the advance of Internet computing continues, the Goliath web properties will provide a crystal ball into how we will more broadly handle unstructured data on an Internet scale. Startups reliant on big data for their business have little choice but to innovate as well, finding ways to accelerate time to market and maintain outstanding service. Enterprises handling big data will need to modify their approach, too; otherwise they leave the door open to competitors that will take advantage of these cloud infrastructure economics.