How Etsy handcrafted a big data strategy

Etsy, the e-commerce site specializing in handmade and vintage goods, has grown to more than 11 million users, drawing 25 million unique visitors and 1.1 billion page views per month, and it’s generating the data volumes to match. Today, for example, Etsy detailed some of its work with Splunk to manage and analyze up to a terabyte of machine data per day.

This is a huge increase, about 200x, since 2007, when Etsy signed on with Splunk and was capturing a mere 5GB of data per day. Etsy’s usage of Splunk has evolved, too, from a focus on troubleshooting (i.e., noticing a problem and tracking down the cause) to a focus on what Splunk calls “operational intelligence.” Because users can search and analyze server logs and other machine-generated data pretty much as it streams in, they can, for example, monitor traffic patterns in real time to uncover ongoing issues that might be causing visitors to drop off pages or leave the site.

Splunk isn’t Etsy’s only big data solution; the company is also a big Hadoop user. Etsy runs dozens of Hadoop workflows each night on Amazon’s cloud-based Elastic MapReduce Hadoop service. According to a very detailed (and technical) presentation explaining Etsy’s Hadoop usage, available as both a PDF and a video, it ran nearly 5,000 Hadoop jobs in May 2011 to analyze both internal operational data and external activity such as customer behavior. Etsy actually uses MATLAB within its Elastic MapReduce clusters to analyze the data and perform predictive analytics. The presentation also highlights Etsy’s experimentation with Tableau to visually display the results of its internal data after it has been cleaned up by Hadoop.

At the product level, Hadoop powers Etsy’s Taste Test feature that helps the site determine what products best suit a particular customer’s tastes. It also helps with a feature that analyzes Facebook profile information in order to let visitors shop for their friends. At Hadoop World next week, an Etsy engineer will discuss how Etsy uses Hadoop to improve its search recommendation engine.

Operationally, Hadoop helps Etsy analyze server logs to figure out what customers are doing on the site and how they’re accessing it.
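To make that kind of log analysis concrete, here is a minimal sketch in plain Python of the map-and-reduce pattern that Hadoop jobs follow. It is not Etsy's code: the log format, the sample lines, the crude "Mobile" user-agent check and the function names (`map_line`, `reduce_counts`, `analyze`) are all assumptions made up for illustration, and a real job would run across a cluster rather than in a single process.

```python
import re
from collections import Counter

# Assumed log format: an Apache-style "combined" access log line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample data, including one malformed line that should be skipped.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Nov/2011:10:00:00 +0000] "GET /listing/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (iPhone) Mobile Safari"',
    '10.0.0.2 - - [01/Nov/2011:10:00:01 +0000] "GET /listing/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox"',
    '10.0.0.3 - - [01/Nov/2011:10:00:02 +0000] "GET /search HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Macintosh) Chrome"',
    'this line is not a valid log entry',
]


def map_line(line):
    """Map step: emit (key, 1) pairs for the page requested and the device type."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return []  # skip lines that do not parse
    # Crude, assumed heuristic for how the visitor is accessing the site.
    device = "mobile" if "Mobile" in match.group("agent") else "desktop"
    return [(("path", match.group("path")), 1), (("device", device), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def analyze(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    for key, count in sorted(analyze(SAMPLE_LINES).items()):
        print(key, count)
```

The "path" counts show what visitors are doing, and the "device" counts show how they are accessing the site. The two steps are kept separate because that split is what lets a framework like Hadoop spread the work across many machines.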

Etsy, like so many other companies, especially on the web, is both drowning in data and trying to leverage it. That’s why we’re seeing such a huge focus on Hadoop among all varieties of enterprise data-management vendors, and why you can’t escape the omnipresent references to “big data.” As Etsy shows, though, dealing with big data requires a multi-pronged approach that goes well beyond simply deploying a Hadoop cluster and watching the insights pour in.