Under the covers of eBay’s big data operation

For online auction powerhouse eBay, big data is serious business. The company has 100 million active users globally and 300 million live listings at any time (and it archives them all), receives 2 billion page views daily, and handles 250 million search queries and 75 billion database calls a day. How does eBay make sense of all this activity? With Hadoop, of course.

What a customer (or engineer) wants

Hugh Williams is VP of experience, search and platforms at eBay. His team is responsible for the entire eBay experience from the moment users hit the site until the moment they make a purchase, from code to data center automation to building new picture-hosting platforms. If it has to do with driving traffic to eBay and improving the customer experience, Williams’ team builds it. But in order to know what to build and how to build it, the team needs insight into what customers want and what they’re doing.

To figure this out, eBay first has to give its analysts and engineers the tools they want. It does this by operating a two-pronged big data attack consisting of a massive Teradata data warehouse and a fast-growing Hadoop environment. Financial analysts like SQL and more of a WYSIWYG experience, Williams said, which is why Teradata is so important. However, the majority of his engineers love Hadoop, which stores and processes unstructured data such as server logs, click-throughs and search queries, and they make “enormous use” of it.

Huge data

Whichever one you’re talking about, Williams says eBay’s traffic volumes produce huge data, not just big data. In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes, equivalent to about 266 years’ worth of HD video) within a year. Its Hadoop environment currently stores between 9 and 10 petabytes, according to Williams, and is always growing. In fact, the Hadoop environment doubled in size in the past year, in part from more user data streaming in and in part from analysts running lots of Hadoop jobs and creating new, larger data sets that also remain in the system.
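For a sense of what that comparison implies, here is a rough back-of-the-envelope check in Python. It assumes decimal units (1 petabyte = 1,000 terabytes, as the figure above does) and a 365.25-day year; the function name is purely illustrative. Under those assumptions, 20 petabytes spread over 266 years works out to about 19 megabits per second, which is in the range of a typical HD video stream.

```python
# Back-of-the-envelope check of the "20 petabytes = 266 years of HD video" figure.
# Assumptions (not from the article): decimal units and a 365.25-day year.
PETABYTE = 10**15                      # bytes in one decimal petabyte
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year length in seconds


def implied_bitrate_mbps(petabytes, years):
    """Return the video bitrate (megabits/second) that fills `petabytes` in `years`."""
    total_bits = petabytes * PETABYTE * 8
    seconds = years * SECONDS_PER_YEAR
    return total_bits / seconds / 1e6


terabytes = 20 * 1000                  # 20 petabytes expressed in terabytes
mbps = implied_bitrate_mbps(20, 266)
print(terabytes, "TB, about", round(mbps, 1), "Mbit/s")
```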

“What we really use Hadoop for is to understand our customers and their needs,” Williams said. This happens both at a broad scale (say, improving the accuracy of its search engine) and more narrowly, in building specific features the data suggests customers would want. For example, Williams explained, Hadoop has proven helpful in deciphering patterns of misspelled words, so now eBay’s search engine knows to look instead for an actual word or product when users type certain queries incorrectly. Between those broad improvements and narrow data-driven features, Williams said, Hadoop helps eBay learn what sets it apart and how it can become more distinctive by letting his team churn through those petabytes of unstructured data to uncover trends.
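The article does not describe how eBay actually mines misspellings, but the general idea can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the log records, the 30-second window, the similarity threshold and all function names are hypothetical. It treats a query that a user quickly retypes into a very similar query as a likely correction, then counts those pairs in a map-then-reduce style, the same shape a Hadoop job would take at much larger scale.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical search-log records: (session_id, timestamp_seconds, query).
LOG = [
    ("s1", 0, "ipohne case"),
    ("s1", 8, "iphone case"),
    ("s2", 0, "ipohne case"),
    ("s2", 5, "iphone case"),
    ("s3", 0, "lego"),
    ("s3", 400, "camera"),
    ("s4", 0, "ipohne case"),
    ("s4", 4, "iphone cases"),
]


def map_rewrites(records, max_gap=30, min_similarity=0.8):
    """Map step: emit ((typed, retyped), 1) for quick, similar query rewrites."""
    by_session = defaultdict(list)
    for session, ts, query in records:
        by_session[session].append((ts, query))
    for events in by_session.values():
        events.sort()
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            similar = SequenceMatcher(None, q1, q2).ratio() >= min_similarity
            if q1 != q2 and t2 - t1 <= max_gap and similar:
                yield (q1, q2), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (typed, retyped) pair."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def best_corrections(counts, min_support=2):
    """Keep the most frequent rewrite per typed query, if seen often enough."""
    best = {}
    for (typed, fixed), n in counts.most_common():
        if n >= min_support and typed not in best:
            best[typed] = fixed
    return best


corrections = best_corrections(reduce_counts(map_rewrites(LOG)))
print(corrections)
```

Requiring a minimum number of supporting sessions is one simple way to keep a single user's odd rewrite from becoming a site-wide correction; a production system would need far more care than this toy does.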

More than MapReduce

Beyond Hadoop’s sweet spot as a batch-processing engine using its native MapReduce framework (i.e., processing large data sets), Williams said eBay is also expanding its Hadoop usage rather heavily into HBase, the NoSQL database that is also an Apache Software Foundation project and runs on top of the Hadoop Distributed File System. HDFS, the default storage layer for Hadoop, also serves as the storage layer for HBase, which doesn’t process data the way MapReduce does but lets users quickly read from and write to large unstructured data sets.
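The difference between the two access patterns can be shown with a toy Python sketch. This is not the Hadoop or HBase API: `map_reduce` and `ToyKeyValueTable` are hypothetical in-memory stand-ins, and the click records are invented. The first function scans every record in a batch, groups values by key and reduces each group; the table instead keeps rows sorted by key so a single row can be written, read or range-scanned on its own, which is the style of access HBase offers.

```python
from bisect import bisect_left, insort
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Batch style: scan every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


class ToyKeyValueTable:
    """Random-access style: rows sorted by key, read or written one at a time."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


# Invented click events: (item_id, event_type).
clicks = [("item42", "view"), ("item42", "bid"), ("item7", "view")]

# Batch job: count events per item across the whole data set.
totals = map_reduce(clicks, lambda r: [(r[0], 1)], lambda key, values: sum(values))

# Serving side: store the results so any single row can be looked up quickly.
table = ToyKeyValueTable()
for item, count in totals.items():
    table.put(item, "events", count)

print(table.get("item42"))
```

A common pattern, and the one sketched here, is to use the batch job to compute results and the key-value table to serve them; whether eBay wires the two together this way is not something the article states.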

HBase is already a piece of eBay’s new search engine, and Williams said there are few sites using it in production at eBay’s scale. Facebook is another site already making major use of HBase. Williams said HBase is fantastic, but it’s also the area within the Hadoop ecosystem where he’d like to see the most improvement. It’s fundamentally real-time, he explained, which is great, but eBay had to do a lot of work to make HBase scale and to make it fault-tolerant. Building a self-healing system out of Hadoop subprojects was very challenging.

More broadly, Williams is excited about NoSQL, which refers to non-relational database technologies, as a way to handle eBay’s high volumes of data that aren’t necessarily a good fit for traditional databases. “Cassandra and MongoDB are other great examples of the latest, innovative technologies for managing large data sets that we’re excited about at eBay,” he said.

Open source all the way … probably

For all its benefits, Williams acknowledges Hadoop can be a tough technology to learn, but any blood, sweat and tears are worth it to ensure his team really understands the data platform that underpins so much of eBay. “[T]o put it to its full potential, we have to be experts in it,” Williams said, describing a level of expertise that can really only come via open-source software that lets engineers “roll up [their] sleeves and [get] into the source code.”

Still, any sort of decision is the result of collaboration between the business team and the technology team, so Williams says he keeps an open mind as to how eBay’s big data environment might evolve. Right now it’s Teradata and Hadoop, but “I can imagine that landscape changing,” Williams said.

In October, we covered comments from eBay Senior Director of E-commerce Darren Bruntz, who said he would like to move to a single data platform and that he’d like to see “more focus and energy” from the Hadoop community. Asked at the time about whether such a platform is possible, Teradata Labs President Scott Gnau told me it’s not possible now — at least if you want all the advanced SQL analysis features of a product like Teradata for structured data — but that it might be in the future.

And although Teradata now has a product in Aster Data Systems that is something of a replacement for Hadoop, Gnau said “Hadoop or son of Hadoop or something else” will always be a big piece of the big data space because it has so much momentum and such a sweet spot around search and batch processing of unstructured data.

EBay’s Williams, though, maintains the sentiment of his team members will remain a major factor in any decision regarding the company’s data platform. “For a new platform to succeed, our technologists would have to be passionate about the platform, and the platform would have to enable us to innovate faster to build products for eBay’s customers,” he said. “If a new technology helps us achieve that goal, we would certainly evaluate the benefits.”
