Quantcast releases bigger, faster, stronger Hadoop file system

The Quantcast File System is like the Six Million Dollar Man of distributed data stores for Hadoop. Building on the Kosmix Distributed File System (aka CloudStore), an implementation that had largely been written off and forgotten, Quantcast has rebuilt it as QFS: bigger, faster and stronger than the Hadoop Distributed File System most commonly associated with the popular big data platform. Now, QFS is open source and ready for use in the webscale world.

According to Quantcast VP of Research and Development Jim Kelly, the web-audience measurement specialist began working with Hadoop in 2006 and ran into problems almost from the start. While the early problems with HDFS might have been symptoms of its immaturity, the problems soon came to center on the two things Hadoop is supposed to be best at: size and speed. So, in 2008, Quantcast began experimenting with, and actually sponsoring, the Kosmix project.

It turns out that wasn’t a moment too soon. By 2010, after Quantcast began integrating with ad networks, its daily data flow grew into the tens of terabytes. The company turned on QFS as its production Hadoop file system in 2011 and now receives about 40TB a day and processes a whopping 20 petabytes. Kelly said Quantcast has pushed 4 exabytes, or 4 billion gigabytes, through QFS since turning it on.

Faster, yes. Bigger, not so much.

At Quantcast’s scale, the problem with HDFS wasn’t so much its scalability as the sheer size of the cluster required to handle petabyte-scale data stores. HDFS stores three copies of each piece of data to ensure it’s always available, and it tries to make up for the size cost with data locality (i.e., putting data directly on the compute nodes so it doesn’t have to traverse the network in order to be processed). Kelly thinks those techniques are relics of a bygone era.

“When HDFS [was created], disk drives and networks were tied for being the slowest things in the cluster,” he said.

Enter Reed-Solomon error correction, QFS’s chosen method for ensuring reliable access to data, which Kelly says actually ends up shrinking the size of Hadoop clusters while improving their performance. (It’s the same method used on CDs and DVDs.) Rather than storing three full copies of each file like HDFS, which triples the storage required, QFS needs only 1.5x the raw capacity because it stripes data across nine different disk drives. Quantcast believes the smaller cluster size, combined with today’s 10-gigabit networks and the ability to read and write data in parallel, makes QFS significantly faster than HDFS at large scale.
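The arithmetic behind those figures is easy to check. Here is a back-of-the-envelope sketch; the six-data/three-parity split is inferred from the article’s nine-drive, 1.5x numbers rather than taken from QFS’s own documentation:

```python
# Back-of-the-envelope comparison of raw storage needed for 1 PB of
# logical data under HDFS-style triple replication vs. striping each
# block into 6 data pieces plus 3 Reed-Solomon parity pieces
# (the 6+3 split is an assumption based on "nine drives, 1.5x capacity").

def raw_capacity(logical_tb, data_pieces, parity_pieces):
    """Raw TB required when each block is split into `data_pieces`
    chunks plus `parity_pieces` recovery chunks."""
    overhead = (data_pieces + parity_pieces) / data_pieces
    return logical_tb * overhead

logical = 1000  # 1 PB of logical data, expressed in TB

# Triple replication is the degenerate case: 1 data piece, 2 extra copies.
hdfs_raw = raw_capacity(logical, data_pieces=1, parity_pieces=2)
# 6+3 striping: any 6 of the 9 pieces suffice to reconstruct the block,
# so it survives the loss of any 3 drives (replication survives only 2).
qfs_raw = raw_capacity(logical, data_pieces=6, parity_pieces=3)

print(hdfs_raw)  # 3000.0 TB -> 3.0x overhead
print(qfs_raw)   # 1500.0 TB -> 1.5x overhead
```

Halving the raw capacity for the same logical data is where the smaller-cluster claim comes from, and the parity scheme actually tolerates one more concurrent drive loss than triple replication does.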

QFS also comes equipped with other features Quantcast had to implement to make it production-ready. Among them: it is written in C++ with fixed-footprint memory management; it has access control based on users and groups; and it intelligently distinguishes node failures from planned maintenance and invokes data recovery accordingly.

It’s not for everyone, though

Despite its claimed improvements over HDFS, though, Kelly is quick to point out that QFS is probably not the best choice for everyone. It’s really designed for Hadoop users operating at petabyte scale, who have the technical prowess to handle a migration away from HDFS, and whose monthly data-processing costs are hitting the six-to-seven-figure range once things such as energy bills are accounted for.

“If your cluster only has 10 disk drives,” Kelly said, “[QFS] will save you $500, which is nice but …”

Likewise, if high availability is paramount, the latest version of HDFS might be preferable. “There’s a standby [in QFS]; it’s not quite as hot as theirs,” Kelly said. But availability isn’t super important to Quantcast, he said, and the company hasn’t had any real problems with QFS going down anyhow. When it does go down, it recovers pretty fast.

As for the rest of the file systems touting themselves as better alternatives to HDFS, Kelly didn’t have much to say. Quantcast’s efforts are focused on mega-scale Hadoop deployments, and he doesn’t see anything better for that use case. Although, he noted, Hadoop vendors probably shouldn’t get too upset over all the competition.

“I think some diversity in the ecosystem is probably not a bad thing,” he said, “and is probably a sign of healthy evolution.”

Feature image courtesy of Shutterstock user Lobke Peers.