How Facebook Brings a New Data Center Online

For Facebook, bringing its Prineville, Ore., data center online last month required more than building a specialized facility with customized servers. According to a post today on the Facebook Engineering blog, the social networking leader also undertook an effort called “Project Triforce,” which entailed spinning up a replica of the Prineville data center on an existing set of servers in Virginia to ensure the site could run smoothly across three regions without falling on its face. In true Facebook fashion, it didn’t take the task lightly.

As author Sanjeev Kumar wrote, the project meant facing down a multitude of challenges, including:

Uncharted territory: The size and complexity of our infrastructure had increased so dramatically over the years that estimating the effort required to build a successful data center was no small task. The number of components in our infrastructure meant that testing each independently would be inadequate: it would be difficult to have confidence that we had full test coverage of all components, and unexpected interactions between components wouldn’t be tested. This required a more macro approach – we needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible.

Software complexity: Facebook has hundreds of specialized back-end services that serve products like News Feed, Search and Ads. While most of these systems were designed to work with multiple data center regions, they hadn’t been tested outside of a two-region configuration.

New configurations: Recent innovations at Facebook in using a Flashcache with MySQL allows us to achieve twice the throughput on each of our new MySQL machines, cutting our storage tier costs in half. However, this means that we need to run two MySQL instances on each machine in the new data center. This new setup was untested and required changes in the related software stacks.

Unknown unknowns: In our large complex infrastructure, the assumption that there are only two regions has crept into the system in subtle ways over the years. Such assumptions needed to be uncovered and fixed.

Time crunch: Our rapidly growing user base and traffic load meant we were working on a very tight schedule – there was very little time between when these machines became physically available to us and when they had to be ready to serve production traffic. This meant that we needed to have our software stack ready well before the hardware became available in Oregon.
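To make the “unknown unknowns” item concrete, here is a minimal sketch of how a two-region assumption can hide in ordinary code. It is illustrative only: the region names and both functions are hypothetical and are not taken from Facebook's systems or from Kumar's post.

```python
# Hypothetical region names, used purely for illustration.
TWO_REGIONS = ["region_a", "region_b"]
THREE_REGIONS = ["region_a", "region_b", "region_c"]


def other_region_fragile(current):
    """Return "the other" region. Silently assumes exactly two regions exist."""
    return "region_b" if current == "region_a" else "region_a"


def peer_regions(current, regions):
    """Region-agnostic version: return every configured region except the current one."""
    if current not in regions:
        raise ValueError(f"unknown region: {current}")
    return [r for r in regions if r != current]


# With two regions, both approaches agree.
print(other_region_fragile("region_a"), peer_regions("region_a", TWO_REGIONS))

# With a third region, the fragile version never returns region_c and raises no error,
# while the region-agnostic version returns both peers.
print(other_region_fragile("region_a"), peer_regions("region_a", THREE_REGIONS))
```

The fragile version does not fail when a third region appears. It just quietly ignores it, which is why assumptions like this tend to surface only when the whole stack is exercised in a realistic multi-region environment.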

To solve these problems, Facebook created Project Triforce, named after the “Legend of Zelda” games. The replica was actually an active cluster in the company’s existing Virginia data center, configured to resemble a third region (Prineville), and it handled production traffic. According to Kumar, the only thing Facebook simulated was the databases “because we didn’t want to create a full replica of our entire set of databases.”

Facebook automated the process of configuring, provisioning and testing new infrastructure resources via a homegrown tool called “Kobold.” According to Kumar, “tens of thousands of servers were provisioned, imaged and brought online in less than 30 days” and production traffic was served within 60 days. Kumar explained that, with Kobold, it now takes one person less than 10 minutes to turn up production traffic.

Should the company decide to open source Kobold, one could make a valid argument that the tool would have an even bigger impact than Facebook’s Open Compute project did.

Whatever comes of Kobold or any of the other methods that Facebook used to carry out Project Triforce, though, the company’s status as a webscale infrastructure innovator is now without question. From servers to Cassandra to Hadoop, and now to Kobold, Facebook just keeps building its way out of the unique infrastructural issues that its service brings about.

Zelda Triforce pumpkin image courtesy of Flickr user Visions of Domino.

Shield logo courtesy of Facebook.