Ever wonder how Pinterest keeps track of who follows whom and what each user’s interests are? Well, wonder no more: The company detailed the architecture behind its combination user-and-interest graph in a post on Friday morning.
Here are the highlights:
- The database stores a variety of information about each user, including who they follow (explicitly and implicitly), what boards they follow and unfollow, and who follows that user. It also tracks who follows and unfollows each individual board.
- Pinterest’s graph runs on a Redis environment, hosted in the Amazon Web Services cloud, that has been split into 8,192 shards.
- The previous architecture was a classic MySQL-plus-memcached environment, but it was reaching its limits for Pinterest’s needs. (Facebook also got rid of MySQL and memcached for its graph database, replacing them with a system called TAO, but both sites still use MySQL and memcached for other services.)
- The overall size of Pinterest’s graph is less than 3 terabytes, which easily fits in memory (Redis is an in-memory database).
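To make the sharding numbers concrete, here is a minimal Python sketch of how a follow graph might be spread across 8,192 shards. The post does not say how Pinterest maps users to shards, so the modulo rule, the `ShardedFollowGraph` class and the numeric user IDs are all assumptions made for illustration, and plain dictionaries stand in for Redis instances.

```python
NUM_SHARDS = 8192          # shard count reported in the post
GRAPH_BYTES = 3 * 10**12   # upper bound: "less than 3 terabytes" (decimal TB assumed)


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hypothetical routing rule: modulo on a numeric user ID."""
    return user_id % num_shards


class ShardedFollowGraph:
    """In-memory stand-in for the sharded store; each dict plays one shard."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]

    def _record(self, user_id: int) -> dict:
        # Look up (or create) the user's record on the shard that owns the user.
        shard = self.shards[shard_for(user_id, self.num_shards)]
        return shard.setdefault(user_id, {"following": set(), "followers": set()})

    def follow(self, follower_id: int, followee_id: int) -> None:
        # Each side of the edge is written to the shard that owns that user.
        self._record(follower_id)["following"].add(followee_id)
        self._record(followee_id)["followers"].add(follower_id)


graph = ShardedFollowGraph()
graph.follow(10, 8200)
print(shard_for(10), shard_for(8200))   # users 10 and 8200 land on different shards
print(GRAPH_BYTES // NUM_SHARDS)        # average bytes per shard at the 3 TB bound
```

If the data were spread evenly, the 3-terabyte upper bound would work out to at most roughly 366 MB per shard, which helps explain why the whole graph fits comfortably in memory.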
Pinterest’s Abhi Khune, who authored the blog post, also explains how Pinterest’s graph architecture differs from that of a site like Facebook or Twitter. Essentially, the difference has to do with how the sites operate: Pinterest is more like Twitter in that one user can follow many other users without reciprocation (whereas Facebook relationships are generally mutual friendships), but Pinterest must also take into account each user’s interests and weigh them accordingly. There are explicit follows (i.e., a user clicks to follow a certain user) and implicit follows (i.e., a user clicks to follow a certain board but not the user).
There are also explicit unfollows, which occur when a user follows another user but unfollows one of her boards because it doesn’t match the follower’s interests.
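One way to read these three relationship types is as set operations: start with every board owned by an explicitly followed user, add the boards followed on their own, then subtract the boards the user has explicitly unfollowed. The Python sketch below follows that reading. The `FollowState` class, the `boards_in_feed` function and the sample names are hypothetical, and the post does not confirm that Pinterest computes feeds this way.

```python
from dataclasses import dataclass, field


@dataclass
class FollowState:
    """One user's relationships, using the article's three categories."""
    followed_users: set = field(default_factory=set)     # explicit follows
    followed_boards: set = field(default_factory=set)    # implicit follows
    unfollowed_boards: set = field(default_factory=set)  # explicit unfollows


def boards_in_feed(state: FollowState, boards_by_owner: dict) -> set:
    """Return the boards this user effectively follows (assumed semantics)."""
    boards = set(state.followed_boards)
    for owner in state.followed_users:
        # Following a user pulls in all of that user's boards.
        boards |= boards_by_owner.get(owner, set())
    # An explicit unfollow removes a single board while keeping the user follow.
    return boards - state.unfollowed_boards


boards_by_owner = {
    "ana": {"ana/recipes", "ana/travel"},
    "raj": {"raj/woodworking", "raj/cars"},
}
state = FollowState(
    followed_users={"ana"},
    followed_boards={"raj/woodworking"},
    unfollowed_boards={"ana/travel"},
)
print(sorted(boards_in_feed(state, boards_by_owner)))
```

In this example the user still follows ana but no longer sees her travel board, and sees only one of raj’s boards without following raj himself.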
It might seem a bit complex but, like most decisions by web companies, it’s all about creating a better user experience. As Khune points out, users’ homepages get more tailored as they follow and unfollow certain people and topics. And most of Pinterest’s pages either show a follower/following count or check whether the user is following the board or user she’s currently looking at. Pinterest employees need to run queries over these relationships, as well.
Thus, it’s critical that both the data structure and the database architecture are designed to process the relationships, perform at minimal latency and handle lots of concurrent users without crashing.
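The two page-level lookups mentioned here, a follower count and an "is she following this?" check, are cheap when each user’s relationships are kept as sets, which is one reason a set-oriented in-memory store suits this workload. The sketch below uses plain Python sets to illustrate the idea; the `FollowIndex` class and its method names are hypothetical and do not come from Pinterest’s post.

```python
from collections import defaultdict


class FollowIndex:
    """Keeps both directions of each follow edge so either lookup is direct."""

    def __init__(self) -> None:
        self.followers = defaultdict(set)   # target -> users who follow it
        self.following = defaultdict(set)   # user -> targets the user follows

    def follow(self, user: str, target: str) -> None:
        self.following[user].add(target)
        self.followers[target].add(user)

    def unfollow(self, user: str, target: str) -> None:
        # discard() is a no-op when the edge does not exist.
        self.following[user].discard(target)
        self.followers[target].discard(user)

    def follower_count(self, target: str) -> int:
        # .get avoids creating empty entries for targets nobody follows.
        return len(self.followers.get(target, ()))

    def is_following(self, user: str, target: str) -> bool:
        return target in self.following.get(user, ())


index = FollowIndex()
index.follow("ana", "raj")
index.follow("lee", "raj")
index.unfollow("lee", "raj")
print(index.follower_count("raj"), index.is_following("ana", "raj"))
```

Storing the edge twice costs extra memory but means neither lookup has to scan other users’ records.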
Khune didn’t get into how, or whether, Pinterest is using all the relationships among users and boards to perform graph analysis for the purpose of pattern detection and recommendations. It seems likely it is (although the Pinterest data scientist interviews I’ve read talk more about analyzing user behavior across devices), but that type of work and most analytic jobs are almost certainly done using Amazon’s Elastic MapReduce Hadoop service and other tools designed for analysis rather than operations.