Pinterest shed more light on how it analyzes data in real time in a blog post on Wednesday. The social scrapbook and visual discovery service also revealed that it is exploring a combination of MemSQL and Spark Streaming to improve the process.
Currently, Pinterest uses a custom-built log-collecting agent dubbed Singer that the company attaches to all of its application servers. Singer collects the application log files and, with the help of the real-time messaging framework Apache Kafka, transfers that data to Storm, Spark and other “custom built log readers” that “process these events in real-time.”
Pinterest also uses its own log-persistence service called Secor to read that log data moving through Kafka and then write it to Amazon S3, after which Pinterest’s “self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing,” the blog post stated.
Although this current system seems to be working decently for Pinterest, the company is also exploring how it can use MemSQL to help when people need to query the data in real time. So far, the Pinterest team has developed a prototype of a real-time data pipeline that uses Spark Streaming to pass data into MemSQL.
Here’s what this prototype looks like:
In this prototype, Pinterest uses Spark Streaming to pass the data related to each pin (along with geolocation information and the category the pin belongs to) to MemSQL, where the data is then available to be queried.
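To make the flow concrete, here is a minimal sketch of the same idea in Python. It is not Pinterest's code: an in-memory SQLite database stands in for MemSQL, a plain list of micro-batches stands in for Spark Streaming, and the event fields (`pin_id`, `category`, `lat`, `lon`) and function names are hypothetical, chosen only for illustration.

```python
import json
import sqlite3

# Hypothetical pin events grouped into micro-batches, the way a streaming
# job would hand them over. Field names are assumptions, not Pinterest's schema.
MICRO_BATCHES = [
    [
        '{"pin_id": 1, "category": "food", "lat": 37.77, "lon": -122.42}',
        '{"pin_id": 2, "category": "travel", "lat": 40.71, "lon": -74.01}',
    ],
    [
        '{"pin_id": 3, "category": "food", "lat": 51.51, "lon": -0.13}',
    ],
]


def create_store():
    """Create the SQL store (SQLite here, standing in for MemSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pins ("
        "pin_id INTEGER PRIMARY KEY, category TEXT, lat REAL, lon REAL)"
    )
    return conn


def write_batch(conn, raw_events):
    """Parse one micro-batch of JSON events and insert it into the store."""
    events = [json.loads(raw) for raw in raw_events]
    rows = [(e["pin_id"], e["category"], e["lat"], e["lon"]) for e in events]
    conn.executemany("INSERT INTO pins VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def pins_per_category(conn):
    """The kind of query an analyst could run while data is still arriving."""
    cur = conn.execute(
        "SELECT category, COUNT(*) FROM pins "
        "GROUP BY category ORDER BY category"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = create_store()
    for batch in MICRO_BATCHES:
        write_batch(conn, batch)
        # Each batch is queryable as soon as it has been written.
        print(pins_per_category(conn))
```

The point of the sketch is the shape of the pipeline: each small batch is written as it arrives, and ordinary SQL can be run against the table between batches rather than after a later batch-processing job.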
For analysts who understand SQL, the prototype could be a useful way to analyze data in real time using a mainstream language.