Revolutionizing Web publishing with big data
By Derrick Harris
Building a system capable of revolutionizing analytics for some of the world’s biggest Web publishers doesn’t take a team of Ph.D.s and thousands of servers. All it takes is a few smart people, cloud computing and a serious understanding of big data. Just ask Parse.ly.
The company, which officially launched in January and provides an SaaS application for drilling deep into publishing data, was doing some impressive things with a team of just eight employees as of early February, when I spoke with CEO Sachin Kamdar and CTO Andrew Montalenti. The result is a slick engine called Dash, used to see what content is driving traffic and to figure out what types of future content might catch fire.
Whereas some publishers have strict policies around tagging articles and some, like the New York Times, can hire data scientists to analyze and visualize traffic trends, many can’t or just don’t want to. Those are the customers Parse.ly is targeting.
Users can sort by authors, topics, sections, posts, trends and other metrics to get a real, historical understanding of their traffic beyond just seeing what posts or pages are hot at that moment. As Kamdar explained to me, users can see what posts do better across what topic pages, what pages perform better in what geographies, and what topics are trending and which have peaked, or they can find myriad other insights, using only a mouse.
They can also highlight trends using data (anonymous, of course) from the collective of Parse.ly users — which includes the Atlantic, the Next Web and U.S. News and World Report — to get a more comprehensive view of what is happening for specific topics. The best part: Parse.ly customers don’t have to do a thing to get this sort of granularity in their analytics.
Apart from the slick user experience, it is Parse.ly’s infrastructure that makes the service. CTO Montalenti told me Dash is hosted on the Amazon Web Services and Rackspace cloud computing platforms and that it consists of a data aggregation layer and a processing layer. The processing layer analyzes the text of Web pages using Parse.ly’s homemade natural-language-processing system to classify authors, topics and other characteristics. The aggregation layer indexes content in near real time into predefined buckets so queries can be completed as fast as possible.
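To make the idea of indexing into predefined buckets concrete, here is a minimal sketch in Python. It is not Parse.ly's code: the `BucketIndex` class, the event fields and the hourly bucket size are all assumptions chosen for illustration. The point is only that counting each page view into fixed (dimension, hour) buckets at write time lets a later query read one small bucket instead of scanning raw events.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


class BucketIndex:
    """Hypothetical pre-aggregated index: one counter per (dimension, hour)."""

    def __init__(self, dimensions=("author", "topic", "section")):
        # The dimensions are fixed up front; that is what makes them "predefined".
        self.dimensions = dimensions
        self.buckets = defaultdict(Counter)

    def record(self, event):
        """Count one page-view event into every bucket it belongs to."""
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        for dim in self.dimensions:
            value = event.get(dim)
            if value is not None:
                self.buckets[(dim, hour)][value] += 1

    def top(self, dim, hour, n=3):
        """Return the n most-viewed values for one dimension in one hour."""
        bucket = self.buckets.get((dim, hour))
        return bucket.most_common(n) if bucket else []


if __name__ == "__main__":
    index = BucketIndex()
    now = datetime(2012, 2, 1, 10, 15, tzinfo=timezone.utc)
    index.record({"timestamp": now, "author": "alice", "topic": "cloud"})
    index.record({"timestamp": now, "author": "alice", "topic": "big data"})
    index.record({"timestamp": now, "author": "bob", "topic": "cloud"})
    print(index.top("author", now.replace(minute=0)))
```

The trade-off in a design like this is that only the dimensions chosen in advance can be queried quickly; anything else has to come from the raw data.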
When I spoke with Kamdar and Montalenti in early February, they told me Parse.ly was processing about 700 page views per month for its customers and had crawled about 4 million unique URLs, representing years’ worth of content. But that content isn’t only for Dash users’ eyes.
Montalenti said Parse.ly also keeps long-term stores of publisher data to run batch analyses on later, using Amazon’s Elastic MapReduce service. This way, the team can spot long-term trends and patterns that might help improve Dash’s features or suggest new categories to add to the real-time index. In theory, he added, Parse.ly could also run custom analytics for its customers to spot patterns in their specific content that might help them figure out how to market certain content to certain users or determine the shelf life of certain topics.
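As a rough illustration of the kind of batch job described here, the sketch below runs a map step and a reduce step over stored page-view records to estimate how long each topic kept drawing traffic. It is plain Python that imitates the MapReduce pattern in memory; it does not use the Elastic MapReduce API, and the record fields, function names and the definition of shelf life (days between a topic's first and last recorded view, inclusive) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def map_view(record):
    """Map step: emit a (topic, day) pair for each stored page view."""
    yield record["topic"], record["day"]


def reduce_topic(topic, days):
    """Reduce step: summarize all view days seen for one topic."""
    days = sorted(days)
    return {
        "topic": topic,
        "views": len(days),
        # Assumed definition: span from first to last view, inclusive.
        "shelf_life_days": (days[-1] - days[0]).days + 1,
    }


def run_batch(records):
    """Group mapped pairs by key, then reduce each group (in-memory stand-in)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_view(record):
            groups[key].append(value)
    return [reduce_topic(topic, days) for topic, days in sorted(groups.items())]


if __name__ == "__main__":
    views = [
        {"topic": "cloud", "day": date(2012, 1, 1)},
        {"topic": "cloud", "day": date(2012, 1, 10)},
        {"topic": "hadoop", "day": date(2012, 1, 5)},
    ]
    for row in run_batch(views):
        print(row)
```

Because the map and reduce functions only ever look at one record or one topic at a time, the same logic could in principle be spread across many machines, which is what makes this pattern suited to long-term stores of data.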
In some senses, Parse.ly is the ultimate big data application in that it is both a consumer and provider of advanced analytics. Big data powers its product, but it also provides the capabilities necessary for Parse.ly to improve the product and expand its business. And for now, at least, the right techniques are letting Parse.ly do all of this with a team you could count on two hands.