Netflix analyzes a lot of data about your viewing habits

Netflix’s algorithms for recommending movies to customers might not be perfect, but it isn’t for lack of trying. As Netflix Senior Data Scientist Mohammad Sabah described at the Hadoop Summit on Wednesday, the company is capturing and analyzing an incredible amount of data to try to figure out what you want to watch next. It’s important work: already, Sabah said, 75 percent of users select movies based on the company’s recommendations, and Netflix wants to make that number even higher.

Here’s a taste of what Netflix is collecting, and how much:

  • More than 25 million users
  • About 30 million plays per day (and it tracks every time you rewind, fast forward and pause a movie)
  • More than 2 billion hours of streaming video watched during the last three months of 2011 alone
  • About 4 million ratings per day
  • About 3 million searches per day
  • Geo-location data
  • Device information
  • Time of day and week (it now can verify that users watch more TV shows during the week and more movies during the weekend)
  • Metadata from third parties such as Nielsen
  • Social media data from Facebook and Twitter

However, Netflix’s most interesting use of data might be its attempts to analyze what’s going on in the movies themselves. Sabah said it already captures JPEGs and notes the exact time that credits start rolling, and it’s looking to take other characteristics into account. Attributes such as volume, colors and scenery could give valuable signals about what viewers like.

Capturing data is easy. Predictions are hard.

But even with all this data, figuring out what users want to watch is hard. Sabah illustrated this using one of Netflix’s many personalization algorithms, which tries to predict what users will watch next by looking at which titles typically follow other titles. In a very simplified example, if 30 people watched The Fighter and 24 of them followed it by watching Mad Men, the transition probability between those two pieces of content is 24/30, or .80.
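To make the arithmetic concrete, here is a minimal Python sketch of that simplified example. It is an illustration only, not Netflix's code: the function name and the made-up viewing histories (including the placeholder "Some Other Title") are assumptions, and it simply counts which title directly follows which.

```python
from collections import Counter, defaultdict


def transition_probabilities(histories):
    """Estimate P(next title | current title) from ordered viewing histories.

    `histories` is a list of lists, each one a single user's plays in order.
    (Hypothetical helper for illustration; not from the talk.)
    """
    pair_counts = defaultdict(Counter)
    for history in histories:
        # Walk consecutive pairs: (first, second), (second, third), ...
        for current, following in zip(history, history[1:]):
            pair_counts[current][following] += 1

    probs = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probs[current] = {title: n / total for title, n in followers.items()}
    return probs


# Made-up data matching the article's numbers: 30 people watched The Fighter,
# and 24 of them watched Mad Men next.
histories = (
    [["The Fighter", "Mad Men"]] * 24
    + [["The Fighter", "Some Other Title"]] * 6
)

probs = transition_probabilities(histories)
print(probs["The Fighter"]["Mad Men"])  # 0.8
```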

Of course, it’s not that simple, Sabah explained. In some cases, for instance, a popularity bias will arise that artificially skews a recommendation toward popular movies or TV shows rather than what’s really relevant based on a viewer’s interests. Terminator might be followed by Big Daddy, then Family Guy, then Hot Tub Time Machine: four pieces of content whose most prominent linking factor is their overall popularity. Popularity does matter when recommending movies, though, so Netflix must account for it by factoring it into the transition algorithm.
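Sabah did not spell out how Netflix factors popularity in. One common way to do it, shown here purely as an assumption, is to divide each transition probability by the title's overall share of plays, with a damping exponent controlling how hard popular titles are penalized. The titles, numbers and function name below are all hypothetical.

```python
def popularity_adjusted_scores(transition_probs, popularity, damping=1.0):
    """Score each candidate as P(next | current) / popularity ** damping.

    damping=0 ignores popularity entirely; damping=1 fully normalizes by it.
    (Illustrative technique; not confirmed as Netflix's approach.)
    """
    return {
        title: p / (popularity[title] ** damping)
        for title, p in transition_probs.items()
    }


# Hypothetical numbers: the comedy follows the current title more often,
# but mostly because it is watched a lot by everyone.
transition_probs = {"Blockbuster Comedy": 0.30, "Boxing Drama": 0.20}
popularity = {"Blockbuster Comedy": 0.40, "Boxing Drama": 0.05}  # share of all plays

scores = popularity_adjusted_scores(transition_probs, popularity)
best = max(scores, key=scores.get)
print(best)  # Boxing Drama
```

Setting damping somewhere between 0 and 1 keeps popularity as a signal, which fits Sabah's point that it still matters, while stopping it from dominating every recommendation.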

Complicating things even more is that, contractually, movies are only available for streaming and only show up on the landing page for certain periods of time. So, recommendations not only have to be relevant, they also have to be available.
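In code terms, the availability constraint amounts to a filter applied to an already-ranked list. The sketch below assumes each title has a single licensing window with a start and end date; the titles, dates and function name are invented for illustration, and real contracts are surely messier.

```python
from datetime import date


def available_recommendations(ranked_titles, license_windows, today):
    """Keep only titles whose licensing window covers `today`, preserving rank order.

    `license_windows` maps title -> (start_date, end_date), both inclusive.
    (Hypothetical data model for illustration.)
    """
    result = []
    for title in ranked_titles:
        window = license_windows.get(title)
        if window is None:
            continue  # no streaming rights on record
        start, end = window
        if start <= today <= end:
            result.append(title)
    return result


# Made-up ranking and licensing windows.
ranked = ["Title A", "Title B", "Title C"]
windows = {
    "Title A": (date(2012, 1, 1), date(2012, 3, 31)),   # already expired
    "Title B": (date(2012, 5, 1), date(2012, 12, 31)),  # currently licensed
    # "Title C" has no window at all
}

print(available_recommendations(ranked, windows, date(2012, 6, 14)))  # ['Title B']
```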

Sabah said the ultimate goal is to show Netflix customers content they’ll view to completion and then recommend the next thing they’ll view to completion (as opposed to, presumably, the current collection of lists displaying “More like …” or “Top 10 for …”). But, clearly, it’s not there yet.

Feature image courtesy of Flickr user roblawton.