MIT researchers teach computers to learn what’s happening in videos

MIT researchers have developed a method for identifying what’s happening in video files by taking a lesson from approaches to understanding what’s happening in language. It’s potentially very important research. Advances in computer vision are opening our eyes to what’s possible when we’re able to analyze still images, but video provides much more context and therefore promises an even greater depth of understanding.

The new approach to video analysis takes a page from approaches to textual analysis, such as natural language processing, that examine each part of a piece of content in order to figure out what the whole thing means. With a sentence of text, for example, algorithms can identify which words are the nouns, verbs, adjectives and other parts of speech, and then determine what the combination of those words and their order means. For video, the MIT researchers’ algorithms identify the things happening in individual frames and then determine what those mean when combined in a particular order.

As one might expect, identifying the actions taking place within videos is a machine learning problem. This isn’t a wholly unsupervised deep learning system like some of the ones working on object recognition, although it does require the computer to teach itself certain things. The algorithm was trained on videos of specific actions, but it had to learn on its own which steps comprise a larger action (e.g., making tea or lifting weights) and the normal flow from one step to the next.
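To make the idea of steps and their normal flow concrete, here is a minimal sketch in Python. It is not the MIT researchers' algorithm: it swaps in a simple first-order Markov chain over step labels, and it assumes the individual steps in each video have already been identified and named, which the real system had to learn on its own. The step names, the training sequences and the functions `learn_transitions` and `guess_action` are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<start>"  # hypothetical marker for the beginning of a video


def learn_transitions(examples):
    """Count step-to-step transitions for each action.

    `examples` maps an action name to a list of step sequences
    (each sequence is a list of step labels, assumed to be given).
    """
    models = {}
    for action, sequences in examples.items():
        counts = defaultdict(Counter)
        for seq in sequences:
            # Pair every step with the step that preceded it.
            for prev, nxt in zip([START] + seq, seq):
                counts[prev][nxt] += 1
        models[action] = counts
    return models


def guess_action(models, vocab_size, partial_steps, alpha=0.1):
    """Return the action whose learned flow best explains a partial sequence.

    Uses additive smoothing (alpha) so unseen transitions get a small,
    nonzero probability instead of zero.
    """
    best_action, best_score = None, float("-inf")
    for action, counts in models.items():
        log_prob = 0.0
        for prev, nxt in zip([START] + partial_steps, partial_steps):
            total = sum(counts[prev].values())
            prob = (counts[prev][nxt] + alpha) / (total + alpha * vocab_size)
            log_prob += math.log(prob)
        if log_prob > best_score:
            best_action, best_score = action, log_prob
    return best_action


# Toy, made-up training data: two actions, each seen twice.
examples = {
    "making_tea": [
        ["fill_kettle", "boil_water", "add_teabag", "pour_water"],
        ["fill_kettle", "boil_water", "pour_water", "add_teabag"],
    ],
    "lifting_weights": [
        ["grip_bar", "lift_bar", "lower_bar"],
        ["grip_bar", "lift_bar", "lower_bar", "lift_bar", "lower_bar"],
    ],
}

models = learn_transitions(examples)
vocab_size = len({s for seqs in examples.values() for seq in seqs for s in seq})

# Only the first two steps have been observed so far.
print(guess_action(models, vocab_size, ["fill_kettle", "boil_water"]))
```

Because the guess is scored one transition at a time, it can be updated as each new step arrives rather than waiting for the whole video to finish.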

Source: Jose-Luis Olivares/MIT (video stills from Nesnad/Wikimedia Commons)

This type of algorithm could prove very effective for helping to tag and index collections of online videos (think poorly labeled YouTube videos or cell phone videos, for example), but it appears the researchers are targeting even bigger applications. Because the algorithm is computationally efficient and good at predicting events based on partially completed actions, it could identify actions even from streaming video sources. The researchers cite some specific medical uses — including monitoring exercise form or whether people remember to take their medication — but it could theoretically be applied to anything from spotting an armed robbery at an ATM to alerting zoologists that the panda bears at the zoo are breeding.

However, if this sort of algorithm sounds amazing, it won’t for long. We’ve already covered other interesting approaches to video analysis with machine learning, including another research project that learns the theme of videos in order to make brief summaries of them. Dropcam, a startup that makes cloud-connected cameras, is working on its own approach to identifying what’s normal and what’s anomalous in the areas its cameras are monitoring. Dropcam CEO Greg Duffy will explain the technological underpinnings of its service at our Structure conference next month in San Francisco.

More broadly, it’s clear that video will soon become just as important a source of data as text and photos for smart companies and other institutions trying to glean insights in any number of areas. There was always lots of information buried within tweets, photos and videos, but few organizations had the manpower to look at them all. Thanks to advances in artificial intelligence, they’ll soon only need a credit card.