Microsoft is putting its speech-recognition expertise into action on its Azure cloud platform with a new service that lets users index and search their audio and video files based on the words spoken in them. The new service, called the Microsoft Azure Media Services Indexer, grew out of a Microsoft Research project called MAVIS.
The indexer works by listening to a user’s content and extracting keywords as metadata, which can then be used for a variety of purposes. Search is probably the most obvious one, but the metadata could also be used to categorize content or, [company]Microsoft[/company] claims, add descriptions or captions to it. This will help people discover content and get a sense of what’s in it, but it will also help content creators bring some order to their digital libraries and possibly make more money off them once they start matching ads to keywords and concepts.
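For a rough idea of what that metadata makes possible, here’s a minimal sketch in Python of a keyword index over transcribed speech. It assumes a speech recognizer has already produced timed transcripts, and everything in it (the `SpeechIndex` class, the file names) is hypothetical illustration, not part of Microsoft’s actual API:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "our", "now"}

def extract_keywords(text):
    """Naive keyword extraction: lowercase, strip punctuation, drop stopwords."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

class SpeechIndex:
    """Hypothetical inverted index mapping spoken keywords to
    (file name, offset in seconds) pairs."""

    def __init__(self):
        self.index = defaultdict(list)

    def add(self, file_name, timed_transcript):
        # timed_transcript: (offset_seconds, spoken_text) pairs, as a
        # speech recognizer might emit them.
        for offset, text in timed_transcript:
            for keyword in extract_keywords(text):
                self.index[keyword].append((file_name, offset))

    def search(self, keyword):
        return self.index.get(keyword.lower(), [])

# Index two hypothetical recordings, then search by a spoken word.
idx = SpeechIndex()
idx.add("earnings_call.mp3", [(12.0, "Revenue grew in the cloud segment.")])
idx.add("keynote.mp4", [(95.5, "Our cloud platform now supports indexing.")])
print(idx.search("cloud"))
# -> [('earnings_call.mp3', 12.0), ('keynote.mp4', 95.5)]
```

The same keyword-to-timestamp structure is what lets a service jump a viewer to the moment a word is spoken, or match ads against the concepts in a clip.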
While the resulting indexes aren’t particularly high-tech as far as database applications go, the speech-recognition capabilities are based on deep learning, the same set of techniques that powers the upcoming real-time translation feature in Microsoft’s Skype application. If the Azure indexing service is English-only for now, Microsoft’s work on machine translation suggests it could expand to other languages at some point.

It also wouldn’t be surprising to see these capabilities come to Bing, if they’re not there in some capacity already. [company]Google[/company] has applied speech recognition to video in similar, though more limited, ways in the past. In 2008, for example, it indexed politicians’ YouTube videos so viewers could search their speeches by keyword. And YouTube currently lets users add automatically generated captions to their videos.
It seems pretty clear, though, that commercial speech-recognition services are just the first step in the quest by companies such as Microsoft, Google, Facebook and, clearly, Baidu to help users navigate the rich media they’re creating and consuming. Computer vision has received a lot of attention recently as companies ramp up their efforts to recognize what appears in photos and videos (try, for example, searching your unlabeled Google+ photos by keyword, or using the product-recognition feature on the Amazon Fire phone) and, in some cases, to piece together video frames into short visual summaries or highlight reels.
When you consider how much audio and visual content we’re producing, it’s easy to understand why speech recognition, computer vision and language understanding, and the techniques for achieving them, are such hot topics right now. The web, and even the corporate server room, isn’t just full of text pages anymore, and manually created metadata can only take us so far when we’re swimming in YouTube clips, Dropcam footage, Netflix movies, Flickr photos, surveillance tapes and a whole sea of other unlabeled or poorly labeled content.
Companies in the business of delivering content, or even just information, stand to make a lot of money if they’re able to help consumers or businesses wade through it all and find what they need.