Baidu claims deep learning breakthrough with Deep Speech

Chinese search engine giant Baidu says it has developed a speech recognition system, called Deep Speech, the likes of which has never been seen, especially in noisy environments. In restaurant settings and other loud places where other commercial speech recognition systems fail, the deep learning model proved accurate nearly 81 percent of the time.

That might not sound too great, but consider the alternative: commercial speech-recognition APIs against which Deep Speech was tested, including those for [company]Microsoft[/company] Bing, [company]Google[/company] and Wit.AI, topped out at nearly 65 percent accuracy in noisy environments. Those results probably underestimate the difference in accuracy, said [company]Baidu[/company] Chief Scientist Andrew Ng, who worked on Deep Speech along with colleagues at the company’s artificial intelligence lab in Palo Alto, California, because his team could only compare accuracy where the other systems all returned results rather than empty strings.


Ng said that while the research is still just research for now, Baidu is definitely considering integrating it into its speech-recognition software for smartphones and connected devices such as Baidu Eye. The company is also working on an Amazon Echo-like home appliance called CoolBox, and even a smart bike.

“Some of the applications we already know about would be much more awesome if speech worked in noisy environments,” Ng said.

Deep Speech also outperformed, by about 9 percent, top academic speech-recognition models on a popular dataset called Hub5’00. The system is based on a type of recurrent neural network, which are often used for speech recognition and text analysis. Ng credits much of the success to Baidu’s massive GPU-based deep learning infrastructure, as well as to the novel way them team built up a training set of 100,000 hours of speech data on which to train the system on noisy situations.

Baidu gathered about 7,000 hours of data on people speaking conversationally, and then synthesized a total of roughly 100,000 hours by fusing those files with files containing background noise. That was noise from a restaurant, a television, a cafeteria, and the inside of a car and a train. By contrast, the Hub5’00 dataset includes a total of 2,300 hours.

“This is a vast amount of data,” said Ng. ” … Most systems wouldn’t know what to do with that much speech data.”

Another big improvement, he said, came from using an end-to-end deep learning model on that huge dataset rather than using a standard, and computationally expensive, type of acoustic model. Traditional approaches will break recognition down into multiple steps, including one called speaker adaption, Ng explained, but “we just feed our algorithm a lot of data” and rely on it to learn everything it needs to. Accuracy aside, the Baidu approach also resulted in a dramatically reduced code base, he added.

You can hear Ng talk more about Baidu’s work in deep learning in this Gigaom Future of AI talk embedded below. That event also included a talk from Google speech recognition engineer Johan Schalkwyk. Deep learning will also play a prominent role at our upcoming Structure Data conference, where speakers from [company]Facebook[/company], [company]Yahoo[/company] and elsewhere will discuss how they do it and how it impacts their businesses.