In August, Google open sourced a tool called word2vec that lets developers and data scientists experiment with language-based deep learning models. Now, the company has published a research paper showing off another use for the technology — automatically detecting the similarities between different languages to create, for example, more accurate dictionaries.
The method works by analyzing how words are used in different languages and representing those relationships as vectors, which can then be plotted on a two-dimensional graph. Obviously, a computer doesn’t need a visualization to understand the results of the computations, but this one from the paper is instructive in showing the general idea of what the technique does.
Here’s how authors Tomas Mikolov, Quoc V. Le and Ilya Sutskever describe the concept and the chart:
“In Figure 1, we visualize the vectors for numbers and animals in English and Spanish, and it can be easily seen that these concepts have similar geometric arrangements. The reason is that as all common languages share concepts that are grounded in the real world (such as that cat is an animal smaller than a dog), there is often a strong similarity between the vector spaces. The similarity of geometric arrangements in vector spaces is the key reason why our method works well.”
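To make the "similar geometric arrangements" idea concrete, here is a minimal sketch of one way to exploit it: fit a linear map from one language's vector space to the other's using a small seed dictionary, then translate a new word by mapping its vector and picking the nearest neighbor. The linear map is an assumption of this sketch, and everything in it is synthetic: the "English" and "Spanish" vectors are random numbers, and the names `true_W` and `translate` are made up for illustration rather than taken from the paper or from word2vec.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for real word vectors (assumption: toy data only).
# X holds 5-dimensional "English" vectors for 50 seed-dictionary words.
X = rng.normal(size=(50, 5))

# Build a 4-dimensional "Spanish" space with a similar geometry by applying
# a hidden linear transform plus a little noise.
true_W = rng.normal(size=(5, 4))
Z = X @ true_W + 0.01 * rng.normal(size=(50, 4))

# Fit W by least squares so that X @ W is as close as possible to Z.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)


def translate(x, target_vectors, W):
    """Map x into the target space and return the index of the
    target vector with the highest cosine similarity."""
    mapped = x @ W
    sims = (target_vectors @ mapped) / (
        np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(mapped)
    )
    return int(np.argmax(sims))


# Held-out words that were not in the seed dictionary.
X_new = rng.normal(size=(10, 5))
Z_new = X_new @ true_W
hits = sum(translate(X_new[i], Z_new, W) == i for i in range(10))
print(f"{hits} of 10 held-out words mapped to the right neighbor")
```

On real data the two spaces are only roughly related, so the nearest neighbor is often a related word rather than the exact dictionary entry.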
The actual techniques they used were the Skip-gram and Continuous Bag of Words (CBOW) models, which are the same ones exposed by word2vec. The authors describe them this way:
“The training objective of the CBOW model is to combine the representations of surrounding words to predict the word in the middle. … Similarly, in the Skip-gram model, the training objective is to learn word vector representations that are good at predicting its context in the same sentence. … It is unlike traditional neural network based language models … where the objective is to predict the next word given the context of several preceding words.”
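The difference between the two objectives is easiest to see in the training examples each one generates from a sentence. The sketch below is illustrative only: the function names and the `window` parameter are my own, and it shows just the pairing of words, not the neural network training that word2vec actually performs.

```python
def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding words are the input, the middle word is the target."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


def skipgram_pairs(tokens, window=2):
    """Skip-gram: the middle word is the input, each surrounding word is a target."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]


sentence = "the cat sat on the mat".split()
cbow = list(cbow_pairs(sentence, window=1))
skip = list(skipgram_pairs(sentence, window=1))

for context, target in cbow[:3]:
    print("CBOW      ", context, "->", target)
for center, neighbor in skip[:3]:
    print("Skip-gram ", center, "->", neighbor)
```

In both cases the model looks at words on either side of a position, which is what separates these models from the next-word predictors the authors mention.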
Here’s how I explained their general functionality when covering the word2vec release:
“Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and ‘linguistic regularities’ (e.g., vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)).”
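The king/queen arithmetic can be reproduced in a few lines. This is a toy sketch: the three-dimensional vectors are hand-made so the example works, whereas real word2vec vectors have hundreds of dimensions and are learned from text. The helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Hand-made toy vectors (assumption: not real word2vec output).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([-0.5, 0.4, 0.3]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c),
    ignoring the three query words themselves."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy("king", "man", "woman", vectors))
```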
You can see the power of the translation application of these models even when they’re not entirely accurate. One example the authors note in translating words from Spanish to English is “imperio.” The dictionary entry is “empire,” but the Google system suggested conceptually similar words: “dictatorship,” “imperialism” and “tyranny.” Even if the model can’t replace a dictionary (in fact, the authors note, dictionary entries for English-to-Czech translations were as accurate or more accurate 85 percent of the time), it could certainly act as a thesaurus or help a system understand the general theme of a foreign text.
There are clear implications to this type of research for Google, which wants to make the vast amount of data it’s collecting (search, web pages, photos, YouTube videos, etc.) searchable and understandable, and is also banking on speech recognition as a major point of distinction for its mobile device business. I think you can see some of this work paying off in the new search algorithms and features Google announced on Thursday. AlchemyAPI Founder and CEO Elliot Turner noted to me recently that the same vector representations Google is using on text could also be used on photos and videos, theoretically categorizing them based on the similarity of their content.
Google isn’t the only company working on new deep learning techniques or applications, either. Companies such as Ersatz and the aforementioned AlchemyAPI are exposing the technology as commercial products, and web companies like Baidu and Microsoft are hard at work on their own research efforts.