How Google is teaching computers to see

Google’s Hangouts recognizes people, even if they appear only in a photo.

Google is attempting to teach computers to recognize human faces without telling the algorithms which faces are human. It’s a machine-learning problem made for this era of unstructured data and easy access to large compute clusters. It could help the search giant make huge strides toward the next big opportunity in tech: enabling computers to “see.”

A Google research paper prepared for the upcoming International Conference on Machine Learning (ICML) explains how Google trained a network on an image repository using 1,000 machines with 16,000 cores, and how that network then reached 15.8 percent accuracy at recognizing objects across 20,000 categories.

Without being told what to look for, the network learned to detect human faces, cats and body parts, elements that emerged because they were so common in the YouTube stills used to build the image database that trained the algorithm. The accuracy may not seem impressive to us (roughly 4 correct answers out of every 25 images), but it’s a 70 percent relative improvement over previous efforts.
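
The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 70 percent figure is a relative improvement over the earlier result; the baseline it derives is an inference from those two numbers, not a value stated in this article.

```python
# Figures quoted in the article.
accuracy = 0.158              # 15.8 percent
relative_improvement = 0.70   # assumption: 70 percent better than the earlier result

# If new = old * (1 + improvement), then old = new / (1 + improvement).
implied_baseline = accuracy / (1 + relative_improvement)

# Expected number of correct answers in a batch of 25.
correct_per_25 = accuracy * 25

print(f"Implied earlier accuracy: {implied_baseline:.1%}")
print(f"Correct out of 25: {correct_per_25:.2f}")
```

On these assumptions the earlier result would have been a little over 9 percent, and 15.8 percent works out to about 4 in 25.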

Google’s Hangouts knows pictures of people from pictures of dogs.

The net result is that Google can take millions of images, clean them up, and then learn to group similar images into categories such as “faces” or “cats.” State-of-the-art systems have been able to do this for a while, but they required tagged images as well as a long learning period. Google’s experiment used unlabeled images and threw a lot of computing at the learning process, cutting the time it took to train the algorithm from weeks to just three days.
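
Grouping unlabeled images by similarity, as described above, is a form of unsupervised learning. The sketch below illustrates the general idea with plain k-means clustering on made-up two-number "features." This is a swapped-in textbook technique, not the method in Google's paper, and the function names and data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Group unlabeled points into k clusters and return one label per point."""
    # Naive initialization: use the first k points as starting centers.
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(groups):
            if members:
                centers[i] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return [nearest_center(p, centers) for p in points]


# Hypothetical stand-ins for image features; no labels are supplied.
features = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8),
            (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
labels = kmeans(features, k=2)
print(labels)
```

The algorithm is never told which points belong together; the two groups emerge only because similar points sit close to one another.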

Google gets somewhat profound in its paper, noting that if machines can learn like this, perhaps it’s also how humans learn:

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the “grandmother neuron” could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.

Understanding the origins of language and how people learn to classify objects is something people are still trying to work out, so Google may be onto something that philosophers and anthropologists can debate in alcohol-fueled conversations at university cafes (or machine-learning conferences). But from a practical perspective, throwing a lot of compute at unstructured data to give computers the ability to see could be a gateway for Google to build a platform for the next big thing in tech.

Computers that see and computers that learn

In devices like the Microsoft Kinect, Google Glasses and other gadgets that use gesture recognition, getting computers to see the world is as complex as getting humans to do so. It’s possible to teach computers to recognize gestures that are preprogrammed into their software. Even some touch-based systems actually use cameras to see and interpret different gestures, but getting a computer to truly “see” is far more complicated.

People see using their eyes and their brains. Our eyes are sensors detecting gradations in light, dark, color, etc. That information is conveyed to the brain, where it is interpreted. The brain plays all sorts of tricks with the actual world, though. It fills in blanks, ignores the mundane and can be tricked via optical illusions.

Computers have cameras and a variety of sensors that can act as eyes, but the brain part is a challenge. To train a computer to “see,” programmers have to tell machines how to behave in any given scenario or gesture combination. Google has shown two things: that training time can be cut by throwing a ton of computers at the problem, and that image recognition needs less hand-specified detail, because computers can learn to recognize images if they see enough of them and have enough processing power.

The Google researchers note that their neural network is one of the largest known. It has 1 billion trainable parameters, more than an order of magnitude larger than other large networks reported in the literature, which have as many as 10 million parameters. But even the Google network pales in comparison to the human visual cortex, which is a million times larger in terms of the number of neurons and synapses.
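
The scale comparison in the paragraph above reduces to a couple of ratios. A minimal sketch using only the figures quoted there; treating the cortex comparison as a single multiplicative factor is a simplifying assumption made for illustration.

```python
import math

google_params = 1_000_000_000   # 1 billion trainable parameters (from the article)
earlier_params = 10_000_000     # 10 million parameters in earlier large networks
cortex_factor = 1_000_000       # cortex described as a million times larger

# How many times larger is the Google network than the earlier ones?
ratio = google_params // earlier_params

# Express both gaps as orders of magnitude (powers of ten).
orders_vs_earlier = math.log10(ratio)
orders_vs_cortex = math.log10(cortex_factor)

print(f"Google network vs. earlier networks: {ratio}x "
      f"({orders_vs_earlier:.0f} orders of magnitude)")
print(f"Visual cortex vs. Google network: "
      f"{orders_vs_cortex:.0f} orders of magnitude")
```

On these figures, the gap to earlier networks is a factor of 100, while the gap to the visual cortex is far wider still.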

But teaching computers to see is huge (especially if, along the way, you teach computers how to learn). Imagine if your smartphone could “see” an object and then classify it. It could then access the rich trove of data it has about that object, be it a building, a piece of art or a meal, and deliver information to you or an app. Right now we have to enter that information ourselves, in many cases on a mobile device’s tiny keypad, or snap a picture and rely on a much less robust database of learned visual elements, as Google Goggles does. Clearly, there are also privacy concerns, as governments could put the networked compute of a thousand Googles behind their surveillance efforts.

At home, there are entertainment benefits, as Kinect-like devices could see and interpret a user’s actions, not just in the confines of a game but also in the free-flowing world of everyday television. For example, allowing a child to interact via a Kinect with Sesame Street is something people at Microsoft are trying to develop. On a smartphone or in the home, it’s pretty heady stuff.