Why Google is the big data company that matters most

Every now and then, someone asks, “Who’ll be the Google of big data?” The only acceptable answer, it seems, is that Google is the Google of big data. Yeah, it’s a web company on the surface, but Google has been at the forefront of using data to build compelling products for more than a decade, and it’s not showing any signs of slowing down.

Search, advertising, Translate, Play Music, Goggles, Trends and the list goes on — they’re all products that couldn’t exist without lots of data. But data alone doesn’t make products great — they also need to perform fast and reliably, and they eventually need to get more intelligent. Infrastructure and systems engineering make that possible, and that’s where Google really shines.

On Wednesday, the company showed off its chops once again, explaining in a blog post how it lets users better search their photos by training some novel models on systems built for just that purpose. Here’s how Google describes the chain of events, after it had found the methods it wanted to test (from the winning team at the ImageNet competition):

“We built and trained models similar to those from the winning team using software infrastructure for training large-scale neural networks developed at Google in a group started by Jeff Dean and Andrew Ng. When we evaluated these models, we were impressed; on our test set we saw double the average precision when compared to other approaches we had tried. …

“Why the success now? … What is different is that both computers and algorithms have improved significantly. First, bigger and faster computers have made it feasible to train larger neural networks with much larger data. Ten years ago, running neural networks of this complexity would have been a momentous task even on a single image — now we are able to run them on billions of images. Second, new training techniques have made it possible to train the large deep neural networks necessary for successful image recognition.”

Of course Google had a system in place for training large-scale neural networks. And of course Jeff Dean helped design it.

Google’s system can recognize flowers even when they’re not in the focal point.

For me, Dean is among the highlights of our upcoming Structure conference (June 19 and 20 in San Francisco). I’m going to sit down with him in a fireside chat and talk about all the cool systems Google has built thus far and what’s coming down the pike next. Maybe we’ll even talk about what life is like being the Chuck Norris of the internet.

From an engineering standpoint, Dean has been one of the most important people in the short history of the web. He helped create MapReduce, the parallel processing engine Google used to build its search index, and was the lead author on the MapReduce paper that directly inspired the creation of Hadoop. Dean has also played significant roles in creating other important Google systems, such as its Bigtable distributed data store (which inspired NoSQL databases such as Cassandra, HBase and the National Security Agency’s Accumulo) and a globally distributed transactional database called Spanner.

If you’re into big data or webscale systems, knowing what Dean is working on can be like looking into a crystal ball. When I asked Hadoop creator Doug Cutting what the future holds for Hadoop, he told me to look at Google.

“They send us messages through these technical papers,” Cutting said, “so we can see what’s coming.”