Editor’s note: Guest post by Chris Mohritz, an AI entrepreneur and AI technologist. He is leading the Using AI Workshop at Gigaom AI.
As human beings, most of us experience the world through our eyes.
We are experts at instantly interpreting visual information and turning it into concepts and information that we can understand and share.
And historically, capturing useful information from images in a business context has required human vision — which, unfortunately, can be slow and costly.
But what if we could extract and process that useful information through computers — what kind of insights and opportunities would that open up for your business?
Well, recent advancements in artificial intelligence have made computer vision possible. And giving computers the power of sight is already having a profound effect on our lives, work, and society as a whole.
The only question left is: How will your business benefit from computer vision?
- Want to allow your customers to search for your products visually?
- Need to organize and understand what’s in a pile of random images?
- Need to get a view into what your customers are posting about on social media?
- Need to capture the data in a pile of handwritten notes?
- Need to locate specific objects within a stream of customer photos?
- Need to keep an eye out for your logo?
- Need to understand the emotions of people in your store or using your app?
- Want to build an app that can see the world like people do?
- Have a product that needs to drive itself?
Those are just a few of the many different options now open to you.
And like everything A.I. — it’s happening very quickly. Computers are already better than people at identifying certain types of objects.
So let’s jump right in…
Unlocking the Visual World
To liberate visual data, we need an intelligent visual recognition service that automatically analyzes and identifies objects and scenes in image files (and video).
For this guide, we’ve chosen Google’s Cloud Vision API. So we’ll walk through interacting with the API using a couple of simple scripts (originally developed by Google Cloud). We can use these scripts to find faces, logos, landmarks and more.
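The scripts handle the plumbing, but it helps to know what they're sending. Each call ultimately boils down to a POST against the API's v1 images:annotate REST endpoint with a JSON body like the one sketched below; the helper function itself is hypothetical, not part of the repo.

```python
import base64

def build_annotate_request(image_bytes, feature_type="LABEL_DETECTION", max_results=10):
    """Build the JSON body for a POST to
    https://vision.googleapis.com/v1/images:annotate."""
    return {
        "requests": [{
            # The image is sent inline, base64-encoded
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            # One or more detection features can be requested per image
            "features": [{"type": feature_type, "maxResults": max_results}],
        }]
    }

body = build_annotate_request(b"<raw jpeg bytes>", feature_type="FACE_DETECTION")
print(body["requests"][0]["features"])  # [{'type': 'FACE_DETECTION', 'maxResults': 10}]
```

The same body shape works for every feature in this guide: just swap the feature type (LABEL_DETECTION, FACE_DETECTION, LANDMARK_DETECTION, TEXT_DETECTION, LOGO_DETECTION).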
Looking for an on ramp?
This is a how-to guide intended for developers or tech-savvy business leaders looking for a proven entry point into A.I.-powered business systems.
But before we dive into the scripts, let’s take a quick look at a demo app…
A Fun Preview
At Google’s Next 2016 conference, Google demoed a slick little application that puts all of the Cloud Vision API features into one simple interface.
It’s called Cloud Vision Explorer — here’s a live demo (requires Chrome v50 or later).
A description of how it works is here. And in typical Google fashion, it’s open source. Source code and instructions to create your own are here.
Inspired? Then, let’s jump right in.
- What You’ll Need
- Detecting Faces
- Detecting Labels
- Detecting Landmarks
- Detecting Text
- Detecting Logos
Accessing and using the API is very easy; it will only take you a few minutes to run through these examples.
What You’ll Need
Before we start pushing images up to the Cloud Vision API, let’s get the initial requirements knocked out.
Download the source repository.
To start, let’s pull down the source files. (You’ll need a git client installed on your computer for this step.)
Move to the directory you want to use for this demo and run the following commands in a terminal…
# Download source repository
git clone https://github.com/10xNation/google-image-recognition.git
cd google-image-recognition
There are only a few files, so don’t blink.
Configure Python.
Next, you’ll need Python installed, plus the pip and virtualenv utilities if you don’t already have them.
Then we need to create an environment to run the scripts in (they’re compatible with Python 2.7 and 3.4+); you can do that by running the following commands in a terminal…
# Configure a virtual environment to work in
virtualenv env
# Activate virtual environment
source env/bin/activate
And finally, we need to install the needed dependencies by running the following command (also in a terminal)…
# Install required packages
pip install -r requirements.txt
Create a Google Cloud account.
Go to the Google Cloud home page.
If you don’t already have a Google Cloud account, go ahead and create one by clicking on the “Try it free” button and completing the registration process.
Install the Cloud SDK.
All of the steps in this guide go through command line, so you’ll need to install and initialize the Cloud SDK.
And that’s pretty much it; we’re ready to start interacting with the API. So let’s jump into the fun stuff.
Detecting Faces
The API’s face detection feature can detect multiple faces within a single image, along with facial attributes that can convey each person’s emotional state and whether they’re wearing headwear.
Note: Facial recognition — determining who the face is — is currently not supported by the Cloud Vision API.
For face detection, we have a special script that can annotate where the API picks up the faces. To use it, simply run the following command with your chosen image…
# Upload an image to the API
python label-faces.py your_image.jpg
Obscure shots.
The API seems to do a good job of finding faces that are less obvious.
The API is nearly 100% confident that this boy is joyful. Here’s an (abbreviated) response:
{
"faceAnnotations": [
{
"detectionConfidence": 0.99999869,
"joyLikelihood": "VERY_LIKELY",
"angerLikelihood": "VERY_UNLIKELY",
"surpriseLikelihood": "VERY_UNLIKELY",
}
]
}
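Responses like this are easy to post-process. As an illustration, here's a small helper (hypothetical, not part of the repo) that collapses the likelihood strings into a single dominant emotion, using the likelihood scale the API documents:

```python
# Likelihood values from weakest to strongest, per the API reference
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]

def dominant_emotion(face):
    """Return the strongest emotion in a faceAnnotation, or None if all are weak."""
    emotions = {}
    for key, value in face.items():
        if key.endswith("Likelihood") and value in LIKELIHOOD_ORDER:
            emotions[key[:-len("Likelihood")]] = LIKELIHOOD_ORDER.index(value)
    if not emotions:
        return None
    name, rank = max(emotions.items(), key=lambda item: item[1])
    # Only report an emotion the API rates at least "POSSIBLE"
    return name if rank >= LIKELIHOOD_ORDER.index("POSSIBLE") else None

face = {
    "detectionConfidence": 0.99999869,
    "joyLikelihood": "VERY_LIKELY",
    "angerLikelihood": "VERY_UNLIKELY",
    "surpriseLikelihood": "VERY_UNLIKELY",
}
print(dominant_emotion(face))  # joy
```

On a neutral face, where every likelihood comes back UNLIKELY or lower, the helper returns None instead of guessing.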
Differentiating.
Of course, telling different expressions apart is just as important.
The API is also nearly 100% confident that this woman isn’t showing much emotion at all. Here’s an (abbreviated) response:
{
"faceAnnotations": [
{
"detectionConfidence": 0.9999327,
"joyLikelihood": "UNLIKELY",
"angerLikelihood": "VERY_UNLIKELY",
"surpriseLikelihood": "VERY_UNLIKELY",
}
]
}
In addition, the API picks up the dog in this picture using the labels feature below; it’s 93.9% confident. Here’s the response:
{
"labelAnnotations": [
{
"description": "dog",
"score": 0.93934375,
"mid": "/m/0bt9lr"
},
{
"description": "nose",
"score": 0.9241457,
"mid": "/m/0k0pj"
},
{
"description": "mammal",
"score": 0.9114142,
"mid": "/m/04rky"
},
{
"description": "close up",
"score": 0.779961,
"mid": "/m/02cqfm"
},
{
"description": "head",
"score": 0.74782485,
"mid": "/m/04hgtk"
},
{
"description": "skin",
"score": 0.7348932,
"mid": "/m/06z04"
},
{
"description": "mouth",
"score": 0.69471675,
"mid": "/m/0283dt1"
},
{
"description": "organ",
"score": 0.6678281,
"mid": "/m/013y0j"
},
{
"description": "dog like mammal",
"score": 0.6455104,
"mid": "/m/01z5f"
},
{
"description": "eye",
"score": 0.59434783,
"mid": "/m/014sv8"
}
]
}
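With ten labels at widely varying scores, it's worth filtering on confidence before acting on the results. A minimal sketch; the 0.75 cutoff is an arbitrary choice for illustration, not an API default:

```python
def confident_labels(response, threshold=0.75):
    """Return (description, score) pairs at or above the threshold, best first."""
    labels = [
        (item["description"], item["score"])
        for item in response.get("labelAnnotations", [])
        if item["score"] >= threshold
    ]
    return sorted(labels, key=lambda pair: pair[1], reverse=True)

response = {"labelAnnotations": [
    {"description": "dog", "score": 0.93934375, "mid": "/m/0bt9lr"},
    {"description": "close up", "score": 0.779961, "mid": "/m/02cqfm"},
    {"description": "eye", "score": 0.59434783, "mid": "/m/014sv8"},
]}
print(confident_labels(response))  # [('dog', 0.93934375), ('close up', 0.779961)]
```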
Limitations.
Couldn’t pick up the face in this image:
False positives.
And it didn’t falsely report a face in either of these images:
Detecting Labels
The API’s label detection feature can detect a wide range of different objects within an image, ranging from vehicles and other modes of transportation to various types of animals.
And you run the label detection analysis using the following command…
# Upload an image to the API
python detect.py labels your_image.jpg
Toys.
Successfully identified this image:
The API is 96.7% confident that it’s a Rubik’s Cube. Here’s the response:
{
"labelAnnotations": [
{
"score": 0.96748275,
"mid": "/m/0h0b5",
"description": "rubik's cube"
}
]
}
Note: The mid item in the response is a unique identifier for the object in Google’s Knowledge Graph, which you can use to gather even more information about the object.
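For instance, a mid can be looked up through the separate Knowledge Graph Search API at kgsearch.googleapis.com. The sketch below only constructs the request URL (no network call is made); YOUR_API_KEY is a placeholder for a key you'd create in the Cloud console:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def knowledge_graph_url(mid, api_key, limit=1):
    """Build a Knowledge Graph Search API lookup URL for a given mid."""
    params = urlencode([("ids", mid), ("key", api_key), ("limit", limit)])
    return KG_ENDPOINT + "?" + params

print(knowledge_graph_url("/m/0h0b5", "YOUR_API_KEY"))
# https://kgsearch.googleapis.com/v1/entities:search?ids=%2Fm%2F0h0b5&key=YOUR_API_KEY&limit=1
```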
Animals.
It also successfully identified this image:
The API is 98.4% confident that it’s a hippopotamus. Here’s the response:
{
"labelAnnotations": [
{
"score": 0.98434836,
"mid": "/m/09f20",
"description": "hippopotamus"
}
]
}
But the system incorrectly identified this image:
It called this one a “habitat” at 95.9% confidence. Here’s the response:
{
"labelAnnotations": [
{
"score": 0.95886225,
"mid": "/m/05fblh",
"description": "habitat"
}
]
}
Vehicles.
This image was successfully identified:
The API is 99.2% confident that it’s a car — which is impressive considering it’s just a close-up of the grille. Here’s the response:
{
"labelAnnotations": [
{
"score": 0.9921955,
"mid": "/m/0k4j",
"description": "car"
}
]
}
Interestingly, the API doesn’t pick up the logo in this image using the logo detection below.
Detecting Landmarks
The API’s landmark detection feature can pick up many popular structures — natural or man-made.
It will return the landmark’s name, geographic coordinates (latitude/longitude), and the region of the image where the landmark appears.
And you run the landmark detection analysis using the following command…
# Upload an image to the API
python detect.py landmarks your_image.jpg
This image was successfully identified:
The API is 92.6% confident that it’s the Statue of Liberty — but then again, it was also 88% confident that it’s the Brooklyn Bridge. Here’s the response:
{
"landmarkAnnotations": [
{
"description": "Statue of Liberty",
"score": 0.92645186,
"mid": "/m/072p8",
},
{
"description": "Brooklyn Bridge",
"score": 0.87944186,
"mid": "/m/0cv4c",
},
{
"description": "Times Square",
"score": 0.8787856,
"mid": "/m/07qdr",
}
]
}
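When several landmarks come back with close scores, as here, you'll usually just want the highest-scoring one. A hypothetical helper (not part of the repo):

```python
def best_landmark(response):
    """Return (description, score) for the highest-scoring landmark, or None."""
    annotations = response.get("landmarkAnnotations", [])
    if not annotations:
        return None
    top = max(annotations, key=lambda item: item["score"])
    return top["description"], top["score"]

response = {"landmarkAnnotations": [
    {"description": "Statue of Liberty", "score": 0.92645186, "mid": "/m/072p8"},
    {"description": "Brooklyn Bridge", "score": 0.87944186, "mid": "/m/0cv4c"},
    {"description": "Times Square", "score": 0.8787856, "mid": "/m/07qdr"},
]}
print(best_landmark(response))  # ('Statue of Liberty', 0.92645186)
```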
Detecting Text
The API’s text detection feature runs an optical character recognition (OCR) algorithm on the image, allowing it to detect and extract text within your images. It supports a wide range of languages and even determines the language automatically.
And you run the text detection analysis using the following command…
# Upload an image to the API
python detect.py text your_image.jpg
This image was successfully identified:
The system found the word UNDERGROUND (in English). Here’s the response:
{
"textAnnotations": [
{
"locale": "en",
"description": "UNDERGROUND"
}
]
}
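In the full (non-abbreviated) response, the first textAnnotations entry carries the complete extracted text plus the detected language (the field is named locale in the REST reference). A small helper to pull both out, again hypothetical:

```python
def extracted_text(response):
    """Return (text, locale) from a text detection response, or ("", None)."""
    annotations = response.get("textAnnotations", [])
    if not annotations:
        return "", None
    first = annotations[0]
    return first.get("description", ""), first.get("locale")

response = {"textAnnotations": [{"locale": "en", "description": "UNDERGROUND"}]}
print(extracted_text(response))  # ('UNDERGROUND', 'en')
```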
Detecting Logos
The API’s logo detection feature identifies popular brand logos within your images.
And you run the logo detection analysis using the following command…
# Upload an image to the API
python detect.py logos your_image.jpg
This image was successfully identified (sort of):
The API is only 16.1% confident that it’s a BMW Motorrad logo, which is interesting. If it had responded with just BMW, all would have been well. But the system called it a BMW motorcycle, and it is in fact a car (with a motorcycle engine). Here’s the response:
{
"logoAnnotations": [
{
"score": 0.16129597,
"mid": "/m/0466q6s",
"description": "BMW Motorrad"
}
]
}
The system nailed this one — although it was a bit skittish about it:
The API is 33.8% confident that it’s a (Transformers) Autobot. Here’s the response:
{
"logoAnnotations": [
{
"score": 0.33826414,
"mid": "/m/036zwh",
"description": "Autobot"
}
]
}
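Given how low logo scores can run (16% and 34% in the two examples above), it's worth deciding explicitly what counts as a match. A sketch, assuming the response key is logoAnnotations per the REST reference; the 0.15 floor is an arbitrary choice for illustration:

```python
def logo_matches(response, threshold=0.15):
    """Return (description, score) pairs at or above the threshold, best first."""
    logos = [
        (item["description"], item["score"])
        for item in response.get("logoAnnotations", [])
        if item["score"] >= threshold
    ]
    return sorted(logos, key=lambda pair: pair[1], reverse=True)

response = {"logoAnnotations": [
    {"description": "Autobot", "score": 0.33826414, "mid": "/m/036zwh"},
]}
print(logo_matches(response))  # [('Autobot', 0.33826414)]
```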
Take it to the Next Level
This is just the tip of the iceberg. The real power comes when you apply these smart capabilities to solve real-world challenges and enhance our world.
How will you build on these scripts? How can you enhance your existing products and applications — or create completely new ones?
You can dig deeper into the Cloud Vision API — including additional tutorials — in the Google Cloud developer documentation.
Enjoy!
Chris Mohritz is leading the Using AI Workshop in San Francisco, on February 14th. Join us! Or come to Gigaom AI on February 15-16.