In a world of billion-dollar web companies and VC-backed startups trying to forever change human interaction via software, IBM tends to look a little staid. But don’t let its deliberate pace, legacy-software-mongering ways and suited executives fool you. If you pull back the covers, you’ll find Big Blue investing resources as only it can to solve its customers’ most pressing problems. Right now, IBM has big data in its crosshairs, and its ambitions go far beyond recent moves to derail Oracle’s momentum in the online transaction processing (OLTP) space.
If you’ve seen IBM’s “Smarter Planet” commercial about business analytics and busy intersections, then you’ve seen Jeff Jonas. He joined IBM in 2005 when it bought Systems Research & Development (SRD), which Jonas founded, and has stayed on as a distinguished engineer and chief scientist of IBM’s Entity Analytics Group.
The focal point of the Entity Analytics Group is InfoSphere Identity Insight (Jonas’s brainchild), which helps organizations combat fraud and other threats by drawing connections between various types of identifying data. The SRD purchase was part of the $12 billion IBM has spent on analytics R&D and acquisitions over the past five years, and the investment is paying off in the form of software that helps organizations make real-time, sub-second decisions about data as it arrives.
Historically, the problem has been that businesses suffer from what Jonas calls “enterprise amnesia” — they have too much data and there is too big a disconnect among the various departments where it’s housed. As every new piece of information arrives, the gap widens between what the organization has and what it actually knows.
This causes problems, like the large retail organization that found out, too late, that two out of every 1,000 of its new hires had already been arrested for stealing from that very store, or the bank that called Jonas every day to solicit his business after he had already refinanced his mortgage through it.
“Every time I get duplicate mail pieces, I think about the kind of company it would be that would not be able to notice that,” bemoans Jonas.
Who’s Who? Connecting the Dots with Data
In Jonas’s world, more data should mean faster, more accurate decisions, not disorganization and confusion. Two points of contact with a direct-mailing company shouldn’t generate two marketing circulars; they should generate one better-targeted one. He analogizes this phenomenon to a puzzle, where people can place the last few pieces as fast as they can the first few because the picture becomes clearer as the pieces converge. Like a puzzle, data becomes much easier to solve when it’s viewed in context, and that’s what InfoSphere Identity Insight does — it discovers, analyzes and indexes every piece of data about people or organizations as they enter an organization, and it rapidly draws connections among them as each new piece enters.
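Jonas’s puzzle analogy corresponds to what the industry broadly calls entity resolution. A minimal sketch of the idea, assuming exact matches on shared identifier values (the record fields and the matching rule here are hypothetical illustrations, not IBM’s implementation):

```python
# Minimal entity-resolution sketch: merge incoming records into known
# entities whenever they share an identifying value (a phone number,
# an email address, etc.). Fields and the exact-match rule are
# illustrative stand-ins, not IBM's actual algorithm.

def resolve(records):
    entities = []  # each entity: dict of attribute -> set of seen values
    for record in records:
        # Find every existing entity that shares a value with this record.
        matches = [e for e in entities
                   if any(v in e.get(k, set()) for k, v in record.items())]
        # Merge all matches plus the new record into a single entity,
        # so one record can glue previously separate entities together.
        combined = {}
        for part in matches + [{k: {v} for k, v in record.items()}]:
            for k, vals in part.items():
                combined.setdefault(k, set()).update(vals)
        entities = [e for e in entities
                    if all(e is not m for m in matches)] + [combined]
    return entities

contacts = [
    {"name": "J. Jonas", "phone": "555-0100"},
    {"name": "Jeff Jonas", "email": "jj@example.com"},
    {"name": "Jeff Jonas", "phone": "555-0100"},  # bridges the first two
]
merged = resolve(contacts)  # all three records collapse into one entity
```

Production systems like Identity Insight go far beyond this toy: they weigh fuzzy matches, tolerate typos and aliases, and can un-merge entities when new evidence contradicts an earlier link.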
During Hurricane Katrina, for example, the software helped de-duplicate lists of missing persons. The top 15 sites listing missing persons contained 1.5 million names, and IBM worked with those sites to whittle the list down to 36,000 unique names, resulting in the reunification of more than 100 loved ones. Alameda County Social Services in California uses Identity Insight to help reduce fraudulent claims and determine which services are due to whom by connecting disparate data from across departments. According to Jonas, the software has cut times for certain analytical jobs from months to minutes. IBM has detailed the implementation in a video presentation.
As GigaOM’s Stacey Higginbotham recently reported, IBM also is busy on the predictive analytics front. For example, law-enforcement agencies across the globe are using IBM’s SPSS software – another product of its $12 billion analytics investment – to predict the likelihood that criminals and juvenile offenders will reoffend upon release. As Stacey writes, “[t]he software can look at far more data inputs and potentially handle more juvenile offenders faster than the older methods, and presumably the ability to incorporate more data points could lead to better results.”
Whether organizations are concerned with identity analytics or predictive analytics, they need advanced analytics software because data volumes are proliferating so rapidly, and in such varied forms, that human users cannot be expected to keep track of everything, or even to know what questions to ask of the data. “We actually have to get to where the data finds the data, and the things that are relevant find the people,” Jonas explained.
Paired with products like IBM’s InfoSphere Streams, Identity Insight can draw on data from pretty much any source, be it foreign-language text, sensor data or video (closed captions, for example, can be full of information). Streams analyzes incoming data, determines what’s relevant for a particular task and acts on it accordingly (e.g., sending it to InfoSphere Identity Insight to compare against existing identity data).
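That filter-and-route behavior can be pictured as a simple dispatch loop: each arriving event is classified and handed to a downstream consumer. This is only a toy illustration; the event shape and handler names are invented for the example, not InfoSphere APIs:

```python
# Toy stream-routing sketch: classify each arriving event and dispatch
# it to a downstream handler. The event fields and handler registry are
# illustrative stand-ins, not IBM product APIs.

def route(events, handlers, default=None):
    dispatched = []
    for event in events:
        handler = handlers.get(event.get("type"), default)
        if handler is not None:
            dispatched.append(handler(event))
    return dispatched

def to_identity_engine(event):
    # Stand-in for forwarding identity-bearing events for matching.
    return ("identity", event["payload"])

def to_archive(event):
    # Everything else is parked for later batch analysis.
    return ("archive", event["payload"])

stream = [
    {"type": "identity", "payload": "passport #123"},
    {"type": "sensor", "payload": "temp=21C"},
    {"type": "identity", "payload": "phone 555-0100"},
]
results = route(stream, {"identity": to_identity_engine}, default=to_archive)
```

The real product makes the classification step itself analytical — deciding relevance per task rather than by a fixed type field — but the routing shape is the same.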
An example of InfoSphere Streams at work is a recently announced case study coming out of Stockholm, Sweden. There, researchers are analyzing data from GPS devices in the city’s taxicabs – with many more sources planned – to determine real-time traffic flows in order to help citizens optimize their commutes.
The Hadoop Connection

From book stacks to BigSheets, the British Library is full of info. Source: Colin St John Wilson via Flickr
However, IBM’s big-data vision isn’t all about figuring out who’s who in real time or determining whether an individual criminal will find himself in jail yet again. As Jonas puts it, “To make a smart system smart, you have to actually learn your past.” This is where batch-processing via Hadoop comes in. It lays out the puzzle pieces a company has seen previously, helping the real-time analytic software figure out where the new pieces fit.
Presently, IBM’s most notable Hadoop implementation is its BigSheets “insight engine,” which resides in the company’s Emerging Technologies division. BigSheets is a composite of many tools – including Hadoop, Nutch, Pig, InfoSphere and IBM’s ManyEyes visualization technology – that division CTO David Boloker says exists to help business users extract as much value as possible from the mountains of data at their disposal. The results can be presented as any of a variety of charts or maps, but Boloker believes tag clouds might be the most appealing option, especially for users who want to discover trends among their data.
Because of its business-user target, BigSheets’ initial interface is a spreadsheet (work is underway to figure out the best UI), but Boloker is quick to point out that BigSheets is not “Excel for big data.” Rather, it is a veneer over Hadoop and the other components, one that churns through potentially petabytes of data (current engagements top out at a few hundred terabytes) and extracts and presents mere megabytes of relevant data.
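That “petabytes in, megabytes out” pattern is, at heart, a distributed filter-and-aggregate. Here is a single-machine analogue with a made-up record format; BigSheets would run this kind of job as Hadoop work spread across a cluster:

```python
# Single-machine analogue of a filter-and-aggregate job: scan many raw
# records, keep only the relevant ones, and reduce them to a tiny
# summary. The record format is invented for illustration; in BigSheets
# this pattern runs as Hadoop jobs across a cluster.
from collections import Counter

def summarize(records, keyword):
    counts = Counter()
    for line in records:                 # "map" phase: scan and filter
        if keyword in line:
            for word in line.split():
                counts[word.lower()] += 1
    return counts.most_common(3)         # "reduce" phase: small summary out

archive = [
    "patent 1: storage system cited widely",
    "patent 2: analytics engine",
    "patent 3: storage controller cited often",
]
top = summarize(archive, "storage")  # megabytes (here, bytes) of output
```

The user-facing point Boloker makes holds in the sketch too: whoever calls `summarize` never sees how many machines did the scanning.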
Users can analyze, compare and visualize to their hearts’ desires, Boloker explained, but “[they] don’t know if it’s running on two nodes or a hundred nodes.” He says BigSheets isn’t so much a Hadoop product as much as it is a product that uses Hadoop (like IBM’s WebSphere application platform uses the Apache web server).
Other commercial Hadoop products, like those sold by Cloudera and Karmasphere, simplify the task of managing Hadoop environments and running Hadoop jobs, but they are wholly Hadoop products. Even Datameer’s new Datameer Analytics Solution – which combines spreadsheet functionality with Hadoop processing, is garnering early rave reviews and might well be “Excel for big data” – simply (if you can call it that) hides Hadoop’s complexity behind the spreadsheet.
The fact that these startups have received almost $20 million in combined funding speaks to how powerful Hadoop is on its own, especially when the learning curve is reduced drastically. However, IBM’s goal with BigSheets is a soup-to-nuts solution that packages, and abstracts, everything its customers might need to handle their big-data analysis projects.
The only publicly announced BigSheets customer thus far is the British Library, which is using it to analyze huge amounts of archived web-site data, but Boloker said IBM was working with five other organizations as of early March. Among them: a pharmaceutical company analyzing latent issues (e.g., side effects) during the human-testing phase of new-drug development; a legal content creator comparing its stock of documents to what’s available on the web; and a retail organization expecting to top the petabyte barrier in transaction data. Boloker also said IBM is “chatting” with financial customers interested in mining the web for data on potential acquisition targets, and then merging that data with what they already have in-house.
What the Future Holds
Of course, if the history of computing has taught us anything, it’s that important technologies don’t stay static for long. Given the widely accepted belief that, going forward, the most successful organizations will be those that best draw insights from their data, we shouldn’t expect analytics tools – from IBM or elsewhere – to remain in their current states for long. And even if IBM isn’t alone in blazing the trail toward the next generation of analytics, it looks like it certainly will be helping advance the cause.
As he expects will happen with most IT tools, Boloker envisions a future in which IBM could deliver BigSheets as a cloud service via a web interface. The most obvious use case would be customers using it to mine web data or their own data sets, but Boloker thinks customers also could use BigSheets to leverage data that IBM has collected for its own internal uses over the years.
For example, Big Blue recently used BigSheets to analyze all the patents issued over the past 10 years to find out which were the most cited and which were of the highest value. A customer performing due diligence before an acquisition might want to query IBM’s reduced patent data to determine what high-value patents its target owns and who is citing them. As with on-premises implementations, though, customers accessing BigSheets via the cloud need not have any idea that Hadoop or the other components are handling the legwork of their queries.
Jonas’s vision is both fascinating and a little disturbing. He sees geospatial data leading the charge toward optimal consumer experiences, such as perfectly timed mobile ads and personalized traffic optimization. Our cellular providers already have the data to determine where we are and where we have been, he explains, and with the right analytics they could figure out where we’re headed and send traffic tips based on current road conditions.
Advertisements and deal offers could reach us at ideal times, like when our contracts are expiring and we’re heading toward a competitor’s store. Jonas acknowledges the privacy concerns inherent in such a service, but he believes people will line up for it. Taking it a step further, he predicts that a surveillance society is inevitable — and that we will love it because people are eager to optimize their lives.
“It’s seemingly irresistible to us,” he says.
