Here’s how the NSA analyzes all that call data

The National Security Agency might not have the names of Verizon’s wireless customers, but the agency probably can figure out what they’re up to if it’s so inclined. The metadata Verizon has provided the NSA (phone numbers, numbers called, duration of calls, location) is a veritable treasure trove to an organization with the right analytic skills and the right tools. The NSA has both.

There are numerous methods the NSA could use to extract some insights from what must be a mind-blowing number of phone calls and text messages, but graph analysis is likely the king. As we’ve explained numerous times over the past few months, graph analysis is ideal for identifying connections among pieces of data. It’s what powers social graphs, product recommendations and even some fairly complex medical research.

My LinkedIn social graph

But now it has really come to the fore as a tool for fighting crime (or intruding on civil liberties, however you want to look at it). The NSA is storing all those Verizon records (and, presumably, other carriers’ records) in a massive database system called Accumulo, which it built itself (on top of Hadoop) a few years ago because there weren’t any other options suitable for its scale and its requirements around stability and security. The NSA is currently storing tens of petabytes of data in Accumulo.

For a more thorough description of Accumulo and the NSA infrastructure, read our post “Under the covers of the NSA’s big data effort.”

In graph parlance, vertices (also called nodes) are the individual data points (e.g., phone numbers or social network users) and edges are the connections among them. In late May, the NSA released a slide presentation detailing how fast Accumulo is able to process a 4.4-trillion-node, 70-trillion-edge graph. By way of comparison, the graph behind Facebook’s Graph Search feature contains billions of nodes and trillions of edges. (In the low trillions, from what I understand.)
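To make the vertex-and-edge idea concrete, here is a minimal Python sketch that turns a handful of call records into a graph. The phone numbers, the record layout and the function names are all invented for illustration; nothing here reflects how Accumulo or the NSA actually stores or processes data.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, duration in seconds).
# Every value here is made up for illustration.
CALLS = [
    ("555-0001", "555-0002", 120),
    ("555-0002", "555-0003", 45),
    ("555-0001", "555-0002", 30),
    ("555-0004", "555-0001", 300),
]


def build_call_graph(calls):
    """Return an undirected adjacency map: number -> {neighbor: call count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for caller, callee, _duration in calls:
        # Each phone number is a vertex; each pair that has talked is an edge.
        graph[caller][callee] += 1
        graph[callee][caller] += 1
    return graph


def count_vertices_and_edges(graph):
    """Count distinct numbers and distinct pairs of numbers in contact."""
    vertices = len(graph)
    # Every edge is stored once per endpoint, so halve the total.
    edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return vertices, edges


graph = build_call_graph(CALLS)
print(count_vertices_and_edges(graph))
```

Repeat calls between the same two numbers add weight to an existing edge rather than creating a new one, which is one simple way to represent how strong a connection is.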

So, yes, the NSA is able to easily analyze the call and text-message records of hundreds of millions of mobile subscribers. It’s also building out some massive data center real estate to support all the data it’s collecting.


How might a graph analysis work within the NSA? The easy answer, which the government has acknowledged, is to figure out who else is in contact with suspected terrorists. If there’s a strong connection between you and Public Enemy No. 1, the NSA will find out and get to work figuring out who you are. That could happen via a search warrant or wiretap authorization, or the agency could conceivably figure out who someone likely is by using location data.
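Finding who is in contact with a suspect is, at its simplest, a breadth-first search outward from one vertex. The Python sketch below assumes a tiny, invented contact graph and uses the number of hops as a stand-in for how close a connection is; the labels and the `contacts_within` helper are hypothetical, and the agency’s actual methods are not public.

```python
from collections import deque

# Hypothetical contact graph: each key has been in contact with the
# numbers in its set. Labels are invented for illustration.
CONTACTS = {
    "suspect": {"a", "b"},
    "a": {"suspect", "c"},
    "b": {"suspect"},
    "c": {"a", "d"},
    "d": {"c"},
}


def contacts_within(graph, start, max_hops):
    """Breadth-first search: map each number reachable from `start`
    in at most `max_hops` hops to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    del distance[start]  # the starting number is not its own contact
    return distance


print(contacts_within(CONTACTS, "suspect", 2))
```

With a limit of two hops, the search returns the suspect’s direct contacts plus their contacts, and leaves out anyone farther away. Even at that modest depth the result can grow very quickly on a real call graph.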

Having such a big database of call records also provides the NSA with an easy way to go back and find out information about someone should their number pop up in a future investigation. Assuming the number is somewhere in their index, agents can track it down and get to work figuring out who it’s related to and from where it has been making calls.

Presumably, agents could begin with location data, too. If a bomb went off at Location X, bringing up all the numbers making calls from towers in that area might be a good starting point for investigation. Tracking someone’s movement from location data could be helpful, too.
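A location-first query is mostly a filter over the same metadata. The Python sketch below assumes each record carries a number, a tower identifier and a timestamp; the field names, tower IDs and both helper functions are invented for illustration and are not drawn from any real carrier or NSA schema.

```python
from datetime import datetime

# Hypothetical call records with tower and time. All values are made up.
RECORDS = [
    {"number": "555-0001", "tower": "T1", "time": datetime(2013, 6, 1, 12, 0)},
    {"number": "555-0002", "tower": "T2", "time": datetime(2013, 6, 1, 12, 5)},
    {"number": "555-0001", "tower": "T3", "time": datetime(2013, 6, 1, 14, 0)},
    {"number": "555-0003", "tower": "T1", "time": datetime(2013, 6, 1, 12, 30)},
    {"number": "555-0003", "tower": "T1", "time": datetime(2013, 6, 1, 18, 0)},
]


def numbers_near(records, towers, start, end):
    """Return the sorted numbers that used any of `towers` between
    `start` and `end`, inclusive."""
    return sorted({
        r["number"]
        for r in records
        if r["tower"] in towers and start <= r["time"] <= end
    })


def movement_of(records, number):
    """Return the towers one number used, in chronological order."""
    own = sorted((r for r in records if r["number"] == number),
                 key=lambda r: r["time"])
    return [r["tower"] for r in own]


window_start = datetime(2013, 6, 1, 11, 30)
window_end = datetime(2013, 6, 1, 13, 0)
print(numbers_near(RECORDS, {"T1"}, window_start, window_end))
print(movement_of(RECORDS, "555-0001"))
```

The first function answers "who was calling from around Location X at the time," and the second strings one number’s tower hits together into a rough path.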

If this all sounds a little creepy, maybe it should. After all, the world’s biggest, baddest intelligence agency can pretty much figure out who you are, who you know and where you go. And unlike web and retail companies that collect and analyze so much data about us, the government can put you in jail.

It might be even creepier when you consider how much other data law enforcement agencies can collect about you without a warrant.

However, as someone familiar with NSA policy told me, the good news is that the vast majority of people are still anonymous even in this sea of data: There’s just too much data for the agency to care about any one person until that person pops up in the bad guys’ networks or gets on the agency’s radar.