Under the covers of the NSA’s big data effort

The NSA’s data collection practices have much of America — and certainly the tech community — on edge, but sources familiar with the agency’s technology are saying the situation isn’t as bad as it seems. Yes, the agency has a lot of data and can do some powerful analysis, but, the argument goes, there are strict limits in place around how the agency can use it and who has access. Whether that’s good enough is still an open debate, but here’s what we know about the technology that’s underpinning all that data.

What is Accumulo?

The technological linchpin to everything the NSA is doing from a data-analysis perspective is Accumulo — an open-source database the agency built in order to store and analyze huge amounts of data. Adam Fuchs knows Accumulo well because he helped build it during a nine-year stint with the NSA; he’s now co-founder and CTO of a company called Sqrrl that sells a commercial version of the database system. I spoke with him earlier this week, days before news broke of the NSA collecting data from Verizon and the country’s largest web companies.


The NSA began building Accumulo in late 2007, Fuchs said, because they were trying to do automated analysis for tracking and discovering new terrorism suspects. “We had a set of applications that we wanted to develop and we were looking for the right infrastructure to build them on,” he said.

The problem was those technologies weren’t available. He liked what projects like HBase were doing by using Hadoop to mimic Google’s famous BigTable data store, but it still didn’t meet the NSA’s requirements around scalability, reliability or security. So, they began work on a project called CloudBase, which eventually was renamed Accumulo.

Now, Fuchs said, “It’s operating at thousands-of-nodes scale” within the NSA’s data centers. There are multiple instances, each storing tens of petabytes (1 petabyte equals 1,000 terabytes or 1 million gigabytes) of data, and it’s the backend of the agency’s most widely used analytical capabilities. Accumulo’s ability to handle data in a variety of formats (a characteristic called “schemaless” in database jargon) means the NSA can store data from numerous sources all within the database and add new analytic capabilities in days or even hours.

“It’s quite critical,” he added.

What the NSA can and can’t do with all this data

As I explained on Thursday, Accumulo is especially adept at analyzing trillions of data points in order to build massive graphs that can detect the connections between them and the strength of the connections. Fuchs didn’t talk about the size of the NSA’s graph, but he did say the database is designed to handle months’ or years’ worth of information and let analysts move from query to query very quickly. When you’re talking about analyzing call records, it’s easy to see where this type of analysis would be valuable in determining how far a suspected terrorist’s network might spread and who might be involved.
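To make the idea concrete, here is a minimal sketch of that kind of graph analysis, written in plain Python rather than against Accumulo. It assumes call records arrive as simple (caller, callee) pairs; the function names, the sample records and the two-hop limit are all hypothetical and only illustrate how connection strength and network reach could be computed.

```python
from collections import defaultdict, deque


def build_call_graph(call_records):
    """Build an undirected graph where each edge weight is the number of calls
    between two numbers (a stand-in for 'strength of the connection')."""
    graph = defaultdict(lambda: defaultdict(int))
    for caller, callee in call_records:
        graph[caller][callee] += 1
        graph[callee][caller] += 1
    return graph


def contacts_within(graph, start, max_hops):
    """Breadth-first search: return {number: hops} for every number reachable
    from `start` in at most `max_hops` calls."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, {}):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]  # the starting number is not its own contact
    return hops


# Hypothetical records: A called B twice, B called C, C called D, E called F.
records = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
graph = build_call_graph(records)
network = contacts_within(graph, "A", max_hops=2)
```

With these sample records, `network` holds B at one hop and C at two hops, while D (three hops away) and the unrelated E and F are left out. A real system would run this over trillions of edges in a distributed store, which is the part Accumulo is built for.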

Stewart Baker, former NSA general counsel under George W. Bush, wrote on his blog Thursday that this type of data could also be used for general pattern recognition, the kind of thing that targeted advertisers love to do. Only, instead of the system serving someone an ad because of what they’ve been searching for and the operating system they’re using, Baker presented the hypothetical of “[an] American who makes a call to Yemen at 11 a.m., Sanaa time, hangs up after a few seconds, and then gets a call from a different Yemeni number three hours later.”

The big legal question here is around probable cause and whether the government should further investigate this caller based on call patterns similar to those of known terrorists, but the big data question is around false positives. Baker’s hypothetical might appear pretty cut and dried, but, data scientist Joseph Turian explains, call records in general probably don’t offer a very strong signal and could lead to situations where innocent behavior patterns look a lot like nefarious ones. “But once you start connecting the dots with other pieces of information you have from other sources,” he said via email, “you can start making more predictions.”

This is where a program like PRISM, the NSA’s reported effort to collect data straight from the likes of Google, Facebook and Apple, could come into play. If you’re able to tie a name or web account to a phone number, you can figure out all sorts of information. If you can prove that certain people are radical Islamists, for example, you can start to infer more things about the others in that social graph.

And if Sqrrl’s capabilities are any indicator of what Accumulo is supporting within the NSA, the agency can perform a lot of simpler functions on its data as well. In addition to graph processing, said Ely Kahn, Sqrrl’s co-founder and VP of business development, their product includes pre-packaged analytic capabilities around SQL queries and full-text search, and also supports streaming data. This means Sqrrl’s version can support any number of interesting use cases — from processing data as it hits the system to keeping a massive index that can be searched in the same way someone searches the web.

How much data is the NSA collecting? Follow the money

We’re not quite sure how much data the two programs that came to light this week are actually collecting, but the evidence suggests it’s not that much — at least from a volume perspective. Take the PRISM program that’s gathering data from web properties including Google, Facebook, Microsoft, Apple, Yahoo and AOL. It seems the NSA would have to be selective in what it grabs.

Assuming it includes every cost associated with running the program, the $20 million per year allocated to PRISM, according to the slides published by the Washington Post, wouldn’t be nearly enough to store all the raw data — much less new datasets created from analyses — from such large web properties. Yahoo alone, I’m told, was spending over $100 million a year to operate its approximately 42,000-node Hadoop environment, consisting of hundreds of petabytes, a few years ago. Facebook users are generating more than 500 terabytes of new data every day.

Using about the least-expensive option around for mass storage — cloud storage provider Backblaze’s open source storage pod designs — just storing 500 terabytes of Facebook data a day would cost more than $10 million in hardware alone over the course of a year. Using higher-performance hard drives or other premium gear — things Backblaze eschews because it’s concerned primarily about cost and scalability rather than performance — would cost even more.

Even at the Backblaze price point, though, which is pocket change for the NSA, the agency would easily run over $20 million trying to store too many emails, chats, Skype calls, photos, videos and other types of data from the other companies it’s working with.
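The back-of-the-envelope math can be sketched in a few lines of Python. The 500 terabytes per day and the $20 million budget come from the figures above; the $60-per-terabyte hardware cost is an assumption chosen only to be roughly consistent with the "more than $10 million" estimate, not a quoted Backblaze price.

```python
TB_PER_DAY = 500              # new Facebook data per day, as cited above
DAYS_PER_YEAR = 365
ASSUMED_COST_PER_TB = 60.0    # hypothetical hardware cost in dollars per terabyte
PRISM_BUDGET = 20_000_000     # reported annual PRISM budget in dollars


def yearly_hardware_cost(tb_per_day, cost_per_tb, days=DAYS_PER_YEAR):
    """Hardware cost of keeping one year of data that arrives at a steady daily rate."""
    return tb_per_day * days * cost_per_tb


facebook_tb = TB_PER_DAY * DAYS_PER_YEAR                                  # 182,500 TB
facebook_cost = yearly_hardware_cost(TB_PER_DAY, ASSUMED_COST_PER_TB)     # $10,950,000
share_of_budget = facebook_cost / PRISM_BUDGET                            # about 0.55
```

Under that assumed price, a single year of Facebook data alone comes to about 182.5 petabytes and eats more than half of the reported budget in hardware, before power, staff, replication or any of the other companies are counted.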

Actually, it’s possible the intelligence community is taking advantage of the Backblaze designs. In September 2011, Backblaze CEO Gleb Budman says, he met with CIA representatives who discussed that agency’s five-year plan “to centralize data services into a large private cloud” and how Backblaze’s technology might fit into it. Its plans for analyzing this data, as illustrated in the slide below (and discussed by CIA CTO Ira “Gus” Hunt at Structure: Data in March), seem to mirror what the NSA has in mind.

[Image: CIA big data slide]

Whatever type of gear the NSA is using, though, and however much it’s spending on the Verizon data or PRISM specifically, we do know the agency is spending a lot of money on its data infrastructure. There are those dozens (at least) of petabytes of overall data in Accumulo, and the agency is famously building a 1-million-square-foot, $1.5 billion data center in Utah. It recently began construction on a 600,000-square-foot, $860 million facility in Maryland.

Policies are in place

Sqrrl’s Kahn — who previously served as director of cybersecurity strategy at the National Security Staff in the White House — says even with all the effort it’s putting into data collection and analysis, the NSA really is concerned about privacy. Not only are there strict administrative and legal limitations in place about when the agency can actually search through collected data (something Stewart Baker explains in more detail in a Friday blog post), but Accumulo itself was designed with privacy in mind.

The system itself is designed to make sure there’s not a free-for-all on data, another individual familiar with Accumulo said.

It has what Kahn and Sqrrl CTO Fuchs described as “cell-level” security, meaning administrators can manage access to individual pieces of data within a table. Furthermore, Fuchs explained, those policies stick with the data as it’s transformed as part of the analysis process, so someone prohibited from seeing it won’t be able to see it just because it’s now part of a different dataset. When data would come into the NSA from the CIA, he said, there were policies in place around who could see it, and Accumulo helped enforce them.
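A toy model shows how cell-level security of this kind can work. The Python below is not Accumulo’s API: the `Cell` class, the label names and the rule that a reader must hold every label on a cell are simplifying assumptions, as is the rule that derived data inherits the union of its sources’ labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # a reader must hold every one of these labels


def visible_cells(cells, reader_labels):
    """Return only the cells whose required labels are all held by the reader."""
    reader_labels = frozenset(reader_labels)
    return [c for c in cells if c.required_labels <= reader_labels]


def derive(row, column, value, sources):
    """Create a derived cell that inherits the union of its sources' labels,
    so the access policy travels with the data through analysis."""
    labels = frozenset().union(*(s.required_labels for s in sources))
    return Cell(row, column, value, labels)


# Hypothetical data: one NSA-only cell and one that also needs a CIA label.
cells = [
    Cell("555-0100", "call:duration", "12s", frozenset({"nsa"})),
    Cell("555-0100", "source:report", "memo", frozenset({"nsa", "cia"})),
]
summary = derive("555-0100", "analysis:summary", "flagged", cells)

nsa_only_view = visible_cells(cells + [summary], {"nsa"})
full_view = visible_cells(cells + [summary], {"nsa", "cia"})
```

A reader holding only the `nsa` label sees just the call-duration cell; the derived summary stays hidden because it was built partly from CIA-labeled data. A reader holding both labels sees all three cells.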

Even agencies within the Department of Homeland Security are using or experimenting with Accumulo, Kahn added, because proposed legislation would put them in charge of ensuring privacy as cybersecurity data changes hands between the government and private corporations.

It’s ironic, he acknowledged, but Accumulo actually flips the presumed paradigm that stricter security and privacy regulations mean less sharing. That might be a shallow victory for citizens concerned about their civil liberties, but data collection and sharing don’t seem likely to stop any time soon. At least it’s something.