Georgetown University researcher Kalev Leetaru has spent years building the Global Database of Events, Language, and Tone (GDELT). It now contains data on more than 250 million events dating back to 1979, is updated daily, and records 58 fields per event across 300 categories. Leetaru uses it to produce a daily report analyzing global stability. He and others have used it to figure out whether the kidnapping of 200 Nigerian girls was a predictable event and to watch Crimea turn into a hotspot of activity leading up to ex-Ukrainian President Viktor Yanukovych’s ouster and Russia’s subsequent invasion.
“The idea of GDELT is how do we create a catalog, essentially, of everything that’s going on across the planet, each day,” Leetaru explained in a recent interview.
And now all of it is available in the cloud, for free, for anybody to analyze as they desire. Leetaru has partnered with Google, where he has been hosting GDELT for the past year, to make it available as a public dataset that users can analyze directly with Google BigQuery. Previously, anyone interested in the data had to download the 100-gigabyte dataset and analyze it on their own machines. They still can, of course, and Leetaru recently built a catalog of recipes for various analyses and a BigQuery-based method for slicing off specific parts of the data.
But there’s big promise in removing barriers by letting data scientists, policy analysts and researchers dig into it right from a browser window. BigQuery is actually remarkably powerful in terms of the types of analysis it enables, Leetaru explained, and it’s fast. Tasks that used to take him hours now take him seconds. He (as I have before) calls that kind of computing power and capability, paired with such valuable data and data scientists who can make sense of it, “a perfect marriage.”
“You’ve got all this pent-up [analytic] expertise out there,” he said. “… Go run these big queries. Tell us what’s possible.”
(Leetaru, who used to work with supercomputers and helped create the supercomputer-powered Twitter Heartbeat project, also lauded the automation and performance of Google Compute Engine, something I’ll discuss with Google SVP Urs Hölzle at our Structure conference next month.)
Leetaru has big ideas for what he thinks is possible with GDELT. He wants us to be able to understand, in real time, what’s going on in the human world just like the USGS can tell us about earthquakes: what happened, where and what to expect next. Right now, for example, Leetaru thinks he’ll be able to analyze 90 days of activity around the recent coup in Thailand and then find similar patterns around the world over the last 35 years. That could help shed light on what will happen next, to prove (or not) that history really does repeat itself.
He’s also excited about the scale he has at his fingertips, in terms of both data sources and computing. Right now, GDELT is populated from numerous news sources around the world, their content automatically processed by text-analysis and geocoding algorithms Leetaru has built and then added to the database. With advances in natural-language processing and translation, however, he’s confident he’ll soon be able to grab even more content from non-English sources (75 percent of that content, he said, isn’t available in English anywhere on the planet).
“We get too trapped in this western narrative,” he explained, citing pushback from foreign policy experts (Leetaru is a frequent contributor to Foreign Policy magazine) about the threat to Crimea when he analyzed the situation in Ukraine. The more data we have from other parts of the world, written by people on the ground, the easier time we should have predicting what will happen there.
Which is why Leetaru is also trying to get a grip on how social media operates around the world, so he can incorporate those feeds into GDELT. He’s working with the U.S. Army to translate the world’s academic literature, and he’s generally looking for ways to digitize as much content as possible going back as far in history as possible. “How do we bring all of this into one fold and, essentially, codify the world?” he asked.
This grand sort of goal wasn’t really feasible when Leetaru was doing everything from his desktop. “Now, on the cloud,” he added, “it can pretty much expand at its leisure.”
Although it’s unique because of its connection to BigQuery for analysis, GDELT isn’t the only large dataset available in the cloud. Google itself hosts many others, as does Amazon Web Services, including the 1000 Genomes Project, U.S. Census data, NASA NEX and Freebase datasets.