We might be tiring of the term big data, but there’s still a lot of value to be squeezed from the concept. This is true even in its purest form, where we’re doing relatively simple operations on a mountain of data in order to see if there’s a notable trend or correlation in there.
The latest example of why this is true comes from GDELT, the massive geosocial-event database that’s now housed in Google’s cloud. Its creator, Georgetown professor Kalev Leetaru, has analyzed the Arab Spring uprising in Egypt, as well as the current situation in Ukraine, against data dating back to 1979 in an attempt to answer the question of whether history really does repeat itself.
Finding the answer, he acknowledges, will take a lot more expert analysis, but his data can give researchers a great start. Generating it took a single SQL query (researchers can access GDELT and analyze it for free using Google BigQuery) that finds the periods and countries in history whose patterns of activity most resemble those of any given period: for example, the 60 days preceding the ouster of former Ukrainian president Viktor Yanukovych and the 60 days after it.
The example above compares just that, with Ukraine in red and a 120-day period from 1999 in Turkey — the most highly correlated period — in green.
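For readers who want a feel for the matching step, here is a minimal Python sketch of the general idea: slide a window the length of the reference period across each country's daily event counts and rank the windows by how closely they track the reference. This is not Leetaru's BigQuery query. It assumes plain Pearson correlation as the similarity measure (the exact measure isn't specified here), the function names `pearson` and `best_matches` are hypothetical, and the numbers are made-up toy data rather than GDELT values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        # A flat series has no defined correlation; treat it as "no match".
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def best_matches(reference, history, top_n=3):
    """Rank every window in `history` by its correlation with `reference`.

    `history` maps a country name to a list of daily event counts.
    Returns up to `top_n` (correlation, country, start_index) tuples,
    highest correlation first.
    """
    window = len(reference)
    scored = []
    for country, series in history.items():
        # Series shorter than the window simply yield no candidates.
        for start in range(len(series) - window + 1):
            r = pearson(reference, series[start:start + window])
            scored.append((r, country, start))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_n]


# Toy data (invented for illustration, not real event counts).
reference = [1, 2, 5, 9, 4, 3]
history = {
    "A": [0, 0, 2, 4, 10, 18, 8, 6, 1, 1],
    "B": [5, 5, 4, 4, 3, 3, 2, 2, 1, 1],
}
top = best_matches(reference, history, top_n=1)
print(top)
```

In this toy case the best match is the window starting at index 2 in country "A", which is simply the reference doubled. Correlation ignores scale, so a period can match on the shape of its activity even when the raw event counts are very different.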
In another query, shown below, Leetaru averaged post-peak events in Turkey and 120 days from 2007 in Libya (the second-highest correlation with Ukraine). He claims the results are noteworthy in how they differ from an analysis of Arab Spring Egypt and its closest correlates. While Ukraine, Turkey and Libya matched each other in the spikiness of events even after their peaks, Egypt and its close matches show a marked and relatively sustained drop.
Leetaru’s takeaway from his analysis?
“While it is unlikely that one would build a true political risk forecasting system on an approach this simple, it does suggest that world history, at least the view of it we see through the news media, is highly cyclic and predictable, and that there is much yet to be discovered. Will these patterns hold for every country and time period and is there a certain rolling window size that works better or worse? Does a different time interval or switching to a different set of event types improve or degrade accuracy? Does it work better just before a conflict or only in its first few days?”
I would add that this is where subject-matter experts come in to start examining what else those periods have in common, and what types of events we’re dealing with. They might examine how leaders, geographies or all sorts of other factors seem to affect these patterns.
And that’s the real value of analyzing really big data about important, complicated issues. I think it holds true for most things, from Google Flu Trends to cancer research, and from health care to industrial machines. Strong correlations don’t necessarily mean causation, or that certain outcomes are guaranteed, but across large enough datasets they’re a big, flashing red arrow saying “Examine this!”
The real value of cloud computing is in putting all this data in a centralized place with centralized computing resources so researchers aren’t on the hook for somehow downloading it, storing it and having enough computers to analyze it. Last month, for example, both Amazon Web Services and Microsoft announced contests related to the White House’s Climate Data Initiative.
It might be that there’s nothing of value to be gleaned from Leetaru’s analysis of modern history, or it might be that there’s a nugget of immense value buried a few layers below the surface. But if we really want to find answers to tough problems, we owe it to ourselves to examine every signal. Done right, big data provides a lot of them.