One of the biggest tropes in the era of big data is that data is the new oil: it’s very valuable to the companies that have it, but only after it has been mined and processed. The analogy makes some sense, but it ignores the fact that many people and companies don’t have the means to collect the data they need or the ability to process it once they have it. A lot of us just need gasoline.
Which is why I was excited to see the new Data for Everyone initiative that crowdsourcing startup CrowdFlower released on Wednesday. It’s a library of interesting and free datasets that have been gathered by CrowdFlower’s users over the years and verified by the company’s crowdsourced labor force. Topics range from Twitter sentiment on various subjects to a collection of labeled medical images.
Data for Everyone is far from comprehensive, and far from being any sort of one-stop shop for data democratization, but it is a good approach to a problem that lots of folks have been trying to solve for years: giving people who want to analyze valuable data meaningful access to it. Unfortunately, early attempts at data marketplaces such as Infochimps and Quandl, and even earlier incarnations of the federal Data.gov service, often included poorly formatted data or suffered from a dearth of interesting datasets.
It’s often said that data analysts spend 85 percent of their time formatting data and only 15 percent actually analyzing it, a situation that is simply untenable for people whose jobs don’t revolve around data, even as tools for data analysis continue to improve. All the Tableau software or Watson Analytics or DataHero or Power BI services in the world don’t do a whole lot to help mere mortals analyze data when it’s riddled with errors or formatted so sloppily that it takes a day just to get it ready to upload.
Hopefully, we’ll start to see more high-quality data markets pop up, as well as better tools for collecting data from services such as Twitter. They don’t necessarily need to be so easy that a 10-year-old can use them, but they do need to be easy enough that someone with basic programming or analytic skills can get up and running without quitting their day job. Data for Everyone looks like one such resource, as does the new Wolfram Data Drop, also announced on Wednesday.
After all, while it’s getting a lot easier for large companies and professional data scientists to collect and analyze their data for purposes ranging from business intelligence to training robotic brains (topics we’ll be discussing at our Structure Data conference later this month), the little guy, strapped for time and resources, still needs more help.