BigML, a machine-learning-based cloud service that lets users generate statistical predictions from their complex data, has revamped the service to include textual analysis. No, it won’t analyze the sentiment of your tweets or translate your documents into Spanish, but it will use words as variables when getting to the bottom of how your data is connected.
BigML’s Andrew Shikiar and Poul Petersen gave me a demo of the new feature last week, and it’s potentially powerful given the right data and a user skilled enough to navigate the relatively simple (this is machine learning, after all) interface. Whereas the service previously would have ignored text columns in any tables that a user uploaded (e.g., the “name” field in a table tracking consumer sales), it now takes the words in those columns into account when predicting outcomes.
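To make "words as variables" concrete, here is a minimal sketch of one common way a text column can be turned into model inputs: each frequent word becomes a 0/1 column. This is an illustration under stated assumptions, not BigML's actual implementation; the function names, the `has_` prefix and the sample rows are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_features(rows, text_field, top_n=5):
    """Replace `text_field` with one 0/1 column per frequent word.

    Hypothetical helper for illustration only; it is not BigML's API.
    """
    # Count how many rows each word appears in (document frequency).
    counts = Counter()
    for row in rows:
        counts.update(set(tokenize(row[text_field])))
    vocab = [word for word, _ in counts.most_common(top_n)]

    featurized = []
    for row in rows:
        present = set(tokenize(row[text_field]))
        # Keep the non-text columns, then add a presence flag per word.
        features = {k: v for k, v in row.items() if k != text_field}
        for word in vocab:
            features[f"has_{word}"] = int(word in present)
        featurized.append(features)
    return vocab, featurized

# Made-up sample rows, loosely modeled on a consumer-sales table.
rows = [
    {"name": "Chocolate cake recipe", "sales": 10},
    {"name": "Easy bread recipe", "sales": 7},
    {"name": "Garden hose", "sales": 3},
]
vocab, featurized = word_features(rows, "name", top_n=1)
```

A tree-based model could then split on a column such as `has_recipe` just as it would on any numeric field.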
Petersen showed me an impressive example in which he used data from a Kaggle competition to predict the shelf life of content on StumbleUpon. The competition data includes a field labeled “boilerplate,” which contains text in JSON documents. A quick analysis of the entire dataset using BigML showed that the “boilerplate” field is the most important in predicting longevity, and the model predicted with 88 percent confidence that pages containing the word “recipe” will remain popular for a long time.
In this blog post published on Tuesday, Shikiar showed how BigML analyzed a collection of foreign-worker visa applications to predict the salaries (or, in one case, the state) of immigrant technology workers. You can see from the word cloud it produced that Infosys hires a lot of foreign workers:
Here, you can see that the model predicts a certain salary for software employees at Facebook:
There are a handful of options for customizing the text field, too, such as the ability to pare words down to their stems (e.g., “greatness” becomes “great”). If you’re into accuracy, BigML also now lets users run ensemble models (or forests) and test the accuracy of their models. Users building models across very large datasets or who have built BigML predictions into their applications via API can use a new feature called PredictServer that runs predictions tens of times faster on a dedicated server.
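For a rough sense of what stemming does, the sketch below strips a few common English suffixes so that related words collapse to one token. It is a deliberately naive stand-in: the suffix list and the `naive_stem` name are assumptions made for illustration, and production stemmers (the Porter algorithm, for example) apply far more careful rules. Nothing here reflects how BigML itself stems words.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ness", "ing", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print(naive_stem("greatness"))  # great
print(naive_stem("recipes"))    # recipe
```

The payoff for a model is that “recipe” and “recipes” count as the same variable instead of splitting their evidence across two columns. The cost is that crude rules will occasionally mangle a word or merge two that are unrelated.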
As BigML keeps maturing and adding new features, its toughest task might be figuring out its target users and tailoring the experience around them. I like the service, but the more features it adds, the more I can see how a formal grounding in statistics and data analysis would help me make better use of it. Then again, if I had those skills, I might prefer any number of advanced software packages that let me do a whole lot more.