etcML: A free, easy and fairly accurate tool for analyzing the text of tweets

If you’re into using machine learning to score a collection of text files for sentiment but don’t want to have to actually learn anything about using machine learning to score a collection of text files for sentiment, have we got a tool for you: It’s called etcML, and it lets you upload a document or enter a Twitter search term, and then set loose on it a classifier to do anything from extracting topics to scoring the sentiment of each sentence or tweet. (Hat tip to FlowingData for pointing out its existence.)

It’s from a group of Stanford graduate and Ph.D. students (including deep learning researcher Richard Socher, who claims to have mastered sentiment analysis on movie reviews) and, best of all, it’s free.

Technically, etcML does reward users who know a thing or two about machine learning and are willing to put forth a little effort (it lets them train their own classifiers on their own data), but the real utility (and fun part) for many people might be the ease with which it lets them perform sentiment analysis on Twitter data. You simply click on “Search for tweets”; enter a hashtag, keyword or handle; choose a classifier (it will automatically suggest one designed for sentiment analysis on tweets); and etcML does the rest. It returns an interactive timeline visualization, a collection of tweets that scored strongly for each label, and a searchable table for sorting the results or finding specific words/tweets.

I used it to analyze my body of tweets over the past year, and the results were fairly accurate, although definitely not perfect. They also opened my eyes to the fact that I might consider spicing up my headlines (the majority of my tweets are links to stories I’ve written) with more feeling words.

Lotsa gray there …

Yup, these look about right. (A side benefit is etcML reminds you of particularly negative tweets you might not recall having posted; people who tweet more than minimally might find it an insightful look into their psyches.)

I forgot about that Zynga tweet. That dude was annoying.

As I said, though, it’s not perfect. I’m pretty sure the bottom tweet in the results below wasn’t positive, but sarcasm is a known trouble spot for many natural language processing algorithms.

No, I was not happy to be at Interop.

Oh, and while etcML doesn’t give you exact statistics about the percentage of tweets that are positive, negative and neutral, it does let you download your results. I did just that and uploaded them to DataHero (which is fresh in my mind after writing about its funding and redesign on Tuesday morning). The volume of neutral tweets is even more striking when shown as a percentage and as a slice of a pie chart.

DataHero chart: My headlines are neutral

For an added layer of insight, I broke down each label by how confident etcML was of each on average. Apparently, my negative tweets aren’t overtly negative.

DataHero chart: My headlines are neutral
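If you’d rather skip the charting tool, the same two breakdowns (each label’s share of the total, and the average confidence per label) take only a few lines of Python. This is a rough sketch, not a description of etcML’s actual export: the column names `label` and `confidence` are my assumptions, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize(rows, label_col="label", conf_col="confidence"):
    """Return {label: (percent_of_rows, mean_confidence)}.

    The column names are assumptions; adjust them to match the
    header row of the file you actually downloaded.
    """
    counts = defaultdict(int)
    conf_sums = defaultdict(float)
    for row in rows:
        label = row[label_col]
        counts[label] += 1
        conf_sums[label] += float(row[conf_col])
    total = sum(counts.values())
    return {
        label: (100.0 * n / total, conf_sums[label] / n)
        for label, n in counts.items()
    }


# Made-up sample standing in for a downloaded results file.
SAMPLE = """text,label,confidence
Story link one,neutral,0.90
Story link two,neutral,0.80
Great news today,positive,0.70
That was annoying,negative,0.60
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
for label, (share, mean_conf) in sorted(summary.items()):
    print(f"{label}: {share:.1f}% of tweets, mean confidence {mean_conf:.2f}")
```

To run it on a real download, swap the `io.StringIO(SAMPLE)` part for an opened file, for example `csv.DictReader(open("results.csv", newline=""))`, where `results.csv` is a placeholder name.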

I’m always on the lookout for new tools that make it easy to find data, analyze it and visualize it, and etcML certainly fits that bill. It’s a pretty powerful proposition to know that even if we can’t all be data scientists, we can all try to make some sense out of the data that matters to us, sometimes in very little time and often for no money. Or, we can just play around with whatever data we have lying around and make some pretty charts.

Here’s a collection of other tools and methods I’ve looked at for analyzing my often personal, insignificant and generally small data: