Like most people, I suspect, I wasn’t too surprised to find out that Hadoop-focused startup Karmasphere has secured a $5 million initial funding round. After all, Hadoop’s mastery of large-scale analysis of unstructured data has made it a darling of web companies – not to mention utility companies, defense contractors, social scientists, and even traditional IT vendors. Then there’s the fact that fellow Hadoop commercializer Cloudera has already brought in $11 million. If Hadoop catches on as the evidence suggests it will, Karmasphere’s desktop-based Hadoop-management tools could pay off for investors many times over.
In some ways, though, the fact that Hadoop is mature enough to inspire commercial products means it’s yesterday’s news. I’m wondering which open-source, big-data-inspired product will be the next to launch a wave of startups and drive tens of millions in VC spending. Cassandra? Or Gizzard, perhaps?
Given its growing popularity and expanding functionality, Cassandra right now seems like a prime candidate. Although its roots are with Facebook, Rackspace has taken over the Cassandra-development reins, and it has also caught on at Digg, Twitter, Reddit, Cloudkick and Cisco, to name a few other users. Their varying use cases – from storing real-time session data to housing mountains of machine-generated metrics – illustrate Cassandra’s versatility. It’s not only for the social media crowd.
Furthermore, Cassandra graduated to a top-level Apache project in February, a sign of the quality of the work done on it thus far and, most likely, a precursor to a groundswell of new developers. To whatever degree the comparison is relevant, just over a year elapsed between Hadoop becoming a top-level Apache project in January 2008 and Cloudera’s official launch in March 2009. If that trend holds true, we could see the first commercial Cassandra company emerge around this time next year.
Sure, its name doesn’t roll off the tongue or conjure images of Greek mythology, but Twitter’s newly open-sourced Gizzard tool does have promise. By eliminating some pain from the often difficult sharding process, Gizzard makes it easier to build and manage distributed data stores that can handle ultra-high query volumes without getting bogged down. Granted, it is early days for Gizzard outside of Twitter, but one can see how it might find mass applicability in web companies and, perhaps, in traditional companies a few years down the road. Like Google, Yahoo and Facebook before it, Twitter has played a role in evolving how we use the web, and software developed within its walls, like that of its predecessors, should be a hot commodity for present and future Twitter-inspired sites and products. (Although I suspect a name change might be in order first.)
Big data has narrowed the gap between the needs of bleeding-edge web companies, their offspring and even traditional businesses. Hadoop has caught on across industry boundaries as an analytics tool for unstructured data sets, and it seems logical that other web-based tools will catch on in other parts of the data layer. Cassandra and Gizzard look like strong candidates in their respective fields, but we’ll have to wait a while to find out for sure.