Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.
There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the [company]Hortonworks[/company] platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.
But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.
Spark isn’t inherently competitive with Hadoop (in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point), but in practice it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier to use than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

In the near term, Hadoop will probably pull Spark into the mainstream, because Hadoop is still, at the very least, a cheap, trusted big data storage platform. And with Spark still relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing some more tension between the two platforms, or at least an evolution in how they relate to each other.
This will be especially true if there’s a big breakthrough in RAM technology or if RAM prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.
Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it: what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place on the exact same days in New York, where folks can dive into the nitty-gritty of working with the platform.
There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:
- [company]Microsoft[/company] added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
- [company]HP[/company] likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
- [company]Oracle[/company] announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine it will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
- [company]Salesforce.com[/company] expanded its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects of stealing any non-Salesforce-related analytics business from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing, Anna Rosenman, described to me as “a really positive testament that we have been able to replicate a consumer interaction model.”
If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.
[soundcloud url="https://api.soundcloud.com/tracks/191875439" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]