In last week’s post, I summarized news that was shared with me before Strata + Hadoop World and revealed on the opening day of the show. In this week’s post, I add one more item to that list, discuss some additional announcements revealed during the show itself, and discuss a few more developments that have taken place this week.
The Spark didn’t dim
First off, I missed an important item that was available before the show started. Self-service data prep pure play Paxata announced last Tuesday the Fall 2014 release of its Adaptive Data Preparation product. This release works in-memory and is built on (you guessed it) Apache Spark. Nenshad Bardoliwalla, Paxata’s Co-Founder and VP of Products, with whom I spoke while at Strata + Hadoop World, freely admitted that we’re still in the early days of Spark and told me that the Paxata team built a lot of extra code on top of the base Spark engine to make it meet Paxata’s requirements.
Another product built to sit right on top of Spark is ClearStory Data, a turnkey data exploration environment that handles everything from ingest to visualization. The company used Strata + Hadoop World to announce the release of Collaborative StoryBoards, a data storytelling feature that adds presentation capabilities to ClearStory, letting users avoid copying and pasting visualizations into PowerPoint, Keynote or any other slide/presentation package.
Want some machine learning to go with that storytelling? Seattle-based GraphLab announced the general availability of version 1.0 of its GraphLab Create platform, which integrates with Hadoop and Spark. Designed for data scientists and developers who are comfortable working with functional programming, the product includes a Predictive Services component that can deploy models to Amazon Web Services’ EC2 platform; deep learning capabilities; visualization features; and Boosted Trees algorithms.
And, no, the built-on-Spark story doesn’t end there. Hadoop-based analytics vendor Platfora announced version 4.0 of its eponymous product at Strata + Hadoop World, and revealed that with this release, the Platfora product, too, sits atop Spark. Ben Werther, Platfora’s CEO, sat down with me at Strata to explain that Platfora’s Spark-ification goes well beyond a retrofit to use Spark as the underlying engine. Platfora now accommodates “Platfora Platform Extensions,” which are in fact Spark modules with a thin wrapper layer to make them snap into Platfora. Werther also let me know that Platfora 4.0 adds sophisticated geo analysis and mapping capabilities. I first met Werther before the Platfora product had GA’d, back in the bad old MapReduce days of Hadoop. Fast forward to today, and we see that Spark has made the company’s bet on Hadoop pay off.
More Cloudera partnerships
Speaking of betting on Hadoop (and Spark), a couple of additional Cloudera partnerships were revealed after I posted last week’s update. One of these partnerships is with Informatica, and results in the integration of that company’s Big Data Edition product with Cloudera Navigator, providing Cloudera’s data governance control center with access to the myriad data sources with which Informatica connects.
The other Cloudera partnership announced during Strata is a biggie: Red Hat. This partnership means the companies will “deliver joint solutions to enterprise customers with cooperative documentation, marketing and support,” according to Red Hat’s press release on the alliance. Under the tie-up, Cloudera Enterprise, Navigator and Director (the company’s new cloud deployment component) will integrate with Red Hat Enterprise Linux, OpenStack, Sahara, CloudForms and Storage Server. In addition, JBoss Middleware and OpenShift will integrate with Cloudera’s Kite libraries and Impala.
Another major Cloudera partnership was announced after Strata ended. I’ll cover that one shortly.
Supercomputing is alive and well
For those like me, who pushed into technology careers in the 80s and 90s, the name Cray is almost legendary. We just have to close our eyes and we can see large, sleek advanced-cooling cabinets filled with supercomputing hardware and adorned with the Cray logo. And given that Hadoop is very much a latter-day supercomputing platform, it makes sense that Cray and Hadoop would at some point overlap.
That point in time has come, as Cray announced at Strata + Hadoop World the release of its Urika-XA Big Data analytics platform, a beefy appliance that is pre-integrated with Hadoop and (what else?) Apache Spark. The appliance, from the company’s YarcData unit (spell Yarc backwards and you’ll understand the genesis of the name), can contain up to 48 compute nodes in a single rack, with integrated SSDs, Intel Xeon processors and Cray’s Sonexion storage system. On the software side, Cray includes the Cray Adaptive Runtime for Hadoop.
The 80s are back, baby! I’m hoping we’ll soon see a version of dBase built to run on YARN.
A cure for Post-Strata Depression?
While Strata does create a huge news cycle, and a quiet week beforehand, there’s more news this week too. First up: SAP announced a new Service Pack 9 release of HANA. And while the ERP behemoth may refer to the release as a mere service pack, there’s big stuff in it, including multi-tenancy features; a new dynamic tiering facility for management of very large data sets; a data quality component; streaming data capabilities; an ACID-compliant graph database engine; and user-defined functions for Hadoop, which let HANA execute MapReduce jobs on Hadoop directly.
If that’s not enough for you, how about Snowflake Computing, a new cloud data warehouse challenger to Amazon Redshift, led by former Microsoft and Juniper Networks executive Bob Muglia? The company came out of stealth just yesterday, and was kind enough to explain to me that its Elastic Data Warehouse, like other data warehouse platforms, is columnar and uses a Massively Parallel Processing (MPP) design, but unlike others, the node-level database engine is brand new, and not based on PostgreSQL or any other pre-existing relational engine.
While building a new relational engine might ordinarily be a fool’s errand, Snowflake’s founding team includes top-gun engineering talent from Oracle and Vectorwise, who seem like the right folks to take on the challenge. This team built an engine that is optimized for cloud storage (allowing the product to scale compute and storage separately); can handle semi-structured data; can micro-batch its operations; and can run on a highly parallelized basis within each worker node, in such a way as to maximize CPU utilization on all nodes.
And one more thing
I mentioned there was one more Cloudera partnership I needed to share with you, and that it was rather non-trivial. To cut to the chase, Cloudera and Microsoft have forged a partnership that will anoint Microsoft Azure as the preferred cloud platform for Cloudera Enterprise.
As I happen to be in San Francisco this week for Gigaom’s own Structure: Connect event, I was invited and able to attend a small event here on Monday that promised Microsoft CEO Satya Nadella and Corporate Vice President of Enterprise and Cloud (Nadella’s old job), Scott Guthrie, discussing the future of Microsoft’s cloud. Imagine my surprise as, in addition to those two Microsoft execs, Cloudera co-founder Mike Olson took the stage to demonstrate the single-click launching of a 90-node Cloudera Enterprise cluster, from the beta Azure Management Portal, and followed that up with a show-and-tell of Microsoft’s Power BI visualizing data coming from Impala on that same cluster. Just for giggles, Olson did the demo on a machine running Windows 10.
Cloudera itself runs, of course, on Linux and not Windows. But that’s OK, because at the very same event Nadella proclaimed that “Microsoft loves Linux” and that 20% of Azure runs on Linux. I don’t know if that means 20% of revenue, or of virtual machines spun up, or some other measure, but it’s noteworthy regardless. While the 80s may be back, the old Microsoft is long gone.
Where does this leave Hortonworks, on whose HDP for Windows Hadoop distribution Microsoft’s cloud Hadoop Service, HDInsight, is based? In fact, during the event’s Q&A time, I asked Guthrie whether a Cloudera-based flavor of HDInsight might one day appear. Guthrie responded with some “it’s all good” positioning on Azure’s ability to mix and match Linux and Windows, IaaS architecture (on which Cloudera for Azure is based) and PaaS architecture (on which HDInsight is premised) and, I suppose, Cloudera and Hortonworks, by extension.
In other words, Guthrie didn’t, and perhaps couldn’t, answer my question. He also didn’t have to, because this move was more about Azure than it was about Hadoop. And for Azure, this was an ace move.
I’ll be back next week with a shorter post. Unless, of course, it’s announced that all major refrigerators, washer/dryer units, and in-car computing systems are being retrofitted to run on Spark.