Updated: Yahoo will be spinning off a separate company, called Hortonworks, focused on the development and commercialization of Apache Hadoop. The official announcement likely will come tomorrow or Wednesday to coincide with Yahoo’s annual Hadoop Summit, but rumors have been circulating for months and I confirmed the news today with a source familiar with the project.
Because Yahoo originated the Hadoop technology, its official entry into this space should play a big role in shaping how the market for Hadoop-based products evolves.
Yahoo’s Hortonworks (as in the Dr. Seuss book “Horton Hears a Who!,” a reference to the elephant logo that Apache Hadoop bears) will be composed of a small team of Yahoo’s Hadoop engineers and will focus on developing a production-ready product based on the Apache Hadoop project, the set of open source tools designed for processing huge amounts of unstructured data in parallel. It’s a natural step for Yahoo, which uses Hadoop heavily within its own web operations, and which has contributed approximately 70 percent of the code to Apache Hadoop since the project’s inception.
By incorporating next-generation features and capabilities, Hortonworks hopes to make Hadoop easier to consume and better suited for running production workloads. Its products, which likely will include higher-level management tools on top of the core MapReduce and file system layers, will be open source, and Hortonworks will try to maintain a close working relationship with Apache. The goal is to make Hortonworks the go-to vendor for a production-ready Hadoop distribution and support, but also to advance Yahoo’s oft-repeated mission of making the official Apache Hadoop distribution the place to go for core software. Earlier this year, Yahoo discontinued its own Hadoop distribution, recommitting all that code and all its development efforts to Apache.
The introduction of Hortonworks means that other companies peddling Hadoop-based products can’t rest on their laurels. Cloudera, which pioneered commercial Hadoop, and EMC, which just launched its own set of Hadoop tools (a community version based on Facebook’s optimized Hadoop code, and an enterprise version leveraging MapR’s technology), are now on notice. Hortonworks differs from Cloudera because Hortonworks is more involved in software development, and the spinout’s tight alliance with Apache distinguishes it from the EMC products. Still, if it wants to gain adoption, Hortonworks will have to ensure it advances Hadoop development across industry lines and not just in a manner optimized for Yahoo’s web-scale needs.
Despite all the talk about Hadoop, evidence suggests the revenue base for the software that Hortonworks, Cloudera and EMC peddle is currently paltry. Cloudera is leading the charge right now with what I’ve heard is a few million in annual revenue, but that’s hardly enough to sustain the amount of investment in Hadoop. Cloudera alone has raised $36 million, VCs have funded a number of other Hadoop-focused startups, and companies such as EMC and IBM are funding Hadoop strategies from their own coffers. Everyone with a stake in the outcome of Hadoop envisions a billion-dollar opportunity, so seeing how, or if, these companies are able to split the market and share revenue at least three ways makes this a fun race to watch. They also face increased competition from Hadoop alternatives such as LexisNexis spinoff HPCC Systems and Microsoft’s forthcoming Dryad tools.
Hortonworks will be a joint venture between Yahoo and an investor, presumably Benchmark Capital. The Wall Street Journal reported in May that Benchmark was in talks with Yahoo about how to handle launching the new company.
Update: Yahoo and Benchmark Capital officially launched Hortonworks on Tuesday afternoon. Eric Baldeschwieler, formerly VP of software engineering for the Hadoop team at Yahoo, will serve as CEO. NetApp is already on board as a Hortonworks ecosystem partner, supporting the distribution with its new NetApp Hadoop Open Storage System. Referred to internally as “Hadooplers,” HOSS centers on E-Series-based RAID configurations, a setup whose “design dramatically improves the performance, scalability and predictability of congested Hadoop cluster networks by offloading most data ingest and object reconstruction (aka re-silvering) traffic.”