What it really means when someone says ‘Hadoop’

Big data is among the hottest trends in IT right now, and Hadoop stands front and center in the discussion of how to implement a big data strategy. There’s just one problem that keeps cropping up: many people don’t seem to know exactly what it means when somebody says “Hadoop.”

The problem surfaced again Monday in the form of complaints over Forrester’s new report titled “Enterprise Hadoop Solution, Q1 2012.” InformationWeek spoke with a few vendors that didn’t like how their products were assessed, and database industry analyst Curt Monash says the report “compares apples, peaches, almonds, and peanuts.” I thought the same thing when I saw a copy of the report last week. The vendors it covers all focus on Hadoop, but Hortonworks is not Datameer, and neither one is HStreaming.

Allow me to explain. Hopefully, this provides a foundation for parsing what people talk about when they talk about Hadoop, and for differentiating one type of product from another. (And you can learn even more about Hadoop and how it’s used at our Structure: Data conference taking place next month in New York City.)

What Hadoop is

I went into this in more detail in a GigaOM Pro report published last March (sub req’d), but the long and short of it is that Hadoop is, at its core, an Apache Software Foundation project consisting of two primary subprojects: Hadoop MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is the parallel-processing engine that allows Hadoop to churn through large data sets in relatively short order. HDFS is the distributed file system that lets Hadoop scale across commodity servers and, importantly, store data on the same nodes that do the computing in order to boost performance (and potentially save money). These are the two must-have components for any Hadoop distribution.
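For readers who have never seen the MapReduce model in action, here is a minimal sketch of the idea in plain Python. It is not Hadoop's API and it runs on a single machine; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. It only shows the three steps a MapReduce job goes through: map each input record to key/value pairs, group the pairs by key, then reduce each group to a result. In a real cluster, the map and reduce steps would run in parallel across many nodes, ideally on the nodes that already hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data on the cluster"]))
```

Because each call to map_phase looks at only one line, and each call to reduce_phase looks at only one key, the work can be split across machines without the pieces needing to talk to one another. That independence is what lets the model scale.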

There are also a number of Apache projects related to Hadoop, often built atop either Hadoop MapReduce or HDFS. These include, but are not limited to, Hive and Pig, two higher-level query languages (Hive’s is SQL-like, while Pig’s is a data-flow scripting language) that provide data-warehouse-like capabilities to a Hadoop cluster, and HBase, a NoSQL database that leverages HDFS as its distributed storage engine.

Hadoop distributions

These are packaged software products that aim to ease deployment and management of Hadoop clusters compared with simply downloading the various Apache code bases and trying to cobble together a system. Presently, Cloudera, Hortonworks, MapR and EMC all offer their own Hadoop distributions. Although each is different, sometimes markedly so, as with MapR’s proprietary file system, they all package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that, in theory, makes them integrate more naturally and run both smoothly and securely.

Many Hadoop distributions integrate with various data warehouses, databases and other data-management products, with the goal of moving data between Hadoop clusters and other environments so each might process or query data stored in the other.

Hadoop management software

Just as the wording implies, Hadoop management software is designed to make it easier to manage and troubleshoot a Hadoop cluster. Such products are usually sold or offered by companies peddling Hadoop distributions, because even when commercially packaged, Hadoop is still a complex architecture and somewhat foreign to most IT personnel and products. However, third parties such as Platform Computing (now part of IBM) and Zettaset also sell software for managing Hadoop clusters, and their products are typically agnostic as to what distributions they support.

But distributions and management software are all about the infrastructure and the platform. Anyone actually wanting to use Hadoop still needs to know how to write applications that leverage the underlying architecture.

Hadoop application software (or, products that use Hadoop)

The Hadoop ecosystem gets really complex when we start looking at products that exist to help developers write Hadoop applications or otherwise analyze data stored within Hadoop in a manner other than writing traditional MapReduce jobs. These range from abstraction layers such as Karmasphere Analyst or IBM InfoSphere BigInsights, to Hadapt, which offers a single-platform product fusing a SQL data warehouse with a Hadoop cluster, to HStreaming, which promises real-time processing and analytics.

The one common thing among all these products, however, is that they are not Hadoop distributions, but sit atop platform software from Hortonworks, EMC or whomever. Some products that get thrown into the Hadoop fray, such as Outerthought Lily or Drawn to Scale Spire, are essentially scale-out databases built atop HBase (which itself is a separate project built atop HDFS). The image below, from Karmasphere, gives a particularly clear map of how a Hadoop environment might look.

The applications and analytics space is probably where we’ll see the biggest influx of new companies, as writing Hadoop applications is still tough, but it’s also how companies will actually start experiencing direct business benefits. In fact, it’s these types of higher-level products that are the focal point of Accel Partners’ new big data fund.