NetApp does network-attached Hadoop

Seeking to appease enterprise customers demanding more reliable and efficient Hadoop clusters to power their big data efforts, NetApp (s ntap) has partnered with Cloudera to deliver a preconfigured Hadoop storage system. Called the NetApp Open Solution for Hadoop, the new product combines Cloudera’s Hadoop distribution and management software with a NetApp-built RAID architecture.

As we’ve explained here before, Hadoop is great for storing and processing large quantities of unstructured data, but keeping it running smoothly involves a fair amount of operational complexity. As Cloudera’s head of business development Ed Albanese explained to me, because Hadoop architectures generally put the computing and storage layers on the same commodity servers, it can be tedious and difficult to pull and replace a server if a disk goes down. And although that architecture does result in relatively low costs and high performance, it also can be power-hungry, because users are left scaling both layers when they might need to scale only one of the two.

And clusters have been growing, presumably to keep pace with fast-growing data volumes. In July, Cloudera’s Omer Trajman noted that customers’ average cluster size had grown to more than 200 nodes, and that 22 customers were managing at least a petabyte of data in their clusters.

The goal of the new NetApp product, Albanese said, is threefold: 1) to separate the compute and storage layers of Hadoop so each can scale independently; 2) to fit with next-generation data center models around efficiency and space savings; and 3) to improve reliability by being able to hot-swap failed drives and otherwise leverage NetApp’s storage expertise. Cloudera will start shipping a version of its Cloudera Enterprise management software designed specifically for this system, which will make it easier for customers to monitor storage performance and know when to add or replace drives.

Jeff O’Neal, NetApp’s senior director of data center solutions, added that his company’s foray into Hadoop will maintain performance levels even though data is now traversing the network to get to the compute nodes from the storage system. The data and compute loads are still logically connected, he said, and the storage layer maintains the Hadoop Distributed File System’s native shared-nothing architecture.

The product comes at the right time, as Apache-centric versions of Hadoop have come under fire from newcomers such as MapR and its partner EMC (s emc), which claim they can deliver better performance, reliability and availability than HDFS can. For enterprise customers that prefer the open-source nature of Cloudera’s Hadoop distribution but are rightfully concerned about HDFS reliability because they’re running production Hadoop workloads, the NetApp solution likely will provide a welcome point of comparison.

Interestingly, NetApp first mentioned its NetApp Open Solution for Hadoop when Yahoo (s yhoo) spinoff — and Cloudera competitor — Hortonworks launched in June. NetApp signed on as a Hortonworks partner early, claiming the new system would support the Hortonworks Hadoop distribution. O’Neal declined to comment on the Hortonworks situation, although considering that Hortonworks just announced its first software last week, it’s possible NetApp will still add a Hortonworks edition of its “open solution” when those products are ready for production.

Feature image courtesy of Flickr user miheco.