How to Use Open-Source Hadoop for the Smart Grid

At first glance it’s hard to see how the open-source software framework Hadoop, which was developed for analyzing large data sets generated by web sites, would be useful for the power grid. Open-source tools and utilities don’t often mix. But that was before the smart grid and its IT tools started to squeeze their way into the energy industry. Hadoop is in fact now being used by the Tennessee Valley Authority (TVA) and the North American Electric Reliability Corp. (NERC) to aggregate and process data about the health of the power grid, according to a blog post from Cloudera, a startup that’s commercializing Hadoop.

The TVA is collecting data about the reliability of electricity on the power grid using phasor measurement unit (PMU) devices. NERC has designated the TVA system as the national repository of such electrical data, so it aggregates readings from more than 100 PMU devices several thousand times a second. Those readings include voltage, current, frequency and location, the last determined using GPS. Talk about information overload.

But the TVA says Hadoop is a low-cost way to manage this massive amount of data so that it can be accessed all the time. Why? Because Hadoop is designed to run on many cheap commodity computers, and it relies on two distributed components that make the system more reliable and make it easier to run processing jobs over large sets of data.

The first important feature is the Hadoop Distributed File System (HDFS). It’s modeled on the Google File System, which distributes file system data across multiple servers and maintains multiple copies of all of it. The idea is that server failures will be common, so when one server goes down, the information can still be read from another copy. Further, the system is able to recover from outages continually. The TVA says it “liked the idea of being able to lose whole physical machines and still have an operational file system due to Hadoop’s aggressive replication scheme.”
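To make the replication idea concrete, here is a minimal Python sketch. It is not HDFS code: the node names, block IDs and round-robin placement are all made up for illustration (real HDFS uses a rack-aware placement policy), and the only detail borrowed from Hadoop is its default of three replicas per block. The sketch simply shows why losing a machine or two leaves every block readable.

```python
REPLICATION = 3  # HDFS's default replication factor; configurable in practice


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a stand-in chosen for this sketch; it is not the
    placement policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {
            nodes[(i + r) % len(nodes)] for r in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


# Hypothetical cluster and file layout, for illustration only.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = ["blk-0", "blk-1", "blk-2", "blk-3", "blk-4"]
placement = place_blocks(blocks, nodes)

survivors_one = readable_blocks(placement, ["node-a"])
survivors_two = readable_blocks(placement, ["node-a", "node-b"])
survivors_three = readable_blocks(placement, ["node-a", "node-b", "node-c"])
```

With three copies of each block, any two machines in this toy cluster can fail and every block is still readable. Only when three machines fail at once do the blocks stored on exactly those three machines become unavailable.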

The other key part of Hadoop’s software is a Distributed Processing Framework, which uses an algorithm popularized by Google called “MapReduce” to partition compute jobs out to hundreds or thousands of nodes. MapReduce divides applications into bite-sized chunks of work across servers, processing the data where it is located. The TVA says it likes this feature because NERC and its researchers can access and run operations on the electrical data across the servers, in parallel, for quick results.
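The MapReduce pattern itself is simple enough to sketch in a few lines of plain Python. The example below does not use Hadoop’s API, and the sample readings and device names are invented; it only illustrates the three steps of mapping records to key/value pairs, grouping the pairs by key, and reducing each group to one result, here an average frequency per PMU. In a real cluster the map and reduce steps would run in parallel on the machines that hold the data.

```python
from collections import defaultdict

# Hypothetical PMU samples as (device_id, frequency_hz); values are made up.
SAMPLES = [
    ("pmu-1", 60.0),
    ("pmu-2", 59.5),
    ("pmu-1", 60.5),
    ("pmu-2", 60.5),
    ("pmu-3", 60.0),
]


def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    device_id, frequency_hz = record
    yield device_id, frequency_hz


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result (the mean)."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


averages = run_job(SAMPLES)
```

Because each record is mapped independently and each key is reduced independently, the work can be split across as many servers as are available, which is what lets researchers get quick results from a very large repository.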

For the TVA, it’s about performance and price, according to Cloudera’s blog post:

“In the end, Hadoop is a good fit for this project in that it allows us to employ commodity hardware and open source software at a fraction of the price of proprietary systems to achieve a much more manageable expenditure curve as our repository grows.”

Given all the information that will be unearthed via the buildout of new transmission and distribution systems, as well as via the home energy management tools that emerge as part of the smart grid, cheap, powerful tools like Hadoop will inevitably make their way through even more industries, even ones in which terms like open source aren’t yet commonplace.