Splunk connects with Hadoop to master machine data

Splunk has integrated its flagship product with Apache Hadoop, adding large-scale batch analytics to its existing strength in real-time search, analysis and visualization of server logs and other machine-generated data. Splunk has long had to answer questions about why anyone should use its product instead of Hadoop, and the new integration not only addresses those questions but also opens the door to hybrid environments.

Machine data can be very valuable: it not only lets administrators troubleshoot specific problems but can also help organizations identify problematic trends. These might be server-level bugs that make IT systems run less than optimally or page-load problems that consistently drive web visitors to leave a company’s site early. Machine data can also reveal positive trends, such as which browsers or operating systems visitors use most frequently, which can help determine where to focus development dollars.
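To make the browser example concrete, here is a minimal sketch of the kind of tally such an analysis boils down to. It is not Splunk's or Hadoop's method, just plain Python; the sample log lines, the marker list and the function names are all made up for illustration, and real user-agent parsing is considerably messier.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Apache "combined" format (invented for this sketch).
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; rv:9.0) Gecko/20100101 Firefox/9.0"',
    '203.0.113.6 - - [01/Jan/2012:10:00:02 +0000] "GET /shop HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.7 (KHTML, like Gecko) '
    'Chrome/16.0.912.75 Safari/535.7"',
    '203.0.113.7 - - [01/Jan/2012:10:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Macintosh; rv:9.0) Gecko/20100101 Firefox/9.0"',
    '203.0.113.8 - - [01/Jan/2012:10:00:05 +0000] "GET /cart HTTP/1.1" 500 0 "-" '
    '"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"',
    '203.0.113.9 - - [01/Jan/2012:10:00:09 +0000] "GET /health HTTP/1.1" 200 2 "-" '
    '"curl/7.21.0"',
]

# In the combined format, the user agent is the last double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Assumed, deliberately crude markers. Chrome is checked before Safari because
# Chrome user agents also contain the word "Safari".
BROWSER_MARKERS = [
    ("Chrome", "Chrome"),
    ("Firefox", "Firefox"),
    ("MSIE", "Internet Explorer"),
    ("Safari", "Safari"),
]


def browser_family(user_agent):
    """Map a raw user-agent string to a coarse browser name."""
    for marker, name in BROWSER_MARKERS:
        if marker in user_agent:
            return name
    return "Other"


def count_browsers(lines):
    """Tally browser families across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:  # lines without a trailing quoted field are skipped
            counts[browser_family(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for name, hits in count_browsers(SAMPLE_LOG).most_common():
        print(name, hits)
```

The same counting logic could in principle run over a live stream or as a nightly batch job; the difference lies in the scale and the surrounding tooling, not the tally itself.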

Now Splunk users have a variety of options for what to do with their data. They can keep using Splunk as normal and just port the data it collects to Hadoop, or they can bring Hadoop data back into Splunk to make the results of Hadoop jobs easier to visualize and sort through. Further, users can submit MapReduce jobs to a Hadoop cluster directly from Splunk, making the joint environment that much easier to use.

The Hadoop integration should result in wins for both sides of the equation. Hadoop is popular but also complex, which is why we are now starting to see products emerge to simplify the process of running and visualizing Hadoop jobs. That makes Splunk a natural complement to Hadoop deployments where machine data is involved. For Splunk, the integration means its nearly 3,000 paying customers might end up spending even more as they expand their use of the product to include Hadoop-related tasks.

One such customer might be e-commerce site Etsy, which uses both Splunk and Hadoop extensively to analyze machine data in order to determine how customers are interacting with the site. Previously these were separate environments, with Splunk doing the real-time analysis and Hadoop doing nightly batch analysis, but Etsy could now start using both products in tandem if it were so inclined.

According to Splunk Co-Founder and CTO Erik Swan, Hadoop might be just the first next-generation data store that his company decides to integrate with. Many NoSQL databases suffer from the same problems as Hadoop in that they lack easy user interfaces and visualization capabilities, he said, so they seem like a natural next step. Hadoop had to be first, though, because customer interest was so high.
