Yahoo Open-Sources Real-Time MapReduce

Updated: Yahoo has open-sourced its S4 project, a platform for developing real-time MapReduce applications. As we’ve seen with Google’s new Caffeine infrastructure for its Instant Search features, as well as other “NoHadoop” tools, there’s a growing trend of unchaining large-scale data analysis – via MapReduce, in particular – from its batch-processing roots.

Inside Yahoo Labs, S4 is being used for “[a]pplications such as personalization, user feedback, malicious traffic detection, and real-time search.” The project website gives a high-level description of how S4 works:

In S4, we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent.
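
To make that description concrete, here is a minimal Python sketch of keyed-stream dispatch. It is not S4's actual API: the names `CountingNode`, `Dispatcher` and `run` are hypothetical, and routing by a hash of the key is only an assumption about what "dispatched intelligently" might mean. The sketch shows input key-value pairs arriving one at a time, each routed to a processing node, and each node emitting an output key-value pair right away rather than waiting for a batch to finish.

```python
import zlib
from collections import defaultdict


class CountingNode:
    """Hypothetical processing node: keeps a running total per key."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, key, value):
        # Update state and emit an output key-value pair immediately.
        self.counts[key] += value
        return key, self.counts[key]


class Dispatcher:
    """Hypothetical dispatcher: routes each event to a node by key hash.

    Hash-based routing is an assumption made for illustration; it
    guarantees that all events for one key reach the same node.
    """

    def __init__(self, num_nodes):
        self.nodes = [CountingNode() for _ in range(num_nodes)]

    def node_index(self, key):
        # crc32 is stable across runs, unlike Python's built-in str hash.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def dispatch(self, key, value):
        return self.nodes[self.node_index(key)].process(key, value)


def run(events, num_nodes=3):
    """Feed a stream of (key, value) events through the dispatcher."""
    dispatcher = Dispatcher(num_nodes)
    return [dispatcher.dispatch(key, value) for key, value in events]


# Example stream: one event per search query, keyed by a made-up user ID.
events = [("user:42", 1), ("user:7", 1), ("user:42", 1)]
outputs = run(events)
```

In this sketch, each output pair is available as soon as its input event has been processed, which is the property the search example in the quote relies on.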

The S4 wiki provides more detailed information on the project, and the code is available on GitHub.

S4 should become a hot commodity among the community of MapReduce – particularly Hadoop – developers. Just as it has with certain tools developed for Yahoo’s Hadoop distribution, Cloudera seems likely to incorporate S4 into its own Hadoop distribution, which has established itself as a solid choice among enterprise users. Perhaps Karmasphere, which sells a platform for developing Hadoop applications, will take up the S4 cause as well. Either way, S4 represents a free- to low-cost alternative to presently available proprietary real-time processing options like multiple IBM InfoSphere products and SAP’s new in-memory HANA appliance.

The recent analytics landgrab illustrates just how hungry customers are to derive insights from their personal data deluges. Churning through streaming data is probably still a ways out for many organizations, but having the tools to actually do it should help catalyze a few efforts. Workloads like those suggested by Yahoo for S4 bring enough value to make it at least worth a try.


