eBay’s new Pulsar framework will analyze your data in real time

eBay has a new open-source, real-time analytics and stream-processing framework called Pulsar, which the company says is already running in production and is available for others to download, according to an eBay blog post on Monday. The online auction site is using Pulsar to gather and process data on user interactions and behaviors, and said the framework “scales to a million events per second with high availability.”

While eBay uses Hadoop for its batch processing and analytics needs, the company said it now needs a way to process and analyze data in real time for better personalization, fraud and bot detection, and dashboard creation, among other uses.

To achieve what eBay is calling for, a system needs to process millions of events per second, offer low latency with “sub-second event processing and delivery,” and span multiple data centers with “no cluster downtime during software upgrade,” according to the blog post.

eBay decided the best way to go about this was to build its own complex event processing (CEP) framework, which also includes a Java-based framework on top of which developers can build other applications.

eBay pulsar pipeline

Developers skilled with SQL should feel at home with Pulsar because the framework can be operated with a “SQL-like event processing language.”

The real-time analytics data pipeline built into Pulsar is essentially a combination of a variety of components that are linked together (but can function independently) and form the data-processing conveyor belt through which all that user data flows. Some of these components include a data collector, an event distributor and a metrics calculator.

It’s within Pulsar that eBay can enrich the data with additional information — like geo-location — remove unnecessary data attributes, and aggregate batches of events to “add up metrics along a set of dimensions over a time window.”
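To make those enrichment and aggregation steps concrete, here is a minimal Python sketch of the same idea — this is not eBay's code, and the field names (`ip`, `ts`, `page`, `raw_user_agent`) are hypothetical stand-ins for whatever attributes a real event carries:

```python
from collections import defaultdict

def enrich(event, geo_lookup):
    """Add geo-location info and drop an attribute we don't need downstream."""
    enriched = {k: v for k, v in event.items() if k != "raw_user_agent"}
    enriched["country"] = geo_lookup.get(event.get("ip"), "unknown")
    return enriched

def aggregate(events, dimensions, window_seconds):
    """Count events along a set of dimensions over fixed time windows."""
    buckets = defaultdict(int)
    for e in events:
        window = e["ts"] // window_seconds  # which time window the event falls in
        key = (window,) + tuple(e[d] for d in dimensions)
        buckets[key] += 1
    return dict(buckets)

geo = {"1.2.3.4": "US"}
events = [
    {"ip": "1.2.3.4", "ts": 5,  "page": "home", "raw_user_agent": "x"},
    {"ip": "1.2.3.4", "ts": 8,  "page": "home", "raw_user_agent": "y"},
    {"ip": "9.9.9.9", "ts": 65, "page": "item", "raw_user_agent": "z"},
]
enriched = [enrich(e, geo) for e in events]
counts = aggregate(enriched, dimensions=("country", "page"), window_seconds=60)
# Two "home" views from the US in the first window; one "item" view in the second.
```

A production pipeline would of course do this incrementally as events stream in, rather than over a finished list, but the shape of the computation — enrich, prune, then roll up per window and dimension — is the same.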

The whole idea is to have all that real-time data available in Pulsar to be treated “like a database table” against which developers can run the necessary SQL queries for analytic purposes, the post stated.
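To illustrate the “like a database table” idea — this sketch uses Python's in-memory SQLite rather than Pulsar's own SQL-like event processing language, and the event schema is hypothetical — a window of stream events can be queried exactly as if it were a table:

```python
import sqlite3

# A hypothetical batch of click events, standing in for one window of the stream.
events = [
    ("u1", "home", 1),
    ("u1", "item", 1),
    ("u2", "home", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, page TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)

# Query the event "table" the way an analyst would query any database table.
rows = conn.execute(
    "SELECT page, SUM(clicks) FROM events GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('home', 2), ('item', 1)]
```

The appeal for developers is that familiar GROUP BY-style aggregation carries over directly, with the stream's time window playing the role of the table's contents.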

From the eBay blog:
[blockquote person=”eBay” attribution=”eBay”]Pulsar CEP processing logic is deployed on many nodes (CEP cells) across data centers. Each CEP cell is configured with an inbound channel, outbound channel, and processing logic. Events are typically partitioned based on a key such as user id. All events with the same partitioned key are routed to the same CEP cell. In each stage, events can be partitioned based on a different key, enabling aggregation across multiple dimensions. To scale to more events, we just need to add more CEP cells into the pipeline. [/blockquote]
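The key-based routing the post describes — every event with the same partition key landing on the same CEP cell — can be sketched with a simple stable hash. This is an illustrative scheme, not eBay's actual routing code, and `route`, `num_cells` and the field names are hypothetical:

```python
import hashlib

def route(event, num_cells, key_field="user_id"):
    """Pick a CEP cell for an event by hashing its partition key."""
    key = str(event[key_field]).encode()
    # md5 gives a hash that is stable across processes and machines,
    # unlike Python's built-in hash() with randomization enabled.
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return digest % num_cells

# All events with the same user_id land on the same cell...
a = route({"user_id": "u42"}, num_cells=8)
b = route({"user_id": "u42"}, num_cells=8)

# ...and a later stage can re-partition on a different key,
# enabling aggregation across another dimension.
c = route({"user_id": "u42", "item_id": "i7"}, num_cells=8, key_field="item_id")
```

Scaling then amounts to raising `num_cells` — adding more CEP cells to the pipeline, as the post puts it — though a real system would use consistent hashing or similar to avoid reshuffling every key when the cell count changes.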

Here’s what the Pulsar deployment architecture looks like:

Pulsar deployment

eBay plans to add a dashboard and a real-time reporting API to Pulsar, and to integrate it with similar services, like the Druid open-source database for real-time analytics. Druid, created by the analytics startup Metamarkets (see disclosure), just moved to the Apache 2 software license to attract more users.

Pulsar is open sourced under the Apache 2.0 License and the GNU General Public License version 2.0.

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.