Big Data SQL makes Hadoop servant, Oracle master

Ever since Cloudera’s October 2012 announcement of its Impala SQL-on-Hadoop engine, the database industry has seemed obsessed with fusing the SQL query language with Hadoop. These pairings fall roughly into two groups: standalone SQL-on-Hadoop engines from Hadoop distribution vendors, and SQL-to-Hadoop bridges from relational database and data warehouse vendors, including Teradata, HP Vertica, IBM and Microsoft.

SQL on Hadoop, redux
Oracle has had an offering out there too, in the form of its Big Data Connectors and, most interestingly, its Oracle SQL Connector for HDFS (OSCH). That connector allowed data in Hive tables and HDFS files to be imported into the Oracle Database catalog as external tables which could then be queried with Oracle SQL and even joined to physical tables in the Oracle Database.

My own observation was that Oracle didn’t push OSCH very hard, and I wasn’t certain why that was. But today the reason became apparent: Oracle had a much more sophisticated Hadoop integration technology under development. Tuesday, Oracle announced that technology, called Oracle Big Data SQL (BDS), to be made generally available on its Big Data Appliance this calendar quarter (that is, sometime before October).

Rather than just consume Oracle’s press release, I requested a briefing, and was able to speak with Dan McClary, Oracle’s principal product manager for Big Data and Hadoop. I was lucky to have this briefing; McClary was able to explain how BDS works in great detail.

What’s changed
So how different are OSCH and BDS? Vastly so. McClary explained to me that OSCH was never really intended to be used for interactive query, but rather for ETL-like scenarios. While that seemed just a tad revisionist to me, I was nonetheless impressed with BDS’ much greater capabilities. Essentially, what Oracle did with BDS was to build a translator of sorts that works natively with both Oracle and Hadoop.

On the Hadoop side, BDS uses its Smart Scan for Hadoop technology, which interfaces directly with Hadoop’s YARN cluster management layer to parallelize properly across the cluster. BDS can query Hadoop data in virtually any format, including custom formats, as long as a “SerDe” (serializer-deserializer) is available. BDS also determines schema as it reads the data (schema-on-read), a key concept in the Hadoop world. On the Oracle side, BDS returns the data in Oracle Block Stream format so the Exadata Database Machine can work with it natively. BDS and Smart Scan for Hadoop are, in McClary’s words, “Oracle on the top, Hadoop on the bottom.”
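For readers less familiar with the SerDe idea, here is a minimal Python sketch of schema-on-read. It is purely illustrative and is not Oracle's implementation: the `SERDES` registry, the `scan` function and the sample record layouts are all hypothetical names invented for this example. The point is only that the schema is discovered while the raw data is read, through whichever deserializer matches the format.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]


def csv_deserialize(line: str) -> Row:
    # Hypothetical layout: "name,age" on each line.
    name, age = line.split(",")
    return {"name": name, "age": int(age)}


def json_deserialize(line: str) -> Row:
    # One JSON object per line.
    return json.loads(line)


# Hypothetical registry: a format can be read as long as a deserializer exists.
SERDES: Dict[str, Callable[[str], Row]] = {
    "csv": csv_deserialize,
    "json": json_deserialize,
}


def scan(lines: Iterable[str], fmt: str) -> Tuple[Dict[str, str], List[Row]]:
    """Deserialize raw lines and infer column names and types while reading."""
    deserialize = SERDES[fmt]
    rows = [deserialize(line) for line in lines]
    schema: Dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            # The first value seen for a column decides its reported type.
            schema.setdefault(column, type(value).__name__)
    return schema, rows


if __name__ == "__main__":
    print(scan(['{"name": "ann", "age": 31}'], "json"))
    print(scan(["bob,42"], "csv"))
```

Nothing about the data's structure is declared up front in this sketch; supporting a new format means registering one more deserializer.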

NoSQL and security
BDS also allows standard JSON functions to be used in Oracle SQL and will then query natively against JSON data in HDFS. That means semi-structured data can be queried with SQL, and this capability will be usable against the Oracle NoSQL Database when BDS becomes generally available. McClary told me there was a strong possibility that this NoSQL interface could eventually be extended to work with HBase, Cassandra and even MongoDB.

Finally, BDS also makes use of Apache Sentry so that Oracle’s own role-based security scheme can be projected over Hadoop data. Other, more advanced security capabilities, including data redaction, can be enforced over Hadoop data, as long as that data is queried through Oracle. This makes possible a model whereby production Hadoop clusters are locked down, and users get to the cluster data exclusively via Oracle, through which very granular and specific role-based security is imposed and enforced.
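The locked-down model can be pictured with a small Python sketch. It is a conceptual illustration only, not how Oracle or Apache Sentry actually implement it: the `ROLE_POLICY` table, the role names, the column names and the `query_through_gateway` function are all hypothetical. The idea is that every read passes through a single gateway, which decides per role which columns are returned as-is, which are redacted, and which are withheld.

```python
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical policy: per role, the columns shown in full and those redacted.
ROLE_POLICY: Dict[str, Dict[str, set]] = {
    "analyst": {"visible": {"customer", "city"}, "redacted": {"ssn"}},
    "auditor": {"visible": {"customer", "city", "ssn"}, "redacted": set()},
}


def query_through_gateway(role: str, rows: List[Row]) -> List[Row]:
    """Return rows as the given role is allowed to see them."""
    policy = ROLE_POLICY.get(role)
    if policy is None:
        # Unknown roles get nothing: the underlying data is reachable
        # only through this gateway.
        raise PermissionError(f"role {role!r} has no access")
    result: List[Row] = []
    for row in rows:
        filtered: Row = {}
        for column, value in row.items():
            if column in policy["redacted"]:
                filtered[column] = "***"  # mask the value, keep the column
            elif column in policy["visible"]:
                filtered[column] = value
            # Columns in neither set are dropped entirely.
        result.append(filtered)
    return result


if __name__ == "__main__":
    data = [{"customer": "Ann", "city": "Oslo", "ssn": "123-45-6789"}]
    print(query_through_gateway("analyst", data))
    print(query_through_gateway("auditor", data))
```

Because the policy lives in one place, tightening or loosening access in this sketch means changing the policy table rather than touching the stored data.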

Hadoop as workhorse
While the industry has no shortage of SQL-on-Hadoop solutions, SQL-to-Hadoop bridges are a bit different from standalone solutions like Hive and Impala. The bridges don’t merely allow SQL-knowledgeable professionals to work with Hadoop. Instead, they bring Hadoop data to specific database platforms, effectively utilizing Hadoop as a specialized, embedded engine, rather than exposing it as a new database platform in its own right.

I covered Actian’s Hadoop SQL Edition in my Weekly Update two weeks ago. Actian’s product, which integrates its Vector database with Hadoop, was coincidentally made generally available Tuesday, the same day BDS was announced. Big Data SQL and Hadoop SQL Edition take different architectural approaches, but they both embed Hadoop. Microsoft does something similar with its Analytics Platform System, released in April.

Like other infrastructure, Hadoop is most powerful when it recedes from view and works behind the scenes. So expect to see more Hadoop-embedded solutions emerge. In fact, Oracle’s McClary said BDS technology may even make its way from Exadata and the Big Data Appliance to the mainstream Oracle database.