Thursday, March 20, 2014

XQuery on Hadoop

Java is the mother tongue of most Hadoop engineers. In recent years Python has become popular, and R is used by data scientists on Hadoop. Pig Latin and HiveQL are the de facto mainstream languages for Hadoop nowadays. Oracle decided not to stop there and made it possible to write MapReduce jobs in XQuery! Unbelievable, XML fans must be happy :)

Let's review a simple example.

First of all, the Oracle BigData Lite VM must be downloaded (it is free, but takes 25 GB of disk space).

After installation, a test dataset must be created. I put 2 files with sample website access data into the HDFS directory /user/oracle/xquery/input. An example of the content:
2013-10-28T06:00:00, chrome, index.html, 200
2013-10-28T08:30:02, firefox, index.html, 200
2013-10-28T08:32:50, ie9, about.html, 200

Next step: create an XQuery script (my_xquery.xq) to process the data (a simple grouping of page visits by date):

import module "oxh:text";

for $line in text:collection("/user/oracle/xquery/input/*.txt")
let $split := fn:tokenize($line, "\s*,\s*")
let $time := xs:dateTime($split[1])
let $day := xs:date($time)
group by $day
return text:put($day || ", " || fn:count($line))
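
To sanity-check what the query should return, the same logic can be reproduced locally on a few lines. The following is a minimal Python sketch and is not part of Oracle XQuery for Hadoop: the function name count_visits_per_day is my own, and it assumes the comma-separated layout of the sample lines shown above.

```python
import re
from collections import Counter
from datetime import datetime

# The three sample lines from the post (not the full dataset).
SAMPLE_LINES = [
    "2013-10-28T06:00:00, chrome, index.html, 200",
    "2013-10-28T08:30:02, firefox, index.html, 200",
    "2013-10-28T08:32:50, ie9, about.html, 200",
]


def count_visits_per_day(lines):
    """Hypothetical helper: group log lines by calendar day and count them."""
    counts = Counter()
    for line in lines:
        # Same split pattern as fn:tokenize($line, "\s*,\s*") in the query.
        fields = re.split(r"\s*,\s*", line.strip())
        # Parse the timestamp and keep only the date part, like xs:date($time).
        day = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S").date()
        counts[day.isoformat()] += 1
    # One "day, count" string per group, like the text:put(...) call.
    return [f"{day}, {n}" for day, n in sorted(counts.items())]


if __name__ == "__main__":
    for row in count_visits_per_day(SAMPLE_LINES):
        print(row)
```

For the three sample lines only, this prints a single row, 2013-10-28, 3, since all three visits fall on the same day.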


Now the script is ready to run. Execute it from the command line:
hadoop jar $OXH_HOME/lib/oxh.jar my_xquery.xq -output /user/oracle/xquery/output -clean -ls

Options:
-output specifies the output directory
-clean removes the output directory if it exists
-ls lists the contents of the output directory after the run

Here is the result:


That's it: the XQuery was translated to MapReduce (similar to Pig Latin or HiveQL). This functionality is part of Oracle BigData Connectors for Hadoop, and more information with examples can be found here
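
To get an intuition for how a group-by query like this one could map onto MapReduce phases, here is a conceptual Python sketch of the general map / shuffle / reduce pattern. It is only an illustration and not the execution plan that Oracle XQuery for Hadoop actually generates; all function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (day, 1) pair for one log line."""
    timestamp = line.split(",")[0].strip()
    day = timestamp.split("T")[0]  # date part of an ISO timestamp
    yield day, 1


def shuffle(pairs):
    """Shuffle step: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(day, values):
    """Reduce step: sum the ones for a day and format the output row."""
    return f"{day}, {sum(values)}"


def run_job(lines):
    """Run the three phases in-process over a list of log lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return [reduce_phase(day, values) for day, values in sorted(grouped.items())]


if __name__ == "__main__":
    lines = [
        "2013-10-28T06:00:00, chrome, index.html, 200",
        "2013-10-28T08:30:02, firefox, index.html, 200",
        "2013-10-28T08:32:50, ie9, about.html, 200",
    ]
    for row in run_job(lines):
        print(row)
```

On a real cluster the map and reduce steps would run in parallel on different nodes, and the shuffle would be handled by the framework; here everything runs in a single process just to show the data flow.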