пʼятниця, 18 квітня 2014 р.

Building BuilData ETL with Hive and Oozie

Perhaps, Hive is the most successful component of today's Hadoop infrastructure. It provides simple and efficient way of creating Hadoop-based data processing jobs with comfortable SQL-like language. But, in contract to Pig, it's not a workflow-friendly language and requires additional effort to create a real multi-step ETL.
Oozie was created to eliminate workflow/scheduling issues and, obvious, may be used to create ETL and naturally engages Hive.



Workflow is a core component of any Oozie job and it is list of required steps to accomplish task. So, workflow gives a way to describe ETL and there is the example of using Hive in Oozie workflow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="etl-by-month-wf" xmlns:sla="uri:oozie:sla:0.1">
  <start to="xxx"/>

  <action name="xxx">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${hiveSiteXml}</job-xml>
            <script>${projectSource}/first_step.hql</script>
            <param>hiveSchema=${hiveSchema}</param>
            <param>dataLocality=${dataOutput}</param>
            <param>flowID=${wf:id()}</param>
            <param>arg1=${argument}</param>
        </hive>
        <ok to="yyy"/>
        <error to="fail"/>
  </action>

  <action name="yyy">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${hiveSiteXml}</job-xml>
            <script>${projectSource}/second_step.hql</script>
            <param>hiveSchema=${hiveSchema}</param>
            <param>dataLocality=${dataOutput}</param>
            <param>flowID=${wf:id()}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
  </action>

  <kill name="fail">
        <message>Error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>


Well, it describes two-steps job, content of executed hive scripts are located in first_step.hql and second_step.hql respectively (both located on HDFS).
Some preparations are required before start of using it
Put to HDFS hive-site.xml with added property:
<property>
<name>hive.exec.scratchdir</name>
<value>/user/cloudera/data/tmp</value>
</property>

Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the hive client when the query is finished. However, in cases of abnormal hive client termination, some data may be left behind. The configuration details are as follows: On the HDFS cluster this is set to /tmp/hive- by default and is controlled by the configuration variable hive.exec.scratchdir On the client machine, this is hardcoded to /tmp/ - permission issue
After that, property file is required:
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
user.name=cloudera
base_url=${nameNode}/user/${user.name}

oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib/hive
hiveSiteXml=/user/cloudera/hive-site.xml

oozie.wf.application.path=${base_url}/start.dir/workflow.xml

hiveSchema=your_db

Put workflow to the path specified at oozie.wf.application.path. Also, directory lib may be created at this path and used for saving different jars required by workflow (for example, custom Hive UDF).
And the final step: run job on oozie server, it may be done with the next command (assume you put properties localy):
oozie job -oozie http://localhost:11000/oozie -config oozie.conf.properties -run

1 коментар:

  1. and self-reminder:
    the action must return something to gracefully, in terms of Oooze, compete hive-job, i.e. if your select create/populate table, it doesn't return result and Oozie can't recognize successful ending of job; in this case, return something, for example "select 1 from table_name;"

    ВідповістиВидалити