Oozie was created to solve workflow and scheduling problems in Hadoop; it is an obvious fit for building ETL pipelines and integrates naturally with Hive.
A workflow is the core component of any Oozie job: it is the list of steps required to accomplish a task. A workflow therefore gives us a way to describe an ETL process. Here is an example of using Hive in an Oozie workflow:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="etl-by-month-wf" xmlns:sla="uri:oozie:sla:0.1">
    <start to="xxx"/>
    <action name="xxx">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${hiveSiteXml}</job-xml>
            <script>${projectSource}/first_step.hql</script>
            <param>hiveSchema=${hiveSchema}</param>
            <param>dataLocality=${dataOutput}</param>
            <param>flowID=${wf:id()}</param>
            <param>arg1=${argument}</param>
        </hive>
        <ok to="yyy"/>
        <error to="fail"/>
    </action>
    <action name="yyy">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>${hiveSiteXml}</job-xml>
            <script>${projectSource}/second_step.hql</script>
            <param>hiveSchema=${hiveSchema}</param>
            <param>dataLocality=${dataOutput}</param>
            <param>flowID=${wf:id()}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Well, it describes a two-step job; the Hive scripts being executed live in first_step.hql and second_step.hql respectively (both located on HDFS).
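For illustration, first_step.hql might look like the sketch below; table and column names are hypothetical, but each `<param>` from the workflow (hiveSchema, dataLocality, flowID) becomes available in the script through `${...}` substitution:

```sql
-- Hypothetical first step: stage raw events into a monthly aggregate.
-- ${hiveSchema}, ${dataLocality} and ${flowID} are supplied by the
-- <param> elements of the Oozie Hive action.
USE ${hiveSchema};

CREATE TABLE IF NOT EXISTS monthly_agg (
    month STRING,
    total BIGINT
)
LOCATION '${dataLocality}/monthly_agg';

INSERT OVERWRITE TABLE monthly_agg
SELECT substr(event_date, 1, 7), count(*)
FROM raw_events
GROUP BY substr(event_date, 1, 7);
```

This is only a sketch under the assumption that a raw_events table with an event_date column already exists in the target schema.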
Some preparation is required before you can start using it. Put hive-site.xml to HDFS with the following property added:
<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/cloudera/data/tmp</value>
</property>
Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query finishes. However, in case of abnormal Hive client termination, some data may be left behind. The configuration details are as follows: on the HDFS cluster this is set to /tmp/hive-
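Staging the edited hive-site.xml and the scratch directory might look like this (these commands assume a running cluster and the paths used elsewhere in this post):

```shell
# Create the scratch directory referenced by hive.exec.scratchdir
hdfs dfs -mkdir -p /user/cloudera/data/tmp
# Put the edited hive-site.xml where the workflow's job-xml expects it
hdfs dfs -put hive-site.xml /user/cloudera/hive-site.xml
```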
After that, a property file is required:
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
user.name=cloudera
base_url=${nameNode}/user/${user.name}
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib/hive
hiveSiteXml=/user/cloudera/hive-site.xml
oozie.wf.application.path=${base_url}/start.dir/workflow.xml
hiveSchema=your_db
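Note that the workflow also references ${projectSource}, ${dataOutput} and ${argument}, which must be defined in this file as well. The workflow definition and the Hive scripts themselves must be staged on HDFS; assuming ${projectSource} points at the same start.dir directory as oozie.wf.application.path, that might look like this:

```shell
hdfs dfs -mkdir -p /user/cloudera/start.dir
hdfs dfs -put workflow.xml /user/cloudera/start.dir/workflow.xml
hdfs dfs -put first_step.hql second_step.hql /user/cloudera/start.dir/
```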
And the final step: run the job on the Oozie server. It may be done with the following command (assuming you put the properties file locally):
oozie job -oozie http://localhost:11000/oozie -config oozie.conf.properties -run
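The run command prints a job ID; you can then poll the job's state (RUNNING, SUCCEEDED, KILLED) with the Oozie CLI's -info option. The job ID below is illustrative:

```shell
oozie job -oozie http://localhost:11000/oozie -info 0000001-200101000000000-oozie-oozi-W
```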
And a self-reminder:
the action must return something in order to gracefully, in Oozie terms, complete the Hive job. That is, if your script only creates or populates a table, it returns no result and Oozie can't recognize the successful end of the job; in this case, return something, for example "select 1 from table_name;"
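So a populate-only step might end like this (schema and table names are hypothetical, following the author's workaround above):

```sql
-- The INSERT itself produces no result set
INSERT OVERWRITE TABLE ${hiveSchema}.monthly_agg
SELECT substr(event_date, 1, 7), count(*)
FROM ${hiveSchema}.raw_events
GROUP BY substr(event_date, 1, 7);

-- Final statement returns something, so the job can complete gracefully
select 1 from ${hiveSchema}.monthly_agg;
```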