What are Apache Oozie Action Nodes?

Apache Oozie action nodes are used to define jobs, the individual units of work that are chained together to make up an Oozie workflow. Through action nodes, a workflow triggers the execution of computation and processing tasks as various types of Hadoop jobs.


Action Definition

Actions are defined in the workflow XML using a set of elements that are specific and relevant to each action type. Some elements are common across all action types, while others are action-specific; for example, the Pig action requires a "script" element whereas the Java action does not. Because Oozie is a workflow system built for Hadoop, defining these actions is easy and natural for users who need to drive the various Hadoop tools.
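
As an illustration of this common structure (the node names below are placeholders, not taken from a real workflow), every action node wraps an action-type-specific body with the ok and error transitions that tell Oozie where to go on success and on failure:

<action name="my-action">
    <!-- action-type-specific body, e.g. <map-reduce>, <java>, <pig>, <hive>, ... -->
    <ok to="next-node"/>
    <error to="error-handler-node"/>
</action>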


Action Types

The following action nodes are supported by Apache Oozie.

  1. MapReduce Action
  2. Java Action
  3. Pig Action
  4. FS Action
  5. Sub-Workflow Action
  6. Hive Action
  7. DistCp Action
  8. Email Action
  9. Shell Action
  10. SSH Action
  11. Sqoop Action

Let us see each action type in detail.


1. MapReduce Action

The Map-Reduce action starts a Hadoop Map-Reduce job from a workflow. It can be configured to perform file system cleanup and directory creation before the map-reduce job starts. To run a Hadoop Map-Reduce job, all the required Hadoop JobConf properties must be configured.

The elements of a map-reduce action definition should be specified in the following order (a minimal sketch of this ordering follows the list).

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • streaming or pipes
  • job-xml
  • configuration
  • file
  • archive
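
As a hedged sketch of this ordering (the WordCount class names and node transitions are illustrative placeholders), a plain map-reduce action without streaming or pipes configures the mapper and reducer classes through the configuration element:

<action name="wordcount">
    <map-reduce>
        <job-tracker>foo:8021</job-tracker>
        <name-node>bar:8020</name-node>
        <prepare>
            <delete path="${output}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>${input}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${output}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
</action>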

1.1 Streaming

Streaming job details are specified in the streaming element, since these jobs run binaries or scripts and require executables. Some streaming jobs also need files stored on HDFS to be made available to the mapper/reducer scripts; this is accomplished with the file and archive elements.

Streaming jobs support the following elements.

  • mapper
  • reducer
  • record-reader
  • record-reader-mapping
  • env

Let us see the example of streaming.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="firstjob"> <map-reduce> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${output}"/> </prepare> <streaming> <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper> <reducer>/bin/bash testarchive/bin/reducer.sh</reducer> </streaming> <configuration> <property> <name>mapred.input.dir</name> <value>${input}</value> </property> <property> <name>mapred.output.dir</name> <value>${output}</value> </property> <property> <name>stream.num.map.output.key.fields</name> <value>3</value> </property> </configuration> <file>/home/cloudduggu/testfile.sh#testfile</file> <archive>/home/cloudduggu/testarchive.jar#testarchive</archive> </map-reduce> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


1.2 Pipes

Pipes are used to run C++ MapReduce programs more gracefully, although the feature is not widely used. The user-defined program must be bundled with the workflow application. Certain pipes jobs need files on HDFS to be available to the mapper/reducer programs, which is accomplished using the file and archive elements. Pipes properties can be overridden by specifying them in the job-xml file or in the configuration element.

Pipes jobs support the following elements.

  • map
  • reduce
  • inputformat
  • partitioner
  • writer
  • program

Let us see the example of Pipes.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="firstjob"> <map-reduce> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${output}"/> </prepare> <streaming> <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper> <reducer>/bin/bash testarchive/bin/reducer.sh</reducer> </streaming> <configuration> <property> <name>mapred.input.dir</name> <value>${input}</value> </property> <property> <name>mapred.output.dir</name> <value>${output}</value> </property> <property> <name>stream.num.map.output.key.fields</name> <value>3</value> </property> </configuration> <file>/home/cloudduggu/testfile.sh#testfile</file> <archive>/home/cloudduggu/testarchive.jar#testarchive</archive> </map-reduce> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


2. Java Action

The Java action is used to run custom Java code on the Hadoop cluster. It executes the public static void main(String[] args) method of the specified main Java class. The Java application is executed in the Hadoop cluster as a map-reduce job with a single mapper task. A Java action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before starting the Java application, which allows Oozie to retry the action after a transient or non-transient failure.

Java action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • configuration
  • main-class (compulsory)
  • java-opts
  • arg
  • file
  • archive
  • capture-output

Let us see the example of Java action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="myfirstjavajob"> <java> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${jobOutput}"/> </prepare> <configuration> <property> <name>mapred.queue.name</name> <value>default</value> </property> </configuration> <main-class>org.apache.oozie.MyFirstMainClass</main-class> <java-opts>-Dblah</java-opts> <arg>argument1</arg> <arg>argument2</arg> </java> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


3. Pig Action

The Pig action executes a Pig job in Hadoop. Pig scripts are written in the Pig Latin language, which Pig translates into MapReduce jobs for Hadoop. A Pig action can be configured to perform HDFS files/directories cleanup before the Pig job starts, which allows Oozie to retry the job after a transient failure.

Pig action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • script (compulsory)
  • param
  • argument
  • file
  • archive

Let us see the example of Pig action for Oozie schema 0.2.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2"> ... <action name="myfirstpigjob"> <pig> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${jobOutput}"/> </prepare> <configuration> <property> <name>mapred.compress.map.output</name> <value>true</value> </property> <property> <name>oozie.action.external.stats.write</name> <value>true</value> </property> </configuration> <script>/mypigscript.pig</script> <argument>-param</argument> <argument>INPUT=${inputDir}</argument> <argument>-param</argument> <argument>OUTPUT=${outputDir}/pig-output3</argument> </pig> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


4. FS Action

The FS action is used to manipulate files and directories in HDFS from a workflow application. FS commands are executed synchronously within the FS action; once the commands have completed, the workflow moves on to the next action.

FS action supports the following commands.

  • move
  • delete
  • mkdir
  • chmod
  • touchz
  • chgrp

Let us see the example of FS action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4"> ... <action name="hdfscommands"> <fs> <name-node>hdfs://foo:8020</name-node> <job-xml>fs-info.xml</job-xml> <configuration> <property> <name>some.property</name> <value>some.value</value> </property> </configuration> <delete path='/home/cloudduggu/temp-data'/> </fs> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


5. Sub-Workflow Action

The sub-workflow action triggers a child workflow as part of the parent workflow. The child workflow job can run in the same Oozie system or in another Oozie system. The parent workflow completes only when the child workflow completes.

Sub-Workflow action contains the following elements.

  • app-path (compulsory)
  • propagate-configuration
  • configuration

Let us see the example of Sub-Workflow action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="a"> <sub-workflow> <app-path>child-wf</app-path> <configuration> <property> <name>input.dir</name> <value>${wf:id()}/second-mr-output</value> </property> </configuration> </sub-workflow> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


6. Hive Action

The Hive action is used to run Hive queries on the cluster. Hive is an SQL-like interface for Hadoop and a very popular tool for working with Hadoop data. The Hive query and the related configuration, libraries, and code for user-defined functions have to be packaged as part of the workflow bundle and deployed to HDFS.

Hive action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • script (compulsory)
  • param
  • argument
  • file
  • archive

Let us see the example of Hive action.

<action name=" myHiveAction "> <hive> ... <script>my_script.sql</script> <argument>-hivevar</argument> <argument>InputDir=/home/cloudduggu/input-data</argument> <argument>-hivevar</argument> <argument>OutputDir=${jobOutput}</argument> </hive> </action>


7. DistCp Action

The DistCp action runs Hadoop's distributed copy tool, which copies data within or between Hadoop clusters. It can also be used to move data between Amazon S3 and a Hadoop cluster.

DistCp action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • configuration
  • java-opts
  • arg

Let us see the example of DistCp action.

<action name=" myDistCpAction "> <distcp> ... <arg>hdfs://localhost:8020/path/to/input.txt</arg> <arg>${nameNode2}/path/to/output.txt</arg> </distcp> </action>


8. Email Action

The Email action is used to send email notifications from a workflow application. It takes the usual email parameters: to, cc, subject, and the body of the email.

Email action contains the following elements.

  • to (compulsory)
  • cc
  • subject (compulsory)
  • body (compulsory)

For this action to work, the following SMTP server configuration has to be defined in the oozie-site.xml file (a sketch of the corresponding properties follows the list).

  • oozie.email.smtp.host (default: localhost)
  • oozie.email.smtp.port (default: 25)
  • oozie.email.from.address (default: oozie@localhost)
  • oozie.email.smtp.auth (default: false)
  • oozie.email.smtp.username (default: empty)
  • oozie.email.smtp.password (default: empty)
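
As a hedged sketch of the corresponding oozie-site.xml entries (the host and from-address values are placeholders for your own SMTP setup), only the properties that differ from the defaults need to be set:

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>false</value>
</property>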

Let us see the example of Email action.

<action name=" myEmailAction "> <email> <to>support@cloudduggu.com</to> <cc>support@cloudduggu.com</cc> <subject>Email notifications for ${wf:id()}</subject> <body>The wf ${wf:id()} successfully completed.</body> </email> </action>


9. Shell Action

The Shell action is used to run shell commands and scripts. The shell command runs on an arbitrary Hadoop cluster node, so the commands being run must be available locally on that node.

Shell action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • exec (compulsory)
  • argument
  • env-var
  • file
  • archive
  • capture-output

Let us see the example of Shell action.

<action name=" myShellAction "> <shell> ... <exec>${EXEC}</exec> <argument>A</argument> <argument>B</argument> <file>${EXEC}#${EXEC}</file> </shell> </action>


10. SSH Action

The SSH action is used to run shell commands on a specified remote host. The command is executed on the remote machine from the remote user's home directory.

SSH action contains the following elements.

  • host (compulsory)
  • command (compulsory)
  • args
  • arg
  • capture-output

Let us see the example of SSH action.

<action name=" mySSHAction "> <ssh> <host>foo@bar.com<host> <command>uploaddata</command> <args>jdbc:derby://bar.com:1527/myDB</args> <args>hdfs://foobar.com:8020/usr/joe/myData</args> </ssh> </action>


11. Sqoop Action

The Sqoop action is used to run Sqoop jobs that import data from relational databases into Hadoop and export data from Hadoop back to relational databases. Sqoop uses JDBC to talk to external database systems.

Sqoop action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • command
  • arg
  • file
  • archive

Let us see the example of Sqoop action.

<action name=" mySqoopAction "> <sqoop> ... <command>import --connect jdbc:hsqldb:file:db.hsqldb --table test_table--target-dir hdfs://localhost:8020/user/joe/sqoop_tbl -m 1 </command> </sqoop> </action>