What are Apache Oozie Action Nodes?

Apache Oozie action nodes are used to define jobs, the individual units of work that are chained together to make up an Oozie workflow. Through action nodes, a workflow triggers the execution of computation and processing tasks as various types of Hadoop jobs.


Action Definition

Actions are defined in the workflow XML using a set of elements that are specific and relevant to each action type. Some elements are common across all action types, while others are action-specific; for example, the Pig action requires a "script" element whereas the Java action does not. Because Oozie is a workflow system built for Hadoop, defining these actions is easy and natural for users who need to drive the various Hadoop tools.
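
As an illustration of this common structure (the node names below are placeholders, not taken from a real workflow), every action node wraps an action-type-specific body with the ok and error transitions that tell Oozie where to go on success and on failure:

<action name="my-action">
    <!-- action-type-specific body, e.g. <map-reduce>, <java>, <pig>, <hive>, ... -->
    <ok to="next-node"/>
    <error to="error-handler-node"/>
</action>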


Action Types

The following action nodes are supported by Apache Oozie.

  1. MapReduce Action
  2. Java Action
  3. Pig Action
  4. FS Action
  5. Sub-Workflow Action
  6. Hive Action
  7. DistCp Action
  8. Email Action
  9. Shell Action
  10. SSH Action
  11. Sqoop Action

Let us see each action type in detail.


1. MapReduce Action

The Map-Reduce action starts a Hadoop Map-Reduce job from a workflow. It can be configured to perform file system cleanup and directory creation before the map-reduce job starts. To run a Hadoop Map-Reduce job, all the required Hadoop JobConf properties must be configured.

The elements of a map-reduce action definition should be specified in the following order (a minimal sketch of this ordering follows the list).

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • streaming or pipes
  • job-xml
  • configuration
  • file
  • archive
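
As a hedged sketch of this ordering (the WordCount class names and node transitions are illustrative placeholders), a plain map-reduce action without streaming or pipes configures the mapper and reducer classes through the configuration element:

<action name="wordcount">
    <map-reduce>
        <job-tracker>foo:8021</job-tracker>
        <name-node>bar:8020</name-node>
        <prepare>
            <delete path="${output}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>${input}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${output}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
</action>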

1.1 Streaming

Streaming job details are specified in the streaming element, since these jobs run binaries or scripts and require executables. Some streaming jobs also need files stored on HDFS to be made available to the mapper/reducer scripts; this is accomplished with the file and archive elements.

Streaming jobs support the following elements.

  • mapper
  • reducer
  • record-reader
  • record-reader-mapping
  • env

Let us see the example of streaming.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="firstjob"> <map-reduce> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${output}"/> </prepare> <streaming> <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper> <reducer>/bin/bash testarchive/bin/reducer.sh</reducer> </streaming> <configuration> <property> <name>mapred.input.dir</name> <value>${input}</value> </property> <property> <name>mapred.output.dir</name> <value>${output}</value> </property> <property> <name>stream.num.map.output.key.fields</name> <value>3</value> </property> </configuration> <file>/home/cloudduggu/testfile.sh#testfile</file> <archive>/home/cloudduggu/testarchive.jar#testarchive</archive> </map-reduce> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


1.2 Pipes

Pipes are used to run C++ MapReduce programs more gracefully, although the feature is not widely used. The user-defined program must be bundled with the workflow application. Certain pipes jobs need files on HDFS to be available to the mapper/reducer programs, which is accomplished using the file and archive elements. Pipes properties can be overridden by specifying them in the job-xml file or in the configuration element.

Pipes jobs support the following elements.

  • map
  • reduce
  • inputformat
  • partitioner
  • writer
  • program

Let us see the example of Pipes.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="firstjob"> <map-reduce> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${output}"/> </prepare> <streaming> <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper> <reducer>/bin/bash testarchive/bin/reducer.sh</reducer> </streaming> <configuration> <property> <name>mapred.input.dir</name> <value>${input}</value> </property> <property> <name>mapred.output.dir</name> <value>${output}</value> </property> <property> <name>stream.num.map.output.key.fields</name> <value>3</value> </property> </configuration> <file>/home/cloudduggu/testfile.sh#testfile</file> <archive>/home/cloudduggu/testarchive.jar#testarchive</archive> </map-reduce> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


2. Java Action

The Java action is used to run custom Java code on the Hadoop cluster. It executes the public static void main(String[] args) method of the specified main Java class. The Java application is executed in the Hadoop cluster as a map-reduce job with a single mapper task. A Java action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before starting the Java application, which allows Oozie to retry the action after a transient or non-transient failure.

Java action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • configuration
  • main-class (compulsory)
  • java-opts
  • arg
  • file
  • archive
  • capture-output

Let us see the example of Java action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="myfirstjavajob"> <java> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${jobOutput}"/> </prepare> <configuration> <property> <name>mapred.queue.name</name> <value>default</value> </property> </configuration> <main-class>org.apache.oozie.MyFirstMainClass</main-class> <java-opts>-Dblah</java-opts> <arg>argument1</arg> <arg>argument2</arg> </java> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


3. Pig Action

The Pig action executes a Pig job in Hadoop. Pig scripts are written in the Pig Latin language, which Pig translates into MapReduce jobs for Hadoop. A Pig action can be configured to perform HDFS files/directories cleanup before the Pig job starts, which allows Oozie to retry the job after a transient failure.

Pig action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • script (compulsory)
  • param
  • argument
  • file
  • archive

Let us see the example of Pig action for Oozie schema 0.2.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2"> ... <action name="myfirstpigjob"> <pig> <job-tracker>foo:8021</job-tracker> <name-node>bar:8020</name-node> <prepare> <delete path="${jobOutput}"/> </prepare> <configuration> <property> <name>mapred.compress.map.output</name> <value>true</value> </property> <property> <name>oozie.action.external.stats.write</name> <value>true</value> </property> </configuration> <script>/mypigscript.pig</script> <argument>-param</argument> <argument>INPUT=${inputDir}</argument> <argument>-param</argument> <argument>OUTPUT=${outputDir}/pig-output3</argument> </pig> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


4. FS Action

The FS action is used to manipulate files and directories in HDFS from a workflow application. FS commands are executed synchronously within the FS action; once the commands have completed, the workflow moves on to the next action.

FS action supports the following commands.

  • move
  • delete
  • mkdir
  • chmod
  • touchz
  • chgrp

Let us see the example of FS action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4"> ... <action name="hdfscommands"> <fs> <name-node>hdfs://foo:8020</name-node> <job-xml>fs-info.xml</job-xml> <configuration> <property> <name>some.property</name> <value>some.value</value> </property> </configuration> <delete path='/home/cloudduggu/temp-data'/> </fs> <ok to="myotherjob"/> <error to="errorcleanup"/> </action> ... </workflow-app>


5. Sub-Workflow Action

The sub-workflow action triggers a child workflow as part of the parent workflow. The child workflow job can run in the same Oozie system or in another Oozie system. The parent workflow completes only when the child workflow completes.

Sub-Workflow action contains the following elements.

  • app-path (compulsory)
  • propagate-configuration
  • configuration

Let us see the example of Sub-Workflow action.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"> ... <action name="a"> <sub-workflow> <app-path>child-wf</app-path> <configuration> <property> <name>input.dir</name> <value>${wf:id()}/second-mr-output</value> </property> </configuration> </sub-workflow> <ok to="end"/> <error to="kill"/> </action> ... </workflow-app>


6. Hive Action

The Hive action is used to run Hive queries on the cluster. Hive is an SQL-like interface for Hadoop and a very popular tool for working with Hadoop data. The Hive query and the related configuration, libraries, and code for user-defined functions have to be packaged as part of the workflow bundle and deployed to HDFS.

Hive action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • script (compulsory)
  • param
  • argument
  • file
  • archive

Let us see the example of Hive action.

<action name=" myHiveAction "> <hive> ... <script>my_script.sql</script> <argument>-hivevar</argument> <argument>InputDir=/home/cloudduggu/input-data</argument> <argument>-hivevar</argument> <argument>OutputDir=${jobOutput}</argument> </hive> </action>


7. DistCp Action

The DistCp action runs Hadoop's distributed copy tool, which copies data within or between Hadoop clusters. It can also be used to move data between Amazon S3 and a Hadoop cluster.

DistCp action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • configuration
  • java-opts
  • arg

Let us see the example of DistCp action.

<action name=" myDistCpAction "> <distcp> ... <arg>hdfs://localhost:8020/path/to/input.txt</arg> <arg>${nameNode2}/path/to/output.txt</arg> </distcp> </action>


8. Email Action

The Email action is used to send email notifications from a workflow application. It takes the usual email parameters: to, cc, subject, and the body of the email.

Email action contains the following elements.

  • to (compulsory)
  • cc
  • subject (compulsory)
  • body (compulsory)

For this action to work, the following SMTP server configuration has to be defined in the oozie-site.xml file (a sketch of the corresponding properties follows the list).

  • oozie.email.smtp.host (default: localhost)
  • oozie.email.smtp.port (default: 25)
  • oozie.email.from.address (default: oozie@localhost)
  • oozie.email.smtp.auth (default: false)
  • oozie.email.smtp.username (default: empty)
  • oozie.email.smtp.password (default: empty)
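
As a hedged sketch of the corresponding oozie-site.xml entries (the host and from-address values are placeholders for your own SMTP setup), only the properties that differ from the defaults need to be set:

<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>false</value>
</property>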

Let us see the example of Email action.

<action name=" myEmailAction "> <email> <to>support@cloudduggu.com</to> <cc>support@cloudduggu.com</cc> <subject>Email notifications for ${wf:id()}</subject> <body>The wf ${wf:id()} successfully completed.</body> </email> </action>


9. Shell Action

The Shell action is used to run shell commands and scripts. The shell command runs on an arbitrary Hadoop cluster node, so the commands being run must be available locally on that node.

Shell action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • exec (compulsory)
  • argument
  • env-var
  • file
  • archive
  • capture-output

Let us see the example of Shell action.

<action name=" myShellAction "> <shell> ... <exec>${EXEC}</exec> <argument>A</argument> <argument>B</argument> <file>${EXEC}#${EXEC}</file> </shell> </action>


10. SSH Action

The SSH action is used to run shell commands on a specified remote host. The command is executed on the remote machine from the remote user's home directory.

SSH action contains the following elements.

  • host (compulsory)
  • command (compulsory)
  • args
  • arg
  • capture-output

Let us see the example of SSH action.

<action name=" mySSHAction "> <ssh> <host>foo@bar.com<host> <command>uploaddata</command> <args>jdbc:derby://bar.com:1527/myDB</args> <args>hdfs://foobar.com:8020/usr/joe/myData</args> </ssh> </action>


11. Sqoop Action

The Sqoop action is used to run Sqoop jobs that import data from relational databases into Hadoop and export data from Hadoop back to relational databases. Sqoop uses JDBC to talk to external database systems.

Sqoop action contains the following elements.

  • job-tracker (compulsory)
  • name-node (compulsory)
  • prepare
  • job-xml
  • configuration
  • command
  • arg
  • file
  • archive

Let us see the example of Sqoop action.

<action name=" mySqoopAction "> <sqoop> ... <command>import --connect jdbc:hsqldb:file:db.hsqldb --table test_table--target-dir hdfs://localhost:8020/user/joe/sqoop_tbl -m 1 </command> </sqoop> </action>