What is Apache Oozie Workflow?

An Apache Oozie workflow is a collection of action nodes and control nodes arranged in a directed acyclic graph (DAG). The graph captures the control dependencies between actions, where each action typically represents a Hadoop job such as a MapReduce, Pig, Hive, Sqoop, or Hadoop DistCp job. Apart from Hadoop jobs, an action can also run a Java application or a shell script, or send an email notification.

The order of the nodes in the workflow determines the execution order of the actions. An action does not start until the previous action has completed. The workflow's control nodes manage the execution flow of the actions. The start and end control nodes define where a workflow begins and ends. The fork and join control nodes make it possible to run actions in parallel. The decision control node works like a switch/case statement: it selects one execution path within the workflow based on job information.
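The fork and join behaviour described above can be sketched as a small workflow definition. This is a minimal illustration, not a production workflow: the workflow name, node names, directory paths, the ${nameNode} property, and the schema version (uri:oozie:workflow:0.5) are all assumptions chosen for the example.

```xml
<!-- Hypothetical workflow: names, paths, and schema version are illustrative. -->
<workflow-app name="fork-join-sketch" xmlns="uri:oozie:workflow:0.5">
    <start to="split"/>

    <!-- fork: both paths are started concurrently -->
    <fork name="split">
        <path start="prepare-a"/>
        <path start="prepare-b"/>
    </fork>

    <action name="prepare-a">
        <fs>
            <mkdir path="${nameNode}/tmp/sketch/a"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <action name="prepare-b">
        <fs>
            <mkdir path="${nameNode}/tmp/sketch/b"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- join: waits for every forked path before moving on -->
    <join name="merge" to="end"/>

    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>
```

Both fs actions transition to the same join node, so the workflow reaches the end node only after both parallel paths have finished.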

Because Apache Oozie workflows are directed acyclic graphs, they do not support loops.

The following figure represents an example of a workflow.

[Figure: example Oozie workflow]


Apache Oozie Workflow Definition

An Apache Oozie workflow definition is a DAG (directed acyclic graph) made up of control flow nodes (start, end, decision, fork, join, kill) and action nodes (map-reduce, pig, etc.). The workflow definition language is XML-based and is called hPDL (Hadoop Process Definition Language).
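To make the structure of a definition concrete, here is a minimal sketch of an hPDL file with one map-reduce action between the start and end nodes. The workflow name, the mapper and reducer class names, and the ${jobTracker}, ${nameNode}, ${inputDir}, and ${outputDir} properties are hypothetical; they would normally be supplied through a job properties file. The schema version shown (0.5) is also an assumption, and newer Oozie releases use different element names for some settings.

```xml
<!-- Hypothetical definition: class names and properties are placeholders. -->
<workflow-app name="minimal-sketch" xmlns="uri:oozie:workflow:0.5">
    <!-- control flow node: entry point of the workflow -->
    <start to="wordcount"/>

    <!-- action node: runs one MapReduce job -->
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.WordCountMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.example.WordCountReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- every action declares where to go on success and on failure -->
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- control flow node: terminates the workflow with an error -->
    <kill name="fail">
        <message>MapReduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- control flow node: successful completion -->
    <end name="end"/>
</workflow-app>
```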

Apache Oozie does not support cycles in workflow definitions; every workflow definition must be a strict DAG.


Apache Oozie Workflow Nodes

Apache Oozie workflow nodes fall into two categories, listed below.

  1. Control flow nodes
  2. Action nodes

Let us see each node in detail.


1. Control flow nodes

Control flow nodes control where a workflow starts and ends and which execution path a workflow job follows. The start, end, and kill nodes define the beginning and the end of a workflow. The decision, fork, and join nodes provide a way to control the workflow execution path.
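As an illustration of a decision node combined with start, end, and kill, the sketch below branches on whether an input directory exists. The node names and the ${inputDir} and ${outputDir} properties are hypothetical, and the schema version is an assumption; the predicate uses the fs:exists EL function.

```xml
<!-- Hypothetical workflow: node names and properties are illustrative. -->
<workflow-app name="decision-sketch" xmlns="uri:oozie:workflow:0.5">
    <start to="check-input"/>

    <!-- decision: the first case whose predicate is true is taken,
         otherwise the default transition is followed -->
    <decision name="check-input">
        <switch>
            <case to="clean-output">${fs:exists(inputDir)}</case>
            <default to="missing-input"/>
        </switch>
    </decision>

    <action name="clean-output">
        <fs>
            <delete path="${outputDir}"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- kill nodes end the workflow job in an error state -->
    <kill name="missing-input">
        <message>Input directory not found: ${inputDir}</message>
    </kill>

    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>
```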


2. Action nodes

Action nodes perform the actual processing in the workflow.
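As a rough sketch of non-Hadoop actions, the workflow below runs a shell command and then sends an email notification. The node names, the ${jobTracker}, ${nameNode}, and ${notifyAddress} properties, and the schema versions of the workflow, shell, and email actions are assumptions for illustration; the email action also presumes that an SMTP server has been configured for Oozie.

```xml
<!-- Hypothetical workflow: names, properties, and schema versions are illustrative. -->
<workflow-app name="action-sketch" xmlns="uri:oozie:workflow:0.5">
    <start to="run-script"/>

    <!-- shell action: executes a command on a cluster node -->
    <action name="run-script">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>hello from oozie</argument>
        </shell>
        <ok to="notify"/>
        <error to="fail"/>
    </action>

    <!-- email action: sends a notification once the shell action succeeds -->
    <action name="notify">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>${notifyAddress}</to>
            <subject>Workflow ${wf:id()} finished</subject>
            <body>The shell action completed successfully.</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>
```

Each action node declares an ok and an error transition, which is how the processing steps are wired into the surrounding control flow.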

We will look at each node type in detail in the next section.