Apache Oozie Introduction

What is Apache Oozie?

Apache Oozie is a workflow scheduler system that manages Hadoop jobs such as Apache MapReduce, Pig, Hive, and Sqoop jobs. It is designed to run a multistage Hadoop job as a single job, called an Oozie job. Oozie jobs can be configured to run on demand or periodically. On-demand jobs are called workflow jobs, and periodic jobs are called coordinator jobs. Oozie provides a third type of job, the bundle job, which is a collection of coordinator jobs managed as a single job. With Oozie, Hadoop administrators can build complex data transformations that combine the processing of individual tasks and even sub-workflows. This gives greater control over complex jobs and makes it easier to repeat those jobs at prearranged intervals.


How does Apache Oozie work?

Apache Oozie workflow definitions are written in hPDL (an XML Process Definition Language). A workflow is a collection of actions (such as Hadoop MapReduce jobs and Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph). Control dependency means that the second action cannot start until the first one has completed. Workflow actions start jobs in remote systems such as Hadoop and Pig, and when an action completes, Oozie starts the next action in the workflow.
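The control dependency rule can be illustrated with a minimal Python sketch. It is not Oozie code: the action names are hypothetical, and the standard-library graphlib module (Python 3.9 or later) stands in for the ordering that Oozie enforces on its own.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical action names. Each key maps to the set of actions
# that must complete before that action is allowed to start.
DEPENDENCIES = {
    "mr-ingest": set(),
    "pig-clean": {"mr-ingest"},
    "hive-report": {"pig-clean"},
}


def execution_order(dependencies):
    """Return the actions in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_dag(dependencies):
    """A workflow graph is usable only if it contains no cycle."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
```

Because the graph must be acyclic, a definition in which two actions wait on each other is rejected rather than scheduled.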

Oozie workflows have two types of nodes: control flow nodes and action nodes.

    Control Flow Nodes

    Oozie control flow nodes define the start and end of a workflow (the start, end, and kill nodes) and provide a mechanism to control the workflow execution path (the decision, fork, and join nodes).

    Action Nodes

    Oozie action nodes are used in a workflow to start the execution of a task. Oozie supports different types of actions, such as Hadoop MapReduce, Hadoop file system, Pig, SSH, HTTP, email, and Oozie sub-workflow. Oozie can also be extended to support additional types of actions, which we will discuss in a later section.
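As a rough analogy for the path-control nodes, the Python sketch below mimics a fork, a join, and a decision. Oozie runs forked actions as remote jobs rather than local threads, so this only illustrates the control flow; the function names, branch names, and the record-count predicate are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(branches):
    """Start every branch at once (fork), then wait for all of them (join)."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(task) for name, task in branches.items()}
        # result() blocks, so this returns only once every branch has finished.
        return {name: future.result() for name, future in futures.items()}


def choose_path(record_count):
    """Decision stand-in: pick the next node name from a predicate."""
    return "process-data" if record_count > 0 else "end"


if __name__ == "__main__":
    results = fork_join({"pig-branch": lambda: "pig done",
                         "hive-branch": lambda: "hive done"})
    print(results, choose_path(len(results)))
```

The join step is what keeps the rest of the workflow from continuing until every forked branch has finished.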

In the following figure, Oozie runs a MapReduce job. If the MapReduce job completes successfully, the workflow job ends normally; if the MapReduce job fails, Oozie kills the workflow.

[Figure: Oozie workflow running a MapReduce job]
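The success and failure paths described above can be sketched as a small workflow definition. The XML embedded below is an illustrative, trimmed-down example rather than a complete job: the node names (mr-node, fail, end), the schema version in the namespace, and the ${jobTracker} and ${nameNode} parameters are assumptions, and a real MapReduce action would also need its job configuration. The Python around it only reads the definition to show where each outcome leads.

```python
import xml.etree.ElementTree as ET

# Illustrative workflow definition; node names and schema version are assumptions.
WORKFLOW_XML = """
<workflow-app xmlns="uri:oozie:workflow:0.5" name="mr-example-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.5"}


def transitions(xml_text):
    """Map each action name to the nodes it moves to on success and on failure."""
    root = ET.fromstring(xml_text.strip())
    return {
        action.get("name"): {
            "ok": action.find("wf:ok", NS).get("to"),
            "error": action.find("wf:error", NS).get("to"),
        }
        for action in root.findall("wf:action", NS)
    }


def node_kind(xml_text, name):
    """Return the element type (action, kill, end, ...) of the named node."""
    root = ET.fromstring(xml_text.strip())
    for node in root:
        if node.get("name") == name:
            return node.tag.split("}")[1]  # drop the "{namespace}" prefix
    return None


if __name__ == "__main__":
    print(transitions(WORKFLOW_XML))
    print(node_kind(WORKFLOW_XML, "fail"))
```

Reading the transitions this way shows the two outcomes directly: the ok transition leads to the end node, and the error transition leads to a kill node.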


How was the Apache Oozie name chosen?

Alejandro Abdelnur, who started the project, wanted a name that indicates the nature of the system. During Oozie's development he was in India and wanted to use mahout, the Hindi word for elephant keeper, but that name was already taken by the Apache Mahout project. After several names were considered, Oozie was chosen. Oozie is the Burmese word for elephant keeper.

Apache Oozie History

Let us look at the year-by-year evolution of Apache Oozie.

    2008: Alejandro Abdelnur and a few engineers at Yahoo! started working on a system to run multistage Hadoop jobs, and within a month the first version of Oozie was developed.

    2010: Yahoo! open-sourced Oozie.

    2011: Oozie was submitted to the Apache Incubator.

    2012: The Apache Software Foundation made Oozie a top-level project.


Apache Oozie Features

Apache Oozie provides the following features.

  • Apache Oozie executes workflows of Hadoop jobs.
  • Applications can launch and monitor their jobs using the Apache Oozie command-line interface and client API.
  • Apache Oozie provides web service APIs through which jobs can be managed from anywhere.
  • It can trigger job execution based on data availability.
  • Apache Oozie provides several interfaces for managing jobs: the command line, HTTP, and a web console.
  • Apache Oozie can notify the user by email when a job completes.
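As a hedged illustration of the web service APIs, the sketch below only builds the URL for a job-information request; it does not contact a server. The host name and job ID are hypothetical, and the path shape (/v1/job/<job-id>?show=info) follows the Oozie Web Services API documentation as best understood here, so check it against the version your server runs.

```python
from urllib.parse import quote, urlencode


def job_info_url(base_url, job_id, api_version="v1"):
    """Build the URL that asks an Oozie server for information about one job."""
    query = urlencode({"show": "info"})
    return f"{base_url.rstrip('/')}/{api_version}/job/{quote(job_id)}?{query}"


if __name__ == "__main__":
    # Hypothetical server address and job ID, for illustration only.
    print(job_info_url("http://oozie.example.com:11000/oozie",
                       "0000001-150101000000000-oozie-oozi-W"))
```

Any HTTP client can then send a GET request to that URL, which is what makes it possible to manage jobs from outside the cluster.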

Apache Oozie Advantages

The following are some of the advantages of Apache Oozie.

  • Apache Oozie's programming model is designed to ease adoption and reduce the burden on developers.
  • Users can easily troubleshoot failed jobs and rerun them.
  • Apache Oozie is extensible and can support additional types of jobs.
  • It provides a multitenant service, which reduces the cost of operation.
  • Users can run concurrent jobs using the Apache Oozie workflow scheduler.
  • Jobs run on a server, which increases reliability.

Apache Oozie Use Case

Apache Oozie is widely used by organizations to schedule Hadoop jobs. Yahoo! is a major user of Oozie. It has one of the largest Hadoop deployments, with more than 40,000 nodes across several clusters, and Apache Oozie is the primary workflow engine for its Hadoop clusters. As of January 2015, Apache Oozie started 72% of the Hadoop jobs at Yahoo!.

The following are some statistics from Yahoo!.

  • Busiest cluster:
    • 1 million+ workflows per month.
    • 45 - 55K workflows per day.
    • 40 - 50K coordinator actions per day.
    • 800 - 900 coordinators (5m, 15m, 30m, hourly, daily, and weekly).
    • 30 - 40 bundles.

  • Most complex bundle:
    • 230 coordinators.

  • Most complex workflow:
    • 85 forks.

  • Video Transcoding:
    • 100 - 300 workflows per minute.