Oozie is a workflow scheduler system that manages Apache Hadoop jobs.
Oozie runs workflows of dependent jobs, which users define as Directed Acyclic Graphs (DAGs). Within a DAG, Hadoop can run independent actions in parallel and dependent actions sequentially.
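Oozie workflows are defined in XML. The sketch below shows the shape of a minimal workflow definition with a single MapReduce action; the application name, mapper class, and parameter names are illustrative, not from the original text.

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="first-action"/>
    <action name="first-action">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.MyMapper</value> <!-- hypothetical class -->
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success (`ok`) and on failure (`error`), which is how the DAG of dependent jobs is expressed.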
This workflow scheduler system consists of two parts:
- Workflow engine: The workflow engine stores and runs workflows composed of Hadoop jobs, such as MapReduce, Pig, and Hive jobs.
- Coordinator engine: The coordinator engine runs workflow jobs based on predefined schedules and the availability of data.
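A coordinator is also defined in XML: it pairs a time-based schedule with one or more datasets whose availability gates the run. A hedged sketch, with all names, dates, and HDFS paths purely illustrative:

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- hypothetical daily dataset the workflow depends on -->
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2024-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs:///apps/example-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The coordinator engine materializes one action per frequency interval, but only launches the workflow once the day's input data actually exists.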
Oozie operates by running as a service in a Hadoop cluster with clients submitting workflow definitions for immediate or delayed processing.
An Oozie workflow consists of action nodes and control-flow nodes.
An action node performs a workflow task, such as moving files into HDFS. A control-flow node governs execution between actions by allowing constructs like conditional logic, so that different actions can follow depending on the result of an earlier action node.
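The conditional logic mentioned above is expressed with a `decision` control-flow node. A minimal fragment, assuming a workflow where a hypothetical cleanup step should only run if an output path already exists:

```xml
<decision name="check-output">
    <switch>
        <!-- hypothetical path; fs:exists is an Oozie workflow EL function -->
        <case to="cleanup">${fs:exists('/data/output')}</case>
        <default to="end"/>
    </switch>
</decision>
```

Cases are evaluated in order and the first predicate that is true determines the next node; `default` handles the fall-through.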
Oozie integrates with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box.
Features of Oozie include:
- A client API and command-line interface, which can be used to launch, control, and monitor jobs from Java applications.
- Web Service APIs, through which jobs can be controlled from anywhere.
- Provision to execute jobs that are scheduled to run periodically.
- Provision to send email notifications upon completion of jobs.
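The Web Service API is plain HTTP, so jobs can be monitored from any language. A minimal Python sketch that builds request URLs for two common read operations; the server address and job ID are hypothetical, and only the URL construction runs here (fetching them requires a live Oozie server).

```python
# Sketch: constructing Oozie Web Service (REST) API request URLs.
# The host below is a hypothetical Oozie server; port 11000 is the usual default.
from urllib.parse import urlencode

OOZIE_BASE = "http://oozie-host:11000/oozie"  # assumption, not from the article

def job_info_url(job_id: str) -> str:
    """URL that returns status and details for one workflow job."""
    return f"{OOZIE_BASE}/v1/job/{job_id}?{urlencode({'show': 'info'})}"

def jobs_filter_url(status: str) -> str:
    """URL that lists jobs filtered by status (e.g. RUNNING, KILLED)."""
    return f"{OOZIE_BASE}/v1/jobs?{urlencode({'filter': f'status={status}'})}"

print(job_info_url("0000001-240101000000000-oozie-W"))  # hypothetical job ID
print(jobs_filter_url("RUNNING"))
```

With a live server, these URLs can be fetched with `urllib.request.urlopen`; the equivalent command-line route is the `oozie job` CLI (for example, `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run` to launch a workflow).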
In Data Defined, we help make the complex world of data more accessible by explaining some of the most complex aspects of the field.