Monthly Archives: October 2013

How to get Hadoop counters in PDI applications

Hadoop maintains various counters about MapReduce jobs in the JobTracker. These counters can be viewed in the JobTracker’s web UI but are not easily accessible from within a PDI application. This solution shows how you can retrieve the Hadoop counters from a PDI application. The attached solution implements the word count MapReduce application using Pentaho MapReduce. Once the Pentaho MapReduce job completes, it collects and logs all of the Hadoop counters. The solution contains the following files/folders:

  1. data – directory that contains sample files that will be used for the word count. These files are copied to HDFS before the word count MapReduce application runs.
  2. wordcount.kjb – This is the main PDI application that coordinates all the steps required to perform the word count. It performs the following tasks:
    1. Creates a unique Hadoop Job Name for our MapReduce application.
    2. Sets variables to configure the Hadoop cluster.
    3. Executes the word count MapReduce application using Pentaho MapReduce.
    4. Retrieves the Hadoop counters and logs them to Kettle’s log file.
  3. wc_mapper.ktr – Transformation that implements the map phase of the word count application.
  4. wc_reducee.ktr – Transformation that implements the reduce phase of the word count application.
  5. capture_hadoop_metrics.ktr – This transformation collects all the Hadoop counters in a User Defined Java Class (UDJC) and writes the information to the logs. The UDJC uses Hadoop’s native Java API to retrieve the counters and outputs a single row for each counter.
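Outside of PDI, the counter lookup that the UDJC performs can be sketched in plain Java against Hadoop’s old (mapred) API. The JobTracker address and job name below are placeholders, and this is only an illustrative sketch of the approach, not the UDJC code from the sample:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class HadoopCounterDump {
    public static void main(String[] args) throws IOException {
        // Placeholder JobTracker address -- replace with your cluster's host:port.
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        JobClient client = new JobClient(conf);

        // Placeholder job name -- must match the unique PMR Job Name.
        String targetJobName = "wordcount_example";

        // getAllJobs() returns non-retired jobs only, hence requirement 1 below.
        for (JobStatus status : client.getAllJobs()) {
            RunningJob job = client.getJob(status.getJobID());
            if (job != null && targetJobName.equals(job.getJobName())) {
                Counters counters = job.getCounters();
                // Emit one line per counter: group, counter name, value.
                for (Counters.Group group : counters) {
                    for (Counters.Counter counter : group) {
                        System.out.println(group.getDisplayName() + "\t"
                                + counter.getDisplayName() + "\t"
                                + counter.getCounter());
                    }
                }
            }
        }
    }
}
```

The UDJC in the transformation follows the same pattern but emits each counter as a Kettle row rather than printing it.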

In order for this solution to work, the following requirements must be met:

  1. The Hadoop job must not have been retired. Hadoop’s Java API only retrieves job and counter data for non-retired jobs.
  2. The Hadoop Job Name must be unique. Information about a Hadoop job is retrieved using the PMR (Pentaho MapReduce) Job Name, so this name must be unique among all non-retired Hadoop jobs.
  3. PDI uses Hadoop’s Java API to retrieve the counters, so all required Hadoop libraries (JARs) must be copied to the appropriate Pentaho application lib directory. For example, if you are using CDH4.x and run the sample app in Spoon 5.x, copy all of the Hadoop client libraries from [PentahoHome]/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh42/lib/client/* to [PentahoHome]/design-tools/data-integration/lib.
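One simple way to satisfy requirement 2 is to append a timestamp to a base name before launching the Pentaho MapReduce entry. The class name and timestamp format below are illustrative assumptions, not part of the sample app:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class UniqueJobName {
    // Append a millisecond-resolution timestamp so repeated runs of the
    // same job never reuse a name among the non-retired Hadoop jobs.
    public static String uniqueJobName(String base) {
        String stamp = new SimpleDateFormat("yyyyMMdd_HHmmssSSS").format(new Date());
        return base + "_" + stamp;
    }

    public static void main(String[] args) {
        System.out.println(uniqueJobName("wordcount"));
    }
}
```

In the sample, the equivalent is done in wordcount.kjb by setting a variable that is then used as the Hadoop Job Name of the Pentaho MapReduce entry.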

You can download this PDI solution here: get_hadoop_metrics.tar.gz

The sample app has been tested with the following software:

  1. Pentaho Data Integration 5.x
  2. Cloudera CDH4.3


Filed under Big Data, Hadoop, MapReduce, PDI, Pentaho, Pentaho Data Integration