Uploading custom JARs to Hadoop for PDI

Many PDI applications require third-party Java libraries to perform tasks within PDI jobs and transformations. These libraries must be on the classpath of the Hadoop mappers and reducers so that PDI applications can use them in the Hadoop cluster. The best way to do this is to copy all dependent JARs to Hadoop’s Distributed Cache and add the following parameters to the Pentaho MapReduce job step (in the User Defined tab); example values are shown after the list:

  • mapred.cache.files
  • mapred.job.classpath.files
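
As an illustration, if the dependent JARs were copied to an HDFS directory such as /opt/pentaho/libs (a hypothetical path, as are the JAR names below), the User Defined tab entries might look like this. Note that mapred.cache.files is comma-separated, while the separator mapred.job.classpath.files expects can vary by Hadoop version (the platform path separator, a colon, on most Hadoop 1.x Linux clusters):

    mapred.cache.files            /opt/pentaho/libs/commons-lang-2.6.jar,/opt/pentaho/libs/custom-app.jar
    mapred.job.classpath.files    /opt/pentaho/libs/commons-lang-2.6.jar:/opt/pentaho/libs/custom-app.jar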

The process of uploading the custom JARs can be automated with a PDI transformation that does the following:

  1. Takes a list of the JARs the PDI application requires.
  2. Copies all the files from the local filesystem to a configured HDFS directory.
  3. Sets a global variable containing the list of all the JARs with their fully qualified HDFS paths. This variable is then used to set the user-defined parameters given above in the Pentaho MapReduce job step (see the sketch after this list).
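
To make the idea concrete, here is a minimal standalone Java sketch of what the transformation automates: copying local JARs into HDFS and building the comma-separated list of fully qualified HDFS paths. The directory, JAR names, and NameNode address are hypothetical, and in the actual solution this logic is implemented with PDI steps rather than hand-written Java:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.IOException;
    import java.net.URI;

    public class JarUploader {

        // Hypothetical HDFS directory for the dependent JARs.
        private static final String HDFS_LIB_DIR = "/opt/pentaho/libs";

        public static void main(String[] args) throws IOException {
            // JARs the PDI application requires (hypothetical names).
            String[] localJars = {
                "/home/pdi/libs/commons-lang-2.6.jar",
                "/home/pdi/libs/custom-app.jar"
            };

            Configuration conf = new Configuration();
            // Point at the cluster's NameNode (hypothetical address).
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            StringBuilder hdfsJarList = new StringBuilder();
            for (String localJar : localJars) {
                Path src = new Path(localJar);
                Path dst = new Path(HDFS_LIB_DIR, src.getName());
                // Copy the JAR from the local filesystem into the configured HDFS dir.
                fs.copyFromLocalFile(src, dst);

                if (hdfsJarList.length() > 0) {
                    hdfsJarList.append(',');
                }
                hdfsJarList.append(dst.toUri().getPath());
            }

            // This comma-separated list is what the transformation stores in a
            // global variable and feeds into mapred.cache.files (and, with the
            // separator your Hadoop version expects, mapred.job.classpath.files).
            System.out.println(hdfsJarList);
        }
    }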

You can download this PDI solution here.

This solution has been tested with PDI 4.4.
