When implementing MapReduce application using Pentaho PDI, the mapper and reducer can only send data out to a single destination directory when using the Hadoop Output step. The mapper outputs data is sent to a temporary directory on the node where it is executing and is only temporary. The reducer outputs to the output directory in HDFS that is configured in the Pentaho MapReduce job entry.
PDI does not support Hadoop’s native features that provides an output formatter that allows you to write MapReduce output to multiple destinations. However, the following method can be used to write output data to multiple destinations within a mapper or a reducer.
- In the mapper or reducer, use the Pentaho Hadoop File Output step (or Text Output step) to write data to files in HDFS. These files will be in addition to the standard mapper/reducer output files that are generated by the Hadoop Output step.
- The directory that the Hadoop File Output step stores the file under cannot be the same directory as the output directory configured in the MapReduce job entry.
- You need to programmatically create a unique name for the file if the Hadoop File Output step in all the mapper and reducers are writing to the same directory. You can have multiple mapper and reducers executing at the same time in your cluster, so you have to make sure each instance of the mapper or reducer creates a unique file in the HDFS name space. Only one mapper or reducer can write to an HDFS file at the same time.
The diagram below is an example KTR that functions as a reducer to the Pentaho MapReduce job:
- The Filter Rows step is used to send some rows to the standard Hadoop Output step and other rows to the Hadoop File Output step.
- The Hadoop Output step will send the data to HDFS directory that is configured as the output directory in the Pentaho MapReduce job entry (PDI Job not shown).
- The Hadoop File Output step creates another file in HDFS to store all rows less than 10. The actual filename contains timestamp that has a milliseconds resolution so no two mappers will create the same filename in the same HDFS dir (note the resolution of the timestamp is milliseconds).
The above design can be much more powerful then the native Hadoop API because PDI has connector to many technologies. You can use this design to write out to NoSQL data store such as Cassandra on MongoDB and publish rows to a Kafka bus.