Hadoop can efficiently process large amounts of data by splitting large files into smaller chunks (typically 128 MB) and processing each chunk in parallel. The "splittability" of a file is central to how efficiently MapReduce can handle it. A file format is splittable if processing can begin at any record boundary in the file; a format that requires reading from the first line of the file is not splittable. Therefore, XML and JSON documents that span multiple lines are not splittable: you cannot start processing an XML document in the middle because you need the starting and ending tags of the XML hierarchy. There are four possible ways to handle these types of files in Hadoop:
- Store the XML in an HBase column. Then process the HBase table, extract the column that contains the XML, and parse the XML as needed in a mapper. The main downside of this solution is that it forces you to use HBase. If you decide to use this method, you can use PDI's HBase Row Decoder step to get the HBase data into a Pentaho MapReduce (see this posting for how to use this step in PMR).
- Remove all line breaks from the doc so that the entire XML/JSON data becomes a single-line record. If you are using a text-based input format for the mapper/reducer, you can strip all the line breaks from the XML doc or JSON object so that it appears as a single string record in the text file. You then put the single line in an HDFS file, with multiple XML docs or JSON objects on separate lines of the same file. When your mapper function receives the record, it will get the entire XML doc or JSON object as a single line. Although I have seen this solution implemented, I do not recommend it: if your XML/JSON contains data with unencoded line breaks (for example, inside a CDATA section), you will lose that formatting.
- Use a custom input format. This is probably the most popular solution. It requires you to implement Hadoop's Java classes for a custom input format and record reader. The Apache Mahout project has one of the most widely used implementations of this solution (see Mahout's XmlInputFormat implementation). Although this is probably the most popular method of processing XML files in a MapReduce application, I find it is not the most efficient because you effectively have to parse the XML document twice: once to get the entire doc as a single record that gets passed to your mapper, and a second time in your mapper when your code parses it for further processing.
- Store the data in binary/hex format. In this solution you take the text data and put it into a binary (byte array) representation that can be encoded as a single line of text. You then write that encoded string as a single line to an HDFS file, with multiple "stringified" binary representations of XML/JSON docs on separate lines of the same file. In the mapper you reverse the encoding to recover the original text. This preserves all the formatting and special characters that your XML/JSON contains.
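The last option can be sketched in a few lines. The following Python snippet is illustrative only (the sample application described later does this with a PDI Java Expression step, not Python), and the XML documents are made up for the example. Each multi-line doc is hex-encoded into a single line, and the mapper side reverses the encoding:

```python
import binascii

# Two multi-line XML docs; line breaks and special characters must survive.
docs = [
    "<order id=\"1\">\n  <item>widget</item>\n</order>",
    "<order id=\"2\">\n  <note><![CDATA[line 1\nline 2]]></note>\n</order>",
]

# Encode: each doc becomes one hex string with no line breaks,
# so every doc occupies exactly one line of the HDFS file.
lines = [binascii.hexlify(doc.encode("utf-8")).decode("ascii") for doc in docs]
assert all("\n" not in line for line in lines)

# Decode (what the mapper does): reverse the hex encoding per record.
decoded = [binascii.unhexlify(line).decode("utf-8") for line in lines]
assert decoded == docs  # round trip preserves line breaks and CDATA
```

Because the hex alphabet contains no whitespace, the encoded records are guaranteed to be one-per-line regardless of what the original documents contain.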
The last option is the preferred method because it avoids the issues of the other options, and it is also very easy to implement in PDI. If you are writing your application in Java, you could implement it using Hadoop's sequence file format. However, the current release of PDI cannot natively write out to a sequence file. You could write custom Java code in PDI to write out to a Hadoop sequence file, but there is an easier way to get the same benefits without using sequence files.
The attached sample PDI application demonstrates how to process non-splittable text formats such as XML and JSON in a MapReduce application. This example pre-processes XML docs into hex strings that can each be written out as a single line. Each line can then be read into a mapper as a single record, converted back to XML, and processed using PDI's XML parsers. It uses a Java Expression PDI step to convert between the XML string and its hex representation.
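To make the mapper side concrete, here is an illustrative sketch in Python of what the map step does conceptually (the actual sample implements this with PDI steps, and the `id`/`item` field names are invented for the example): decode the hex line back to XML, parse it, and emit the wanted fields as a CSV row.

```python
import binascii
import xml.etree.ElementTree as ET

def map_record(line):
    """Sketch of the mapper logic: decode one hex-encoded line back to
    its original XML text, then extract fields and emit a CSV row."""
    xml_text = binascii.unhexlify(line.strip()).decode("utf-8")
    root = ET.fromstring(xml_text)
    # 'id' and 'item' are hypothetical fields used only for illustration.
    return ",".join([root.get("id", ""), root.findtext("item", "")])

# One hex-encoded record, as it would arrive from a line of the HDFS file.
record = binascii.hexlify(
    "<order id=\"7\">\n  <item>widget</item>\n</order>".encode("utf-8")
).decode("ascii")
print(map_record(record))  # -> 7,widget
```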
You can download this PDI solution here: xml_processing
Start by looking at the PDI job implemented in master_job.kjb. This PDI job performs the following tasks:
- Configures variables for Hadoop and various other parameters used in the application. You will need to configure the Hadoop cluster as needed in the Set Variables job entry.
- Calls the store_binary_hdfs transformation, which reads all the XML docs in the data directory, converts them to hex strings, and writes them out to a single HDFS file.
- Runs a Pentaho MapReduce (map-only) application to parse the XML and extract the wanted fields into a CSV record that is sent to the mapper output.
The sample app has been tested with the following software:
- Pentaho Data Integration 5.2
- Cloudera CDH5.1