XML input and Hadoop – custom InputFormat
Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won't tell you how long I crawled the Internet trying to find a working solution... not that anyone wants to know. Eventually I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.

To keep things simple, I won't include the full source code. I won't even include the whole InputFormat class. So, to make yourself comfortable, please do the following:

1. Pull up LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it.
2. Do the same with TextInputFormat from the same package.

Now we're almost there. Here is the piece of code for nextKeyValue(), which turned out to be the most critical method here. Hold on tight:

    public boolean nextKeyValue() throws IOException
    {
        StringBuilder sb = new StringBuilder();
        if (key == null)
        {
            key = new LongWritable();
        }
        key.set(pos);
        if (value == null)
        {
            value = new Text();
        }
        int newSize = 0;
        boolean xmlRecordStarted = false;
        Text tmpLine = new Text();

        while (pos < end)
        {
            newSize = in.readLine(tmpLine,
                                  maxLineLength,
                                  Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                           maxLineLength));
            if (newSize == 0)
            {
                break; // nothing left to read
            }
            // Count the bytes before any break below, so the closing line is not lost from pos.
            pos += newSize;

            if (tmpLine.toString().contains("<document "))
            {
                xmlRecordStarted = true; // opening tag seen: start collecting lines
            }
            if (xmlRecordStarted)
            {
                sb.append(tmpLine.toString().replaceAll("\n", " "));
            }
            if (tmpLine.toString().contains("</document>"))
            {
                xmlRecordStarted = false;
                this.value.set(sb.toString()); // closing tag seen: emit the record
                break;
            }
        }

        if (newSize == 0)
        {
            key = null;
            value = null;
            return false;
        }
        else
        {
            return true;
        }
    }

WTF, you will say? It's the same code as in LineRecordReader? Well, yes and no. It's almost the same. Take a look at this line:

    if (tmpLine.toString().contains("<document "))

and this line:

    if (tmpLine.toString().contains("</document>"))

This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won't add anything else.

Now, it's not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits, for instance: you can change the <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).

There are a few limitations to this approach. One of them is that if the document contains something like </document><document> it obviously won't work. Another is that you still need to parse elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
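For what it's worth, here is a minimal sketch of how the two loose ends above (hard-coded tag strings, and </document><document> sitting on one line) could be dealt with. It is not the record reader from this post: it swaps the line-by-line check for a plain indexOf scan over text that is already in memory, and it knows nothing about Hadoop splits. The class name XmlRecordSplitter and its split() method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of the record reader in this post.
public class XmlRecordSplitter
{
    /**
     * Returns every chunk of text that begins with startTag and ends with the
     * next endTag (inclusive). A trailing record without an end tag is dropped.
     */
    public static List<String> split(String text, String startTag, String endTag)
    {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true)
        {
            int start = text.indexOf(startTag, from);
            if (start < 0)
            {
                break; // no more opening tags
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0)
            {
                break; // record is incomplete
            }
            end += endTag.length();
            // Flatten line breaks so that each record ends up on a single line.
            records.add(text.substring(start, end).replaceAll("\\r?\\n", " "));
            from = end; // keep scanning right after the closing tag
        }
        return records;
    }

    public static void main(String[] args)
    {
        String xml = "<root>\n"
                   + "<document id=\"1\">\n<a>x</a>\n</document>"
                   + "<document id=\"2\"><a>y</a></document>\n"
                   + "</root>";
        for (String record : split(xml, "<document", "</document>"))
        {
            System.out.println(record);
        }
    }
}
```

Because the scan resumes right after each closing tag, two records on the same line come out separately. It is still naive: the start tag is matched as a plain prefix (so "<document" would also match "<documents"), and nested records are not handled.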
Comments (10)
Just found your post. I've been trying to do the same thing. In the end, I went with the XmlInputFormat from Mahout's Bayesian Classifier. Seems to do everything I need it to (and works without going screwy like the streaming one).
I posted about it here: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html


