Hadoop article

Recently I wrote a Hadoop article in Russian for one of the very popular Russian IT blogs. After giving the idea a second thought, I translated the article (or rather its first part, as the second is still in progress) into English and uploaded it to my website (the Posterous format isn't very good for such long articles).

Check it here: http://romankirillov.info/hadoop.html

(and don't be mad at my clumsy English!)

 

P.S. In case you can read Russian: http://sigizmund.habrahabr.ru/blog/74792/

Filed under  //   hadoop  


More Hadoop Mysteries - order of initialisation

Hey out there! Still not tired of my Hadoop experiments? Not yet? Here’s another one for you!

What do you think the difference is between these two snippets of code? Say, this:

// Assumption: conf is the same object that getConf() returns
Configuration conf = getConf();
SomeCodeWhichChangesConfig.initialise(getConf());
Job job = new Job(conf, "MyHadoopJob");

// ... setting the job details

if (!job.waitForCompletion(true))
{
    System.err.println("FAILED, cannot continue");
}

… and this:

// Assumption: conf is the same object that getConf() returns
Configuration conf = getConf();
Job job = new Job(conf, "MyHadoopJob");

// ... setting the job details

SomeCodeWhichChangesConfig.initialise(getConf());
if (!job.waitForCompletion(true))
{
    System.err.println("FAILED, cannot continue");
}

No difference, you say? Not quite right, sir: the difference is that whatever you do to conf after creating a job has no effect on that job. That is, the Job constructor apparently copies all the data and doesn’t link your Configuration object with its own copy. Brilliant, no?

(And I spent a couple of hours trying to understand why the distributed cache works properly in one app and doesn’t work at all in another.) So now you know. Be warned.
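To make the effect easier to see without a cluster, here is a minimal Hadoop-free sketch. MiniConf and MiniJob are hypothetical stand-ins that I made up for illustration, not Hadoop classes; the only assumption carried over from the real thing is the behaviour observed above, namely that the job takes a snapshot of the configuration in its constructor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: just a string map.
class MiniConf {
    private final Map<String, String> values = new HashMap<>();

    MiniConf() { }

    // Copy constructor: takes a snapshot of the other object's entries.
    MiniConf(MiniConf other) {
        values.putAll(other.values);
    }

    void set(String name, String value) { values.put(name, value); }

    String get(String name) { return values.get(name); }
}

// Hypothetical stand-in for Job: keeps its own copy of the configuration.
class MiniJob {
    private final MiniConf conf;

    MiniJob(MiniConf conf) {
        this.conf = new MiniConf(conf); // snapshot, not a shared reference
    }

    MiniConf getConfiguration() { return conf; }
}

class ConfCopyDemo {
    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.set("early.setting", "set-before");  // made before the job exists
        MiniJob job = new MiniJob(conf);
        conf.set("late.setting", "set-after");    // made after the snapshot was taken

        // The early setting is visible to the job, the late one is not.
        System.out.println(job.getConfiguration().get("early.setting"));
        System.out.println(job.getConfiguration().get("late.setting"));
    }
}
```

In real Hadoop code the practical consequence is the same as in the sketch: either finish all changes to the configuration before calling new Job(conf, ...), or make the late changes through job.getConfiguration().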

Filed under  //   geek   hadoop  


XML input and Hadoop – custom InputFormat

Today I finally hit the task I had been scared of for so long: processing large XML files on Hadoop. I won’t tell you how long I crawled the Internet trying to find a working solution… not that anyone wants to know. Eventually, I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.

To keep things simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:

  1. Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it.
  2. Open TextInputFormat from the same package.
  3. Create an input format and a record reader of your own, just by copying and pasting the code from the aforementioned classes.
  4. Change the createRecordReader() method of your input format class so it returns your newly defined record reader.

Now we’re almost there. Here is the code for nextKeyValue(), which turned out to be the most critical method here. Hold on tight:

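public boolean nextKeyValue() throws IOException
{
    StringBuilder sb = new StringBuilder();
    if (key == null)
    {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null)
    {
        value = new Text();
    }
    int newSize = 0;

    boolean xmlRecordStarted = false;
    boolean xmlRecordComplete = false;
    Text tmpLine = new Text();

    // Keep reading past the end of the split while a record is still open,
    // so a record that straddles the boundary is not cut in half
    while (pos < end || xmlRecordStarted)
    {
        // Same readLine() call as in LineRecordReader, but into a scratch buffer
        newSize = in.readLine(tmpLine,
                              maxLineLength,
                              Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                       maxLineLength));

        if (newSize == 0)
        {
            break; // end of input
        }

        // Count every consumed line, including the one that closes the record
        pos += newSize;

        if (tmpLine.toString().contains("<document "))
        {
            xmlRecordStarted = true;
        }

        if (xmlRecordStarted)
        {
            sb.append(tmpLine.toString().replaceAll("\n", " "));

            if (tmpLine.toString().contains("</document>"))
            {
                value.set(sb.toString());
                xmlRecordComplete = true;
                break;
            }
        }
    }

    // Report a record only if both the opening and the closing tag were seen
    if (!xmlRecordComplete)
    {
        key = null;
        value = null;
        return false;
    }
    return true;
}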

WTF, you will say, it’s the same code? Well, yes and no. It’s almost the same. Take a look at this line:

if (tmpLine.toString().contains("<document "))

and this line:

if (tmpLine.toString().contains("</document>")) 

This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won’t add anything else.
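If you want to play with the splitting idea without Hadoop at all, here is a rough plain-Java sketch of the same logic. XmlRecordSplitter and its split() method are names I made up for this illustration; the two markers are parameters, and the sample input in main() is invented test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class XmlRecordSplitter {
    /**
     * Collects every run of lines from one containing startMarker up to and
     * including the next line containing endMarker, joined with single spaces.
     */
    static List<String> split(Iterable<String> lines, String startMarker, String endMarker) {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean started = false;
        for (String line : lines) {
            if (!started && line.contains(startMarker)) {
                started = true;
            }
            if (started) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(line.trim());
                if (line.contains(endMarker)) {
                    records.add(sb.toString());
                    sb.setLength(0);
                    started = false;
                }
            }
        }
        return records; // an unterminated trailing record is dropped
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<corpus>",
                "<document id=\"1\">",
                "  <title>First</title>",
                "</document>",
                "<documenttype>ignored</documenttype>",
                "<document id=\"2\">",
                "  <title>Second</title>",
                "</document>",
                "</corpus>");
        for (String record : split(lines, "<document ", "</document>")) {
            System.out.println(record);
        }
    }
}
```

Just like the record reader, this works line by line, so it relies on the opening and closing markers of different records never sharing a line.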

Now, it’s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:

  1. It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class — some fields are private, and we clearly want to modify them.
  2. It’s configurable — you can easily change the <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).
  3. It works.

There are a few limitations to this approach. One of them is that if the document contains something like </document><document> on a single line, it obviously won’t work. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
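For the parsing part, the XML parser that ships with the JDK is enough for a single record. Below is a small sketch of what the mapper side could do with one record string; RecordParser is a hypothetical helper, and the id attribute and title element are invented for the example, so substitute whatever your documents really contain.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RecordParser {
    /** Parses one record (a complete XML element as a string) and returns its root element. */
    static Element parse(String record) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(record))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed XML record", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample record, shaped like the value the record reader emits
        Element doc = parse("<document id=\"1\"> <title>First</title> </document>");
        System.out.println(doc.getAttribute("id"));
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

Creating a DocumentBuilder per record is wasteful; in a real mapper you would probably create it once and reuse it.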

Have fun!

Update: As you can see, I have added a space to the "<document " string constant. Today I realised that "<documenttype" elements had been happily matched as split points too, hence producing inconsistent results.
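The trailing space does the job for my files, but it would miss a bare <document> with no attributes. A slightly more tolerant check could look at the character right after the tag name instead. This is only a sketch of that idea, not what my record reader does; TagMatch and hasOpeningTag() are hypothetical names.

```java
class TagMatch {
    /** True if the line contains an opening tag named exactly `name`, not a longer name. */
    static boolean hasOpeningTag(String line, String name) {
        String prefix = "<" + name;
        int from = 0;
        while (true) {
            int i = line.indexOf(prefix, from);
            if (i < 0) {
                return false;
            }
            int next = i + prefix.length();
            if (next == line.length()) {
                return true; // the line ends right after the tag name
            }
            char c = line.charAt(next);
            if (Character.isWhitespace(c) || c == '>' || c == '/') {
                return true;
            }
            from = i + 1; // a longer name such as <documenttype, keep looking
        }
    }

    public static void main(String[] args) {
        System.out.println(hasOpeningTag("<document id=\"1\">", "document"));
        System.out.println(hasOpeningTag("<documenttype>", "document"));
    }
}
```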

Filed under  //   geek   hadoop   mapreduce   xml  


Debugging Hadoop applications using your Eclipse

Well, it can be annoying - it can be awfully annoying, in fact, to debug Hadoop applications. But sometimes you need to, because logging doesn't show anything and you've tried everything but still cannot get under Hadoop's covers. In this case, follow a few simple steps.

1. Download and unpack Hadoop on your local machine.
2. Prepare a small set of data you're planning to run the test on.
3. Check that you can actually run Hadoop locally, with something like this (don't forget to set $HADOOP_CLASSPATH first!):

bin/hadoop jar yourprogram.jar com.company.project.HadoopApp \
          tiny.txt ./out

4. Go to Hadoop's directory and copy the file bin/hadoop to bin/hdebug.
5. Now we need to make Hadoop start in debug mode. To do that, add one line of text to the startup script (the bin/hdebug copy):

Yes, here it is. Copy it from here:

-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y

Basically, it tells Java to start in debug mode and wait for a remote debugger to connect to a socket on port 8001; execution is suspended right after startup until the debugger is connected.
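If you are not sure whether the option actually reached the JVM, you can ask the JVM itself. The sketch below uses the standard RuntimeMXBean from java.lang.management; DebugAgentCheck is a hypothetical helper class, and it simply looks for arguments starting with -Xrunjdwp: or the newer -agentlib:jdwp spelling.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class DebugAgentCheck {
    /** True if any JVM argument looks like a JDWP debug agent option. */
    static boolean hasJdwpArgument(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xrunjdwp:") || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main()
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("JDWP agent requested: " + hasJdwpArgument(jvmArgs));
    }
}
```

Keep in mind that with suspend=y nothing is printed until a debugger has attached, because the JVM waits before running any application code.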

Now go and start your grid application like you did in step 3, but this time use the bin/hdebug script we've created. If you've done everything correctly, the program should output something like this:

Listening for transport dt_socket at address: 8001

and wait for the debugger. So let's give it one! Fire up Eclipse with your project (you likely have it open already, since you're trying to debug something) and add a new Debug configuration of the Remote Java Application type, pointing at localhost and port 8001:

After you've set everything up, click "Apply" and close the window for now; you'll probably want to set some breakpoints before starting the actual debugging. Go and do that, then simply launch the debug configuration you created, and off you go! If everything worked properly, you should soon get a standard debugger window, with all the nice things Java can offer you. Hope this helps some of us in our difficult business of writing distributed grid-enabled applications! :)


Filed under  //   geek   hadoop   work  

