Thursday, 17 September 2009

Debugging Hadoop applications with Eclipse

Well, it can be annoying, awfully annoying in fact, to debug Hadoop applications. But sometimes you have to, because logging doesn't show anything useful and you've tried everything but still can't get under Hadoop's covers. In that case, follow these few simple steps.



1. Download and unpack Hadoop to your local machine. 

2. Prepare a small set of data to run the test on.

3. Check that you can actually run Hadoop locally, with something like this (don't forget to set $HADOOP_CLASSPATH first!):


bin/hadoop jar yourprogram.jar com.company.project.HadoopApp tiny.txt ./out


4. Go to the Hadoop directory and copy the file bin/hadoop to bin/hdebug.

5. Now we need to make Hadoop start in debug mode. To do that, add one line of text to the start script (bin/hdebug), so that the following options are passed to the JVM. Copy them from here:



-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y



Basically, this tells Java to start in debug mode and listen for a socket connection from a remote debugger on port 8001 (server=y); because of suspend=y, execution is suspended right after startup until a debugger connects.
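Steps 4 and 5 can also be scripted. What follows is only a rough sketch: make_hdebug is a hypothetical helper (not part of Hadoop), and it assumes that your copy of bin/hadoop passes $HADOOP_OPTS on to the java command and only ever appends to that variable. Check the script before relying on this.

```shell
# JDWP flags from the post: listen on port 8001 and wait for a debugger.
DEBUG_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y"

# make_hdebug SRC DST (hypothetical helper): copy the launcher SRC to DST and
# insert, right after the first (shebang) line, a line that appends DEBUG_OPTS
# to HADOOP_OPTS.
# Assumption: the launcher hands $HADOOP_OPTS to the java command and never
# resets it further down. SRC and DST must be different files.
make_hdebug() {
    src=$1
    dst=$2
    {
        head -n 1 "$src"
        printf 'HADOOP_OPTS="$HADOOP_OPTS %s"\n' "$DEBUG_OPTS"
        tail -n +2 "$src"
    } > "$dst"
    chmod +x "$dst"
}

# Usage, from the Hadoop directory:
#   make_hdebug bin/hadoop bin/hdebug
```

If your script builds the java command line differently, add the flags by hand wherever the other JVM options are assembled.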


Now start your grid application as you did in step 3, but this time use the bin/hdebug script we've created, with the same arguments. If you've done everything correctly, the program should output something like this:


Listening for transport dt_socket at address: 8001


and wait for a debugger. So let's give it one! Fire up Eclipse with your project (you likely have it open already, since you're trying to debug something) and add a new debug configuration of type "Remote Java Application", with the host set to localhost and the port set to 8001:





After you've set everything up, click "Apply" and close the window for now; you'll probably want to set some breakpoints before starting the actual debugging. Go and do that, then simply launch the debug configuration you created, and off you go! If everything worked, you should soon see the standard debugger window, with all the nice things Java can offer you. Hope it helps some of us in our difficult business of writing distributed grid-enabled applications! :)