Wednesday, 18 November 2009
Python WTF?
Tuesday, 17 November 2009
I Love Wikipedia
In mathematics, Stirling numbers of the second kind, together with Stirling numbers of the first kind, are one of the two types of Stirling numbers.
iPhone reinstall, contacts and SMS
Friday, 13 November 2009
XSS on new Perl.org — no input filtering??!
Thursday, 12 November 2009
On men, women and feminism
Tuesday, 10 November 2009
Hadoop article
Recently I wrote a Hadoop article in Russian for one of the most popular Russian IT blogs. After giving the idea a second thought, I translated the article into English (or rather its first part, as the second is still in progress) and uploaded it to my website, since the Posterous format isn't very good for such long articles.
Check it here: http://romankirillov.info/hadoop.html (and don't be mad at my clumsy English!)
P. S. in case you can read Russian: http://sigizmund.habrahabr.ru/blog/74792/
Monday, 9 November 2009
Building histograms using only MySQL

The query means a very simple thing: select the count of matching rows and the value of locations, grouped by the locations field. rpad does all the magic: it pads an empty string on the right with '*' characters, and the number of '*' is count(*), divided by 15 so that the bar fits on the console screen. That's it, chaps!
Friday, 6 November 2009
More Hadoop Mysteries - order of initialisation
Hey out there! Still not tired of my Hadoop experiments? Not yet? Here’s another one for you! What do you think the difference is between two snippets of code? Say, this:

    SomeCodeWhichChangesConfig.initialise(getConf());
    Job job = new Job(conf, "MyHadoopJob");
    // ... setting the job details
    if (!job.waitForCompletion(true))
    {
        System.err.println("FAILED, cannot continue");
    }

… and this:

    Job job = new Job(conf, "MyHadoopJob");
    // ... setting the job details
    SomeCodeWhichChangesConfig.initialise(getConf());
    if (!job.waitForCompletion(true))
    {
        System.err.println("FAILED, cannot continue");
    }

No difference, you say? Not quite right, sir: the difference is that whatever you do to conf after creating a job will have no further effect (and I spent a couple of hours trying to understand why the distributed cache works properly in one app and doesn’t work at all in another). That is, the Job constructor apparently copies all the data and doesn’t link your copy of the Configuration object with its own copy. Brilliant, no? So now you know. Be warned.
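A minimal sketch of the copy-on-construction behaviour described above, in plain Java with no Hadoop dependency. MiniConfiguration and MiniJob are hypothetical stand-ins, not Hadoop classes; they only model the assumption that the job takes a snapshot of the configuration in its constructor. In real Hadoop code the job's own copy should be reachable through job.getConfiguration(), so late changes can be applied there instead.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigCopyDemo {
    // Hypothetical stand-in for Hadoop's Configuration: a simple key/value store.
    static class MiniConfiguration {
        private final Map<String, String> values = new HashMap<>();

        MiniConfiguration() {}

        // Copy constructor: snapshots the entries of the other configuration.
        MiniConfiguration(MiniConfiguration other) {
            values.putAll(other.values);
        }

        void set(String key, String value) { values.put(key, value); }

        String get(String key) { return values.get(key); }
    }

    // Hypothetical stand-in for Job: it keeps its own copy of the configuration.
    static class MiniJob {
        private final MiniConfiguration conf;

        MiniJob(MiniConfiguration conf) {
            this.conf = new MiniConfiguration(conf);
        }

        MiniConfiguration getConfiguration() { return conf; }
    }

    public static void main(String[] args) {
        MiniConfiguration conf = new MiniConfiguration();
        conf.set("early", "yes");            // set before the job exists: copied
        MiniJob job = new MiniJob(conf);
        conf.set("late", "yes");             // set afterwards: the job never sees it

        System.out.println(job.getConfiguration().get("early")); // prints "yes"
        System.out.println(job.getConfiguration().get("late"));  // prints "null"

        // Changing the job's own copy does take effect.
        job.getConfiguration().set("late", "yes");
        System.out.println(job.getConfiguration().get("late"));  // prints "yes"
    }
}
```

Either order of fix works in this model: change the configuration before constructing the job, or change the job's own copy afterwards.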
Wednesday, 4 November 2009
XML input and Hadoop – custom InputFormat
Today I finally hit the task I had been scared of for so long: processing large XML files on Hadoop. I won’t tell you how long I crawled the Internet trying to find a working solution… not that anyone wants to know. Eventually, I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.

To keep things simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please get hold of the following:

- LineRecordReader from org.apache.hadoop.mapreduce.lib.input, so you can see it;
- TextInputFormat from the same package.

Now, we’re almost there. Here is the code for nextKeyValue(), which turned out to be the most critical method here. Hold on tight:

    public boolean nextKeyValue() throws IOException
    {
        StringBuilder sb = new StringBuilder();
        if (key == null)
        {
            key = new LongWritable();
        }
        key.set(pos);
        if (value == null)
        {
            value = new Text();
        }
        int newSize = 0;
        boolean xmlRecordStarted = false;
        Text tmpLine = new Text();

        while (pos < end)
        {
            newSize = in.readLine(tmpLine,
                                  maxLineLength,
                                  Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                           maxLineLength));
            if (newSize == 0)
            {
                break;
            }
            // The opening tag starts a new record
            if (tmpLine.toString().contains("<document "))
            {
                xmlRecordStarted = true;
            }
            // Inside a record: accumulate the current line
            if (xmlRecordStarted)
            {
                sb.append(tmpLine.toString().replaceAll("\n", " "));
            }
            // The closing tag completes the record
            if (tmpLine.toString().contains("</document>"))
            {
                xmlRecordStarted = false;
                this.value.set(sb.toString());
                break;
            }
            pos += newSize;
        }

        if (newSize == 0)
        {
            key = null;
            value = null;
            return false;
        }
        else
        {
            return true;
        }
    }

"WTF?" you will say. "It’s the same code?" Well, yes and no. It’s almost the same. Take a look at this line:

    if (tmpLine.toString().contains("<document"))

and this line:

    if (tmpLine.toString().contains("</document>"))

This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won’t add anything else.

Now, it’s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:

- … <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).

There are a few limitations to this approach. One of them is that if the document contains something like </document><document> it obviously won’t work. Another is that you still need to parse elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
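The splitting idea can be tried outside Hadoop with a small plain-Java sketch. XmlRecordSplitter and splitRecords are hypothetical names, the tags are passed in as parameters rather than hard-coded, and lines are joined with single spaces, which is an assumption of this sketch rather than something taken from the record reader. It shares the limitation mentioned above: a closing and an opening tag on the same line are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class XmlRecordSplitter {
    // Collects every run of lines that starts at a line containing startTag
    // and ends at the next line containing endTag (inclusive).
    // Lines outside such runs are ignored.
    static List<String> splitRecords(Iterable<String> lines, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean recordStarted = false;

        for (String line : lines) {
            if (!recordStarted && line.contains(startTag)) {
                recordStarted = true;
                sb.setLength(0);               // start a fresh record
            }
            if (recordStarted) {
                if (sb.length() > 0) {
                    sb.append(' ');            // separate lines with one space
                }
                sb.append(line.trim());
                if (line.contains(endTag)) {
                    records.add(sb.toString());
                    recordStarted = false;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<corpus>",
                "<document id=\"1\">",
                "  <title>First</title>",
                "</document>",
                "<document id=\"2\">",
                "  <title>Second</title>",
                "</document>",
                "</corpus>");

        // Prints one record per line, two records in total.
        for (String record : splitRecords(lines, "<document ", "</document>")) {
            System.out.println(record);
        }
    }
}
```

Each string returned here corresponds to what the record reader would hand to the mapper as a single value.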