Wednesday, 18 November 2009

Python WTF?

>>> for i in range (0, 10):
...     hash(i)
... 
0
1
2
3
4
5
6
7
8
9
>>> hash(123324)
123324
>>> hash(785345345436845768)
785345345436845768
>>> 

WTF?
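
For what it's worth, this looks like intended behaviour rather than a bug. A minimal sketch, assuming CPython 3.2 or later (the session above is from an older interpreter, where the details may differ): the hash of a non-negative int is the value itself reduced modulo a fixed prime, exposed as sys.hash_info.modulus, so any int below that prime hashes to itself. The helper name int_hash_sketch is mine and purely illustrative.

```python
import sys

# Assumption: CPython 3.2+, where ints are hashed modulo a fixed prime.
P = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds


def int_hash_sketch(n):
    """Hypothetical re-implementation of hash() for non-negative ints."""
    return n % P


# Small ints are below the modulus, so they hash to themselves.
for i in range(0, 10):
    assert hash(i) == int_hash_sketch(i) == i

# The first non-negative int that does not hash to itself is the modulus.
print(hash(P))      # 0
print(hash(P + 1))  # 1
```

So the surprise only wears off once the numbers get really big.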

Tuesday, 17 November 2009

I Love Wikipedia

In mathematics, Stirling numbers of the second kind, together with Stirling numbers of the first kind, are one of the two types of Stirling numbers.

Isn't it a truly beautiful definition?

iPhone reinstall, contacts and SMS

I'll make a long story short today: I did a "new iPhone"-style firmware update, i.e. without restoring everything from the backup. The iPhone started to work like a dream on 3.1.2. But I lost a few things which I didn't want to lose, in particular my SMS messages.

I found a backup of my iPhone which was made yesterday. Next I found a brilliant utility named iPhone Backup Extractor. I'll tell you what: normally I don't buy the paid version of software if the free version does what I need. This app doesn't have a paid version, so I just went and donated €5.00 to the author. It's a brilliant piece of technology, dead simple, and it works. All it does is extract the data from the backup files (there are zillions of them in the iPhone Backup directory) in a form in which they can be put back on the iPhone. Well, that's what I did. PhoneView allowed me to copy and paste these files where they belong, and the next time I fired up my "Messages" application, I was pleasantly surprised to see each and every SMS since I purchased my iPhone.

Dear Mr Padraig Kennedy — you have saved my day. And I appreciate it. My best regards to you and to what you're doing.




Friday, 13 November 2009

XSS on new Perl.org — no input filtering??!

Apparently, the guys from Perl should really learn how to filter input (maybe they have closed this hole by now, but at the moment of writing the link produces the following picture):


Thursday, 12 November 2009

On men, women and feminism

Often, when I read articles like this, I want to ask a question: why do some people, who are (presumably) sane and educated, fail to understand one tiny little thing: men and women are indeed different. No matter how hard feminists try, they will never change that. I agree that sexism is bad. I agree that there should be equal opportunities for men and women. Furthermore, I strongly believe that people are born equal, irrespective of their gender, skin colour, religion or habits.

However, when I hear something like "this can be offensive to women!" I want to say only one thing: in most cases it is offensive not to women, but to one particular woman (and this woman was actively looking for a way to get offended and shout about it). Why are lots of women happy to live their lives, and be happy in their own way, rather than fight for some miraculous women's rights? Megan Fox said one very good thing (in that article): "...women should be empowered by it, not degraded". And I think that's brilliantly true: we're born equal, yet different, and it would be a very miserable world if there were no difference between men and women.

(Also, and this is my personal opinion and I don't mean to offend anyone: if a woman acts like a man, looks like a man, speaks like a man, and demands to be treated like a man, then there's something clearly wrong with this woman.)

Tuesday, 10 November 2009

Hadoop article

Recently I wrote a Hadoop article in Russian for a very popular Russian IT blog. After giving the idea a second thought, I translated the article (or rather its first part, as the second is still in progress) into English and uploaded it to my website (the Posterous format isn't very good for such long articles).

Check it here: http://romankirillov.info/hadoop.html

(and don't be mad at my clumsy English!)


 


P.S. In case you can read Russian: http://sigizmund.habrahabr.ru/blog/74792/

Monday, 9 November 2009

Building histograms using only MySQL


The query does a very simple thing: it selects the count of matching rows and the value of locations, grouped by the locations field. rpad does all the magic: it pads an empty string on the right with '*' characters, and the number of '*' is count(*). The count is divided by 15 so that the bar fits on the console screen. That's it, chaps!
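
If you'd like to play with the same idea without a database, here is a rough sketch in Python rather than MySQL. It is not the query itself: the function name ascii_histogram, the sample data and the use of integer division for the scaling are all my assumptions. Counter stands in for GROUP BY plus count(*), and string repetition stands in for rpad.

```python
from collections import Counter


def ascii_histogram(values, scale=15):
    """Return one 'count  value  bar' line per distinct value."""
    counts = Counter(values)            # plays the role of GROUP BY + count(*)
    lines = []
    for value, count in sorted(counts.items()):
        bar = "*" * (count // scale)    # plays the role of rpad('', count(*)/15, '*')
        lines.append(f"{count:>6}  {value:<10} {bar}")
    return lines


# Made-up sample data, purely for illustration.
sample = ["London"] * 90 + ["Moscow"] * 45 + ["Paris"] * 15
for line in ascii_histogram(sample):
    print(line)
```

The scale parameter does the same job as the division by 15: pick whatever keeps the longest bar inside your terminal width.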

Friday, 6 November 2009

More Hadoop Mysteries - order of initialisation


Hey out there! Still not tired of my Hadoop experiments? Not yet? Here’s another one for you!


What do you think the difference is between these two snippets of code? Say, this:


Configuration conf = getConf();
SomeCodeWhichChangesConfig.initialise(conf); // conf is changed BEFORE the job is created
Job job = new Job(conf, "MyHadoopJob");

// ... setting the job details

if (!job.waitForCompletion(true))
{
    System.err.println("FAILED, cannot continue");
}


… and this:


Configuration conf = getConf();
Job job = new Job(conf, "MyHadoopJob");

// ... setting the job details

SomeCodeWhichChangesConfig.initialise(conf); // conf is changed AFTER the job is created
if (!job.waitForCompletion(true))
{
    System.err.println("FAILED, cannot continue");
}


No difference, you say? Not quite right, sir: whatever you do to conf after creating the job has no effect on the job. That is, the Job constructor apparently copies all the data and doesn’t link your copy of the Configuration object with its own copy. Brilliant, no?


(I spent a couple of hours trying to understand why the distributed cache worked properly in one app and didn’t work at all in another.) So now you know. Be warned.
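
To make the trap easier to see, here is a tiny model of the same behaviour in plain Python. It is not Hadoop code: the Job class, the get_configuration method and the "cache.files" key are hypothetical stand-ins that only mimic a constructor taking a private copy of its configuration. As far as I know, the real Job exposes its own copy through job.getConfiguration(), which is the analogue of the last step here.

```python
import copy


class Job:
    """Hypothetical stand-in for a job that snapshots its configuration."""

    def __init__(self, conf, name):
        self.name = name
        self._conf = copy.deepcopy(conf)  # private copy, not a reference

    def get_configuration(self):
        return self._conf


conf = {"cache.files": []}
conf["cache.files"].append("early.txt")   # before construction: the job sees it

job = Job(conf, "MyHadoopJob")
conf["cache.files"].append("late.txt")    # after construction: the job never sees it

print(job.get_configuration()["cache.files"])  # ['early.txt']

# Changes made through the job's own copy do take effect.
job.get_configuration()["cache.files"].append("via-job.txt")
print(job.get_configuration()["cache.files"])  # ['early.txt', 'via-job.txt']
```

The rule of thumb that falls out of this: finish touching your configuration before you construct the job, or go through the job's own copy afterwards.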


Wednesday, 4 November 2009

XML input and Hadoop – custom InputFormat


Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won’t tell you how long I crawled the Internet trying to find a working solution… not that anyone wants to know. Eventually I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.


To keep things simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:



  1. Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it.

  2. Open TextInputFormat from the same package.

  3. Create an input format and a record reader of your own, just by copying and pasting the code from the aforementioned classes.

  4. Change the createRecordReader() method of your input format class so that it returns your newly defined record reader.


Now we’re almost there. Here is the code for nextKeyValue(), which turned out to be the most critical method. Hold on tight:


public boolean nextKeyValue() throws IOException
{
    StringBuilder sb = new StringBuilder();
    if (key == null)
    {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null)
    {
        value = new Text();
    }
    int newSize = 0;

    boolean xmlRecordStarted = false;
    Text tmpLine = new Text();

    while (pos < end)
    {
        // Read the next line, exactly as LineRecordReader does
        newSize = in.readLine(tmpLine,
                              maxLineLength,
                              Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                       maxLineLength));

        if (newSize == 0)
        {
            break; // end of input
        }

        // Count the consumed bytes before any early exit below
        pos += newSize;

        if (tmpLine.toString().contains("<document "))
        {
            xmlRecordStarted = true;
        }

        if (xmlRecordStarted)
        {
            sb.append(tmpLine.toString().replaceAll("\n", " "));
        }

        if (tmpLine.toString().contains("</document>"))
        {
            // Closing tag: emit everything collected so far as one record
            xmlRecordStarted = false;
            this.value.set(sb.toString());
            break;
        }
    }

    if (newSize == 0)
    {
        key = null;
        value = null;
        return false;
    }
    else
    {
        return true;
    }
}


WTF, you will say? It’s the same code? Well, yes and no. It’s almost the same. Take a look at this line:


if (tmpLine.toString().contains("<document "))

and this line:


if (tmpLine.toString().contains("</document>")) 

This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won’t add anything else.
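
If the Hadoop plumbing gets in the way, the splitting logic on its own can be sketched in a few lines of Python. This is only an illustration of the idea, not the record reader: the function name split_documents and the sample lines are made up, and the tags are passed as parameters to show how they could be made configurable.

```python
def split_documents(lines, start_tag="<document ", end_tag="</document>"):
    """Yield one string per record, joining the lines between the two tags."""
    buffer = []
    started = False
    for line in lines:
        if start_tag in line:
            started = True
        if started:
            buffer.append(line.rstrip("\n"))
        if end_tag in line:
            if started:
                yield " ".join(buffer)
            buffer = []
            started = False


# Made-up input, purely for illustration.
sample = [
    '<documenttype name="x"/>\n',   # ignored: no space after "<document"
    '<document id="1">\n',
    '  <title>First</title>\n',
    '</document>\n',
    '<document id="2">\n',
    '</document>\n',
]
records = list(split_documents(sample))
print(len(records))  # 2
```

Everything between an opening line and a closing line ends up in one record, and anything outside a record is silently dropped.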


Now, it’s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:



  1. It uses very little custom code (remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit from the class: some fields are private, and we clearly want to modify them.

  2. It’s configurable: you can easily change the <document and </document> strings to anything else (and again, I will do it tomorrow, but right now I feel too lazy).

  3. It works.


There are a few limitations to this approach. One of them is that if the document contains something like </document><document> on a single line, it obviously won’t work. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
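
As for the parsing that is left to the mapper, here is a rough idea of what it amounts to, sketched in Python with the standard xml.etree.ElementTree module instead of a Java mapper. The function name parse_record and the title and body elements are made up for the example; a real record will have whatever elements your documents have.

```python
import xml.etree.ElementTree as ET


def parse_record(record):
    """Turn one <document> record string into a plain dict."""
    root = ET.fromstring(record)
    fields = dict(root.attrib)          # attributes of <document> itself
    for child in root:
        fields[child.tag] = (child.text or "").strip()
    return fields


# Made-up record, in the one-line form the splitter produces.
record = '<document id="1"> <title>First</title> <body>Hello</body> </document>'
print(parse_record(record))  # {'id': '1', 'title': 'First', 'body': 'Hello'}
```

Doing this inside the record reader instead would mean the mapper receives ready-made fields rather than a raw XML string.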



Have fun!


Update: As you can see, I have added a space to the "<document " string constant. Today I realised that "<documenttype" elements had also been used as split points, producing inconsistent results.

Huey colour calibration, pink colours and my Macs

I've got a thingie, you know... this colour calibration thingie named Huey. I got it a while ago as a birthday gift from my friends, but when I switched to Mac two years ago it suddenly stopped working (well, it did work, but it produced a horrible pink cast: I couldn't use it and I didn't want to). Yet a couple of weeks ago I decided that things couldn't go on this way and sent a request to Pantone to do something about it.


So, after some checks they sent me a replacement. I tried it at home and, yes, the same pink cast. Completely devastated, I decided to take it to work and try it with my 24" HP monitor (connected to my MacBook).

I took it. And I tried. And it did work!!! Now my MacBook monitor and HP monitor look, if not the same, then at least very similar. Actually, I used a very simple trick to make the non-pro version of Huey work with two monitors: apparently the Pantone guys didn't know that you can change the primary monitor (and it calibrates only the primary monitor) just by dragging the menu bar in the monitor settings. I did it twice and it worked like a dream!

Seriously, the colours are not perfect. I can see it, even though probably none of my colleagues would see any difference after working with a calibrated monitor for a couple of hours. But it's much better than the original non-calibrated colours, and it makes me happy.

One day I'll buy a semi-pro Eye One Display, but until then I'll live with my Huey.