Recently I tackled the task of computing the HITS algorithm on a large graph that represents different types of links among documents. The HITS algorithm was designed by Jon Kleinberg of Cornell University; you can read more details on Wikipedia or in his original paper “Authoritative Sources in a Hyperlinked Environment”.

Briefly, the main idea is: given a directed graph $G=(V,E)$, we define $B_{in}(v)=\{u \mid (u,v) \in E\}$, the set of vertices that link to node $v$. Similarly, we define $B_{out}(v)=\{u \mid (v,u) \in E\}$, the set of vertices that node $v$ points to. With these definitions we can state the recursive relationships used to compute the authority and hub scores at iteration $n+1$:

  • The “authority” score is defined as $a_{n+1}(v)=\frac{1}{\sqrt{\sum_{u \in B_{in}(v)}h_n(u)^2}}\sum_{u \in B_{in}(v)}h_{n}(u)$
  • And the hub score is $h_{n+1}(v)=\frac{1}{\sqrt{\sum_{u \in B_{out}(v)}a_n(u)^2}}\sum_{u \in B_{out}(v)}a_{n}(u)$

(we have to normalize the weights in order for the process to converge)
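
To make the update rules concrete, here is a minimal in-memory sketch in Java that applies the two formulas above to a tiny made-up graph (the node names and edges are purely illustrative, and everything fits in memory, unlike the real dataset):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HitsSketch {

    // Directed edges of a tiny example graph (made-up data, for illustration only).
    static final String[][] EDGES = {{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "a"}};

    public static void main(String[] args) {
        Map<String, List<String>> in = new HashMap<>();   // v -> B_in(v)
        Map<String, List<String>> out = new HashMap<>();  // v -> B_out(v)
        Map<String, Double> auth = new HashMap<>();
        Map<String, Double> hub = new HashMap<>();

        for (String[] e : EDGES) {
            out.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            in.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
            auth.put(e[0], 1.0); auth.put(e[1], 1.0);
            hub.put(e[0], 1.0);  hub.put(e[1], 1.0);
        }

        for (int iter = 0; iter < 20; iter++) {
            // a_{n+1}(v): normalized sum of hub scores of the nodes linking to v
            Map<String, Double> newAuth = new HashMap<>();
            for (String v : auth.keySet()) {
                double sum = 0, norm = 0;
                for (String u : in.getOrDefault(v, List.of())) {
                    sum += hub.get(u);
                    norm += hub.get(u) * hub.get(u);
                }
                newAuth.put(v, norm > 0 ? sum / Math.sqrt(norm) : 0.0);
            }
            // h_{n+1}(v): normalized sum of authority scores of the nodes v links to
            Map<String, Double> newHub = new HashMap<>();
            for (String v : hub.keySet()) {
                double sum = 0, norm = 0;
                for (String u : out.getOrDefault(v, List.of())) {
                    sum += auth.get(u);
                    norm += auth.get(u) * auth.get(u);
                }
                newHub.put(v, norm > 0 ? sum / Math.sqrt(norm) : 0.0);
            }
            auth = newAuth;
            hub = newHub;
        }
        System.out.println("authority: " + auth);
        System.out.println("hub: " + hub);
    }
}
```

Note that both new score maps are computed from the values of iteration $n$ and only then swapped in, so the authority and hub updates stay in step.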

Now, since we are talking about a huge graph, the very first thing that came to my mind was to set up a Hadoop cluster, model the graph, and run the computation on it. The fact that the HITS computation fits the map-reduce model naturally makes this approach even more appealing.

One of my goals was to be able to view the results almost immediately, without any additional tooling or code, while still being able to change the model and different parts of the computation process. Well, after a few hours spent reading the Hadoop manuals I understood that the task was not going to be as easy as expected, since the class that represents a graph entity has to implement the Writable interface so that the Hadoop framework can serialize and deserialize objects to and from HDFS (the Hadoop Distributed File System). It got worse when I realized there is no easy way to view the results of an execution: I would have to write my own viewer for the serialized results in HDFS (I’m not saying this is too complicated or hard; I just wanted to keep things as easy as possible and write as little code as possible). At this point I recalled that MongoDB has its own map-reduce engine, so I decided to spend a few more minutes understanding how it works and whether I could leverage it. After several minutes of reading I understood that this was exactly what I was looking for:

  • I can set up a MongoDB cluster in order to process a large dataset.
  • Writing map-reduce functionality is as simple as writing two JavaScript functions (there is a small sketch below).
  • I can use the mongo client to connect to the database and execute queries in order to see the results.
  • In case I need to adjust something in the algorithm, or add something, there is no need to recompile code (as with Hadoop): I just rewrite the map or reduce function and execute it immediately from the MongoDB client.

Here I’ll try to explain exactly what I did and how it works.
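
To give a feel for what those two JavaScript functions can look like, here is a minimal sketch of one half-step of HITS (summing the current hub scores of each node’s in-links, without the normalization), driven from the legacy MongoDB Java driver. The `links` collection, its `{from, to, hub}` document shape, the `hits` database and the `authority` output collection are assumptions I’m making for the example, not details of my actual setup; the same two functions can just as well be pasted directly into the mongo shell.

```java
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.MongoClient;

public class HitsMapReduceSketch {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("hits");
        DBCollection links = db.getCollection("links"); // documents: {from, to, hub} (assumed schema)

        // map: every edge (u, v) contributes u's current hub score to v's authority score
        String map = "function() { emit(this.to, this.hub); }";
        // reduce: sum all contributions arriving at the same target node
        String reduce = "function(key, values) { return Array.sum(values); }";

        MapReduceCommand cmd = new MapReduceCommand(
                links, map, reduce, "authority",
                MapReduceCommand.OutputType.REPLACE, null);
        MapReduceOutput out = links.mapReduce(cmd);

        // Results land in the "authority" collection and can also be inspected from the mongo shell
        out.results().forEach(doc -> System.out.println(doc));
        mongo.close();
    }
}
```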

Guava is a library developed by Google to bridge gaps in the Java SDK Collections APIs. It introduces quite a lot of techniques and development patterns that one would expect to find in the regular Collections API but, apparently, cannot. I will try to show a few techniques that help me reduce boilerplate code in my projects.
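
As a small taste of what I mean (a generic, self-contained example rather than code from a particular project), here is how Splitter, Joiner, Multimap and ImmutableList replace the usual hand-rolled loops and null checks:

```java
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multimap;

import java.util.List;

public class GuavaTaste {
    public static void main(String[] args) {
        // Splitter: tolerant parsing of messy input without manual trim/empty checks
        List<String> tags = Splitter.on(',')
                .trimResults()
                .omitEmptyStrings()
                .splitToList("hadoop, mongodb, ,  guava,");
        System.out.println(tags); // [hadoop, mongodb, guava]

        // Joiner: the reverse operation in one line
        System.out.println(Joiner.on(" | ").join(tags)); // hadoop | mongodb | guava

        // Multimap: a map from key to a collection of values, without the
        // usual "get, check for null, create list, put back" dance
        Multimap<String, String> linksByType = ArrayListMultimap.create();
        linksByType.put("cites", "doc-1");
        linksByType.put("cites", "doc-2");
        linksByType.put("mentions", "doc-3");
        System.out.println(linksByType.get("cites")); // [doc-1, doc-2]

        // ImmutableList: defensive copies made explicit and cheap
        List<String> frozen = ImmutableList.copyOf(tags);
        System.out.println(frozen);
    }
}
```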

Recently I bumped into the need to run Matlab code on Amazon’s cloud infrastructure. Moreover, my Matlab application has to communicate with the rest of the world, so I had to find a third-party communication library for Matlab to do the work. It’s a long story why I need this, but while googling around I found a series of blog entries that explains in detail how to do it, and basically also explains why I needed to do it:

  1. Using Amazon EC2 to speed up matlab optimisation
  2. Writing a socket interface in Matlab to send / receive the commands
  3. Setting up an Ubuntu EC2 instance to run the compiled Matlab code
  4. Writing matlab software to communicate with the server.

As you can probably guess, had this been a straightforward process I would never have spent my time writing this blog entry. Since I had to solve a couple of puzzles and work around quite a few tricky issues, I’d like to record my experience here; hopefully it will save other people time later.
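
I won’t reproduce those posts here, but to give a rough idea of what a socket interface for sending and receiving commands amounts to, here is a purely illustrative client-side sketch in Java (not the Matlab code from the series above); the host, port and command string are invented for the example:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class MatlabCommandClient {
    public static void main(String[] args) throws Exception {
        // The host, port and newline-terminated text command are assumptions for
        // illustration; the real protocol is whatever the Matlab-side socket
        // interface from the linked posts defines.
        try (Socket socket = new Socket("ec2-host.example.com", 5000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println("run_optimization param1=0.5 param2=10"); // hypothetical command
            System.out.println("server replied: " + in.readLine());
        }
    }
}
```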

In my two previous blog posts I explained:

  1. Setup Hadoop cluster – the manual configuration of the virtual machines needed to set up a Hadoop cluster.
  2. Install Java on VM using vagrant and puppet – where I started to explain how to leverage Vagrant and Puppet in order to have Java installed on a virtual machine.

Now, in this post, I’ll try to explain how to set up a Hadoop cluster in a virtual environment using Vagrant and Puppet scripts, so that all the actions from my first post can be easily automated.

In their day-to-day work, developers face deadlines, project commitments, bug fixing, maintenance and support. There is never enough time for code improvements and refactoring. I wish I had a free day on which I could practice and improve my development skills without all these duties hanging over my head.

So here it comes again:

On the 14th of December there will be the Global Day of Code Retreat event, which this year is hosted by the eBay Israel Innovation Center. You can find more on the about page and take a look at the following video:

Do not miss it, highly recommended.

As part of my work I need to be able to test different Java web applications. I need to run some tests on more than a single computer, hence I need a rapid way of setting up several Linux-based computers/servers with Java configured and running; usually I use Ubuntu-based distributions.

Therefore I’ve chosen the winning couple of Puppet and Vagrant. You’re more than welcome to read about these tools; I’ll skip the installation guide for them, since Google has plenty of tutorials explaining how to do it for different platforms, so you just need to pick the right one for you.

Recently I needed to execute some heavy clustering computation on a relatively big dataset. Since Mahout (a scalable machine-learning framework) already has all the required capabilities and includes implementations of the basic clustering algorithms, I decided to use it as a starting point, and because Mahout is Hadoop-based I had to set up a cluster of Hadoop nodes to be able to execute my clustering task.

So here I’ll try to record the steps required for a distributed setup of a Hadoop cluster; for the sake of simplicity I’ll describe a setup with only two nodes: a master and a slave. In this blog post I am going to describe the manual install and configuration, while in the next one I’ll describe automating the configuration and install using the Puppet and Vagrant tools. I will describe the installation process in the context of Ubuntu 12.10 Server, though I believe the same steps will work for other distributions as well.