Recently I needed to run some heavy clustering computation on a relatively big dataset. Since Mahout (a scalable machine learning framework) already provides the required capabilities and includes implementations of the base clustering algorithms, I decided to use it as a starting point. And because Mahout is Hadoop based, I had to set up a cluster of Hadoop nodes to be able to execute my clustering task.
So here I'll try to write down the steps required for a distributed Hadoop cluster setup; for the sake of simplicity I'll describe a setup with only two nodes: a master and a slave. In this blog post I am going to describe manual installation and configuration, while in the next one I'll describe automating the installation and configuration using the Puppet and Vagrant tools. I will describe the installation process in the context of Ubuntu 12.10 server, though I believe the same steps will work for other distributions as well.
Here are the steps required to install and configure Hadoop so that distributed tasks can be executed on the cluster nodes:
For the next few steps, assume we have two computers with the following IP addresses: 192.168.17.1 (master) and 192.168.17.2 (slave).
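To address the nodes by name instead of raw IP, it can be convenient to add entries to /etc/hosts on both machines. The hostnames "master" and "slave" here are my own choice for illustration; the rest of this guide sticks to the raw IPs:

```
# /etc/hosts (on both nodes)
192.168.17.1    master
192.168.17.2    slave
```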
Open the file $HADOOP_HOME/conf/core-site.xml and set its content to:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.17.1:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>
Next, open $HADOOP_HOME/conf/hdfs-site.xml and set its content to:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when the file is created.</description>
  </property>
</configuration>
Now open $HADOOP_HOME/conf/mapred-site.xml and set its content to:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.17.1:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>
And finally you need to change two more files: $HADOOP_HOME/conf/masters and $HADOOP_HOME/conf/slaves. It is not too hard to guess what the content of each of these should be:
I've put 192.168.17.1 in both files, since I'd like the master node to execute computational tasks and hold distributed data as well.
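With that note in mind, the two files could look as follows. The post only states that the master's IP goes into both files; listing the slave's IP (192.168.17.2) in conf/slaves as well is my assumption, since the slave node must appear there to run DataNode and TaskTracker daemons:

```
# $HADOOP_HOME/conf/masters
192.168.17.1

# $HADOOP_HOME/conf/slaves
192.168.17.1
192.168.17.2
```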
Now we proceed to the final steps: SSH configuration and the actual Hadoop startup.
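These final steps can be sketched roughly as below. This is an assumption-laden outline, not the post's exact procedure: it assumes a Hadoop 1.x layout with control scripts in $HADOOP_HOME/bin, a dedicated hadoop user on both nodes, and that ssh-copy-id is available:

```shell
# On the master: generate a passwordless SSH key for the hadoop user
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize it for login to the master itself and to the slave
ssh-copy-id hadoop@192.168.17.1
ssh-copy-id hadoop@192.168.17.2

# Format HDFS once, on the master only
$HADOOP_HOME/bin/hadoop namenode -format

# Start the HDFS and MapReduce daemons across the cluster
$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/start-mapred.sh

# Check which Java daemons are running on each node
jps
```

The start scripts read conf/masters and conf/slaves and use SSH to launch the appropriate daemons on each listed node, which is why the passwordless key setup has to come first.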
Here you can find more details and explanations on how to configure and setup Hadoop cluster.
Obviously it would be ridiculous to repeat all these steps every time I need to set up a new Hadoop cluster, so in my next blog post I'll describe how to set up a Hadoop cluster using Vagrant and Puppet to automate this procedure.