---
layout: post
title: "Putting it all together: setup Hadoop cluster on VM with Puppet."
date: 2013-09-30 00:14
comments: true
categories: hadoop cluster puppet scripts automation virtualization
---

In my two previous blog posts I explained:

  1. Setup Hadoop cluster – the manual configuration of the virtual machines needed to set up a Hadoop cluster.
  2. Install Java on VM using vagrant and puppet – how to leverage Vagrant and Puppet to get Java installed in a virtual machine.

Now in this post I'll explain how to set up a Hadoop cluster in a virtual environment using Vagrant and Puppet scripts, so that all the manual actions from my first post are automated.

I'm going to use my first two posts as a reference and assume you have downloaded all the files needed; moreover, you already have the Puppet module for the Java installation and configuration.

Start as usual:

1. Create module folders:

Terminal Window
mkdir -p modules/hadoop/{files,manifests,templates}
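The brace expansion creates all three standard module subfolders in one call; you can verify the layout like this:

```shell
# One mkdir call expands to the files, manifests and templates subfolders
mkdir -p modules/hadoop/{files,manifests,templates}

# Verify the module layout
find modules -type d | sort
# modules
# modules/hadoop
# modules/hadoop/files
# modules/hadoop/manifests
# modules/hadoop/templates
```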

2. Copy hadoop files into module folder:

As I said I’m going to use files from previous post

Terminal Window
cp hadoop-1.2.1.tar.gz modules/hadoop/files/

3. Write the implementation of the Hadoop module for Puppet.

The following script will copy, unpack and set up Hadoop on the desired number of virtual machines.

init.pp
class hadoop( $masterNode, $slaveNodes, $distrFile, $hadoopHome) {

    Exec {
        path => [ "/usr/bin", "/bin", "/usr/sbin"]
    }

    file { "/tmp/${distrFile}.tar.gz":
        ensure => present,
        source => "puppet:///modules/hadoop/${distrFile}.tar.gz",
        owner  => "vagrant",
        mode   => "0755",
    }

    # Both execs are guarded with creates => $hadoopHome, so they run
    # only until the distribution has reached its final location.
    exec { "extract distr":
        cwd     => "/tmp",
        command => "tar xf ${distrFile}.tar.gz",
        creates => $hadoopHome,
        user    => "vagrant",
        require => File["/tmp/${distrFile}.tar.gz"],
    }

    exec { "move distr":
        cwd     => "/tmp",
        command => "mv ${distrFile} ${hadoopHome}",
        creates => $hadoopHome,
        user    => "vagrant",
        require => Exec["extract distr"],
    }

    # Put Hadoop on every user's PATH
    file { "/etc/profile.d/hadoop.sh":
        content => "export HADOOP_PREFIX=\"${hadoopHome}\"
export PATH=\"\$PATH:\$HADOOP_PREFIX/bin\"
"
    }

    file { "${hadoopHome}/conf/slaves":
        content => template("hadoop/slaves.erb"),
        mode    => "0644",
        owner   => "vagrant",
        group   => "vagrant",
        require => Exec["move distr"],
    }

    file { "${hadoopHome}/conf/masters":
        content => template("hadoop/masters.erb"),
        mode    => "0644",
        owner   => "vagrant",
        group   => "vagrant",
        require => Exec["move distr"],
    }

    file { "${hadoopHome}/conf/core-site.xml":
        content => template("hadoop/core-site.erb"),
        mode    => "0644",
        owner   => "vagrant",
        group   => "vagrant",
        require => Exec["move distr"],
    }

    file { "${hadoopHome}/conf/hdfs-site.xml":
        content => template("hadoop/hdfs-site.erb"),
        mode    => "0644",
        owner   => "vagrant",
        group   => "vagrant",
        require => Exec["move distr"],
    }

    file { "${hadoopHome}/conf/mapred-site.xml":
        content => template("hadoop/mapred-site.erb"),
        mode    => "0644",
        owner   => "vagrant",
        group   => "vagrant",
        require => Exec["move distr"],
    }
}
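To make the effect of the two exec resources concrete, here is roughly what they do, sketched as a local shell sequence against a dummy archive (the real ${distrFile} is the Hadoop tarball copied into the module above):

```shell
# Dummy archive standing in for hadoop-1.2.1.tar.gz
distrFile=hadoop-1.2.1
hadoopHome="$PWD/hadoop"
mkdir -p "$distrFile/conf"
tar czf "$distrFile.tar.gz" "$distrFile"
rm -r "$distrFile"

# "extract distr": the creates => $hadoopHome guard means tar only
# runs while the distribution has not reached its final location yet
[ -e "$hadoopHome" ] || tar xf "$distrFile.tar.gz"

# "move distr": rename the unpacked folder to its final home; on the
# next Puppet run both execs are skipped because $hadoopHome exists
[ -e "$hadoopHome" ] || mv "$distrFile" "$hadoopHome"

ls "$hadoopHome"   # conf
```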

4. Write puppet templates for configuration files:

You may have noticed that in order to set up the Hadoop configuration files I used Puppet's template function (you can read more here). Therefore we now need to provide the five template files to complete that part.

masters.erb
<%= @masterNode %>
slaves.erb
<% @slaveNodes.each do |node| -%>
<%= node %>
<% end -%>
core-site.erb
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<%= @masterNode %>:9000</value>
        <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.</description>
    </property>
</configuration>
hdfs-site.erb
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>The actual number of replications can be specified when the file is created.</description>
    </property>
</configuration>
mapred-site.erb
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value><%= @masterNode %>:9001</value>
        <description>The host and port that the MapReduce job tracker runs at.</description>
    </property>
</configuration>

Where @masterNode and @slaveNodes are the template variables populated from the parameters of the hadoop class.

5. The next step is to write the Vagrantfile.

Vagrantfile
Vagrant::Config.run do |config|
  config.vm.define :hadoop_master do |hadoop_config|
        hadoop_config.vm.box = "ubuntu"
        hadoop_config.vm.network :hostonly, "192.168.32.1"
        hadoop_config.vm.host_name = "master"
        hadoop_config.vm.customize [ "modifyvm", :id, "--memory", "1024"]
        hadoop_config.vm.provision :puppet, :facter => { "fqdn" => "master"} do |puppet|
                puppet.module_path = "modules"
                puppet.manifests_path = "manifests"
                puppet.manifest_file  = "hadoop.pp"
        end
  end

  config.vm.define :hadoop_slave do |hadoop_config|
        hadoop_config.vm.box = "ubuntu"
        hadoop_config.vm.network :hostonly, "192.168.32.2"
        hadoop_config.vm.host_name = "slave"
        hadoop_config.vm.customize [ "modifyvm", :id, "--memory", "1024"]
        hadoop_config.vm.provision :puppet, :facter => { "fqdn" => "slave"} do |puppet|
                puppet.module_path = "modules"
                puppet.manifests_path = "manifests"
                puppet.manifest_file  = "hadoop.pp"
        end
  end
end

6. Now we need to write some code to combine the Java installation with the Hadoop setup.

The following script initializes the Java defaults and the Hadoop installation.

hadoop.pp
class { "hadoop":
    masterNode => "192.168.32.1",
    slaveNodes => ["192.168.32.1", "192.168.32.2"],
    distrFile  => "hadoop-1.2.1",
    hadoopHome => "/home/vagrant/hadoop"
}
include java

7. Run the virtual machines.

Terminal Window
vagrant up hadoop_{master,slave}

Now, if we proceed to the last step of the Hadoop post and try to run these commands on the master node:

Terminal Window
sudo ./hadoop namenode -format
sudo ./start-all.sh

We will notice that we forgot to configure and publish SSH keys between the nodes, so the cluster cannot start up properly. Therefore we need to continue a bit more.

8. Setup ssh keys.

Run in terminal:

Terminal Window
ssh-keygen -t rsa

and provide a key path inside your module's files folder ("modules/hadoop/files/id_rsa" for example). Leave the passphrase empty, so that the nodes can connect to each other without prompting. Now open the newly generated id_rsa.pub and copy the public key to your clipboard. Then we need to add the following lines to our module's init.pp file:
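The same can be done non-interactively; this sketch (assuming OpenSSH's ssh-keygen) generates the pair straight into the module's files folder with an empty passphrase:

```shell
# Make sure the module's files folder exists
mkdir -p modules/hadoop/files

# -N "" sets an empty passphrase, -f sets the output path
ssh-keygen -t rsa -N "" -f modules/hadoop/files/id_rsa

# Print the public key; this is the value to paste into ssh_authorized_key
cat modules/hadoop/files/id_rsa.pub
```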

init.pp
    file { "/home/vagrant/.ssh/id_rsa":
        source => "puppet:///modules/hadoop/id_rsa",
        mode   => "0600",
        owner  => "vagrant",
        group  => "vagrant",
    }

    file { "/home/vagrant/.ssh/id_rsa.pub":
        source => "puppet:///modules/hadoop/id_rsa.pub",
        mode   => "0600",
        owner  => "vagrant",
        group  => "vagrant",
    }

    ssh_authorized_key { "ssh_key":
        ensure  => "present",
        key     => "paste the public key value from id_rsa.pub here (the base64 part only, without the ssh-rsa prefix and the comment)",
        type    => "ssh-rsa",
        user    => "vagrant",
        require => File["/home/vagrant/.ssh/id_rsa.pub"],
    }

9. Enjoy.

Now you can reload the virtual machines:

Terminal Window
vagrant reload hadoop_{master,slave}

then log in to the master node:

Terminal Window
vagrant ssh hadoop_master

and finally start the cluster:

Terminal Window
sudo ./hadoop namenode -format
sudo ./start-all.sh

You can find the sources on my GitHub.