Jan 28, 2011

Setting up Pseudo-Distributed Apache Hadoop 0.21.0 in 10 minutes

I'm providing this as collateral material for my Hadoop presentation at Data Day Austin. Pseudo-distributed mode is effectively a one-node Hadoop cluster. It's the best way to get started with Hadoop, because once you have a handle on the basics it's easy to modify the config to be fully distributed, and it also makes a good developer setup. You'll notice that in some of the *-site config files modified below, the path values I provide are rooted in my Hadoop install directory. That's because I have several different installations of Hadoop running on one machine.

Ready? Here we go:

1) Setup Passwordless SSH:
$ ssh-keygen -t dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Test that you can ssh without being prompted for a password:
$ ssh localhost
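
If ssh-keygen prompted you for a passphrase and you'd rather skip it entirely on this single-node dev box, you can generate the key with an empty passphrase instead:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

If ssh still prompts for a password, make sure ~/.ssh and authorized_keys aren't group or world writable:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys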

2) Download Hadoop and untar it in your desired directory (make sure your user has write permission to that directory)
$ tar -xf hadoop-0.21.0.tar.gz
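
The remaining steps refer to $HADOOP_HOME, so it's handy to export it as whatever directory you just untarred into (mine, as you'll see in the configs below, is /Users/stevewatt/hadoop-0.21.0):

$ export HADOOP_HOME=/Users/stevewatt/hadoop-0.21.0
$ cd $HADOOP_HOME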

3) Uncomment and set JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh
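
On Mac OS X (which is what I'm running, hence the /Users/... paths below) the JDK is typically reachable via the /Library/Java/Home symlink, so the uncommented line ends up looking something like this; on Linux, point it at your own JDK install directory instead:

export JAVA_HOME=/Library/Java/Home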


4) Insert the following XML between the <configuration> tags in $HADOOP_HOME/conf/core-site.xml


<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

5) Insert the following XML between the <configuration> tags in $HADOOP_HOME/conf/mapred-site.xml

For mapred.system.dir, create a $HADOOP_HOME/tmp directory (the mkdir command is shown after the XML) and substitute your own path for mine:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
  
<property>
  <name>mapred.system.dir</name>
  <value>/Users/stevewatt/hadoop-0.21.0/tmp/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property> 

<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50030</value>
  <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:51060</value>
  <description>
    The task tracker http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>
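
As mentioned above, create the directory that mapred.system.dir lives under before starting the cluster (substituting your own install path):

$ mkdir -p $HADOOP_HOME/tmp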

6) Insert the following XML between the <configuration> tags in $HADOOP_HOME/conf/hdfs-site.xml

For the properties that reflect my own personal path, create a $HADOOP_HOME/dfs directory (the mkdir command is shown after the XML) and substitute your own path for mine:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/Users/stevewatt/hadoop-0.21.0/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/Users/stevewatt/hadoop-0.21.0/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
</property>

<property>
  <name>dfs.secondary.http.address</name>
  <value>0.0.0.0:51090</value>
  <description>
    The secondary namenode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:51010</value>
  <description>
    The address where the datanode server will listen to.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:51075</value>
  <description>
    The datanode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:51020</value>
  <description>
    The datanode ipc server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
  <description>
    The address and the base port where the dfs namenode web ui will listen on.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

<property>
  <name>dfs.datanode.https.address</name>
  <value>0.0.0.0:51475</value>
</property>

<property>
  <name>dfs.https.address</name>
  <value>0.0.0.0:51470</value>
</property>
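
Likewise, create the dfs directory referenced by dfs.data.dir and dfs.name.dir; creating the data and name subdirectories up front doesn't hurt:

$ mkdir -p $HADOOP_HOME/dfs/data $HADOOP_HOME/dfs/name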

7) Change to the $HADOOP_HOME directory and format the Namenode

$ bin/hadoop namenode -format

8) Start Hadoop

$ bin/start-all.sh
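
A quick sanity check that all the daemons actually came up is jps, which ships with the JDK:

$ jps

The list it prints should include NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus the Jps process itself). If one of them is missing, check the logs under $HADOOP_HOME/logs.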

9) Open a browser and confirm that Hadoop is running by hitting the JobTracker UI at http://localhost:50030 and the Namenode UI at http://localhost:50070

10) Optionally, you can run the three-step Hadoop TeraSort job to system-test the cluster:

$ bin/hadoop jar hadoop-mapred-examples-0.21.0.jar teragen 10000 in-dir
$ bin/hadoop jar hadoop-mapred-examples-0.21.0.jar terasort in-dir out-dir
$ bin/hadoop jar hadoop-mapred-examples-0.21.0.jar teravalidate out-dir report-dir
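
All three stages read from and write to HDFS, so you can eyeball the results with the fs shell. Any out-of-order keys that teravalidate detects end up in its report output, so an empty report is a good sign:

$ bin/hadoop fs -ls out-dir
$ bin/hadoop fs -cat report-dir/part-*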
