Saturday, December 26, 2009

installing hadoop on ubuntu karmic

Mixing and matching a couple of guides, I've installed a local hadoop instance on my netbook. Here are my notes from the install process.

I'll refer to the guides by number later. Doc 1 is the current #1 hit for 'ubuntu hadoop' on google, so it seemed a good spot to start.

Documents:

  1. http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
  2. http://archive.cloudera.com/docs/_apt.html
  3. http://github.com/spazm/config/tree/master/hadoop/conf/

1) created a hadoop user and group, as per document 1. Also created an ssh key for the hadoop user (currently passwordless; I'll revisit that soon).
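For reference, the user/group and key setup from doc 1 is roughly this (flags from memory, so double-check against the guide):

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
sudo -i -u hadoop
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost    # accept the host key once so start-all.sh can ssh in later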

2) added the jaunty-testing repo from cloudera, see doc 2. They don't have a karmic package yet, so jaunty it is. Add /etc/apt/sources.list.d/cloudera.list, then import their archive key (below):


#deb http://archive.cloudera.com/debian karmic-testing contrib
#deb-src http://archive.cloudera.com/debian karmic-testing contrib
#no packages for karmic yet, trying jaunty-testing, jaunty-stable, jaunty-cdh1 or jaunty-cdh2
deb http://archive.cloudera.com/debian jaunty-testing contrib
deb-src http://archive.cloudera.com/debian jaunty-testing contrib
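
Then import cloudera's archive key so apt trusts the repo (command per doc 2, from memory):

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -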

3) install hadoop:

[andrew@mini]% sudo aptitude install hadoop                                                                   0 ~/src
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information      
Initializing package states... Done
"hadoop" is a virtual package provided by:
  hadoop-0.20 hadoop-0.18 
You must choose one to install.
No packages will be installed, upgraded, or removed.
0 packages upgraded, 0 newly installed, 0 to remove and 25 not upgraded.
Need to get 0B of archives. After unpacking 0B will be used.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information      
Initializing package states... Done

3b) sudo aptitude update, sudo aptitude install hadoop-0.20

[andrew@mini]% sudo aptitude install hadoop-0.20                                                              0 ~/src
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information      
Initializing package states... Done
The following NEW packages will be installed:
  hadoop-0.20 hadoop-0.20-native{a} 
0 packages upgraded, 2 newly installed, 0 to remove and 25 not upgraded.
Need to get 20.1MB of archives. After unpacking 41.9MB will be used.
Do you want to continue? [Y/n/?] Y
Writing extended state information... Done
[... snip ...]
Initializing package states... Done
Writing extended state information... Done
4) this has set up our config information in /etc/hadoop-0.20, also symlinked as /etc/hadoop/
hadoop-env.sh is loaded from /etc/hadoop/conf/hadoop-env.sh (aka /etc/hadoop-0.20/conf.empty/hadoop-env.sh)

Modify hadoop-env.sh to point at our JVM. Since I installed Sun Java 1.6 (aka Java 6), I updated it to: export JAVA_HOME=/usr/lib/jvm/java-6-sun
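In context, the edit in /etc/hadoop/conf/hadoop-env.sh looks roughly like this (the stock file ships with the line commented out):

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun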

5) update the rest of the configs.
Snapshotted conf.empty to ~/config/hadoop/conf, started making edits as per doc 1, and symlinked it into /etc/hadoop/conf.

Files are available at document #3, my github config project, in the hadoop/conf subdir; the key properties are sketched below.
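The gist of the edits (the real files are in the github repo; the ports are the ones doc 1 uses, and hadoop.tmp.dir matches the ~hadoop/tmp dir created in step 7). Each property block goes inside the <configuration> element of the named file:

core-site.xml:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

mapred-site.xml:
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>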

6) switch to hadoop user
sudo -i -u hadoop

7) initialize hdfs (as hadoop user)
mkdir ~hadoop/tmp
chmod a+rwx ~hadoop/tmp
hadoop namenode -format

8) fire it up: (as hadoop user)

/usr/lib/hadoop/bin/start-all.sh
hadoop@mini:/usr/lib/hadoop/logs$ /usr/lib/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@mini:/usr/lib/hadoop/logs$ /usr/lib/hadoop/bin/start-all.sh
starting namenode, logging to /usr/lib/hadoop/bin/../logs/hadoop-hadoop-namenode-mini.out
localhost: starting datanode, logging to /usr/lib/hadoop/bin/../logs/hadoop-hadoop-datanode-mini.out
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-mini.out
starting jobtracker, logging to /usr/lib/hadoop/bin/../logs/hadoop-hadoop-jobtracker-mini.out
localhost: starting tasktracker, logging to /usr/lib/hadoop/bin/../logs/hadoop-hadoop-tasktracker-mini.out

9) Check that it is running via jps
hadoop@mini:/usr/lib/hadoop/logs$ jps
12001 NameNode
12166 DataNode
12684 Jps
12568 TaskTracker
12409 JobTracker
12332 SecondaryNameNode

(note to self, why don't we have hadoop completion in zsh? Must rectify)

10) Run the wordcount example. See doc 1:
hadoop jar hadoop-0.20.0-examples.jar wordcount gutenberg gutenberg-output
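
(Not shown above: the example needs the gutenberg texts from doc 1 sitting in hdfs first, roughly:)

hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg    # local path is wherever the ebooks were downloaded
hadoop dfs -ls gutenberg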

hadoop@mini:~/install$ hadoop jar hadoop-0.20.1+152-examples.jar wordcount gutenberg gutenberg-output
09/12/25 23:24:19 INFO input.FileInputFormat: Total input paths to process : 3
09/12/25 23:24:20 INFO mapred.JobClient: Running job: job_200912252310_0001
09/12/25 23:24:21 INFO mapred.JobClient:  map 0% reduce 0%
09/12/25 23:24:33 INFO mapred.JobClient:  map 66% reduce 0%
09/12/25 23:24:39 INFO mapred.JobClient:  map 100% reduce 0%
09/12/25 23:24:42 INFO mapred.JobClient:  map 100% reduce 33%
09/12/25 23:24:48 INFO mapred.JobClient:  map 100% reduce 100%
09/12/25 23:24:50 INFO mapred.JobClient: Job complete: job_200912252310_0001
...

hadoop@mini:~/install$ hadoop dfs -ls gutenberg-output
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2009-12-25 23:24 /user/hadoop/gutenberg-output/_logs
-rw-r--r--   1 hadoop supergroup      21356 2009-12-25 23:24 /user/hadoop/gutenberg-output/part-r-00000

It Lives!
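
To actually look at the word counts, cat the part file out of hdfs or pull it down locally:

hadoop dfs -cat gutenberg-output/part-r-00000 | head
hadoop dfs -copyToLocal gutenberg-output/part-r-00000 /tmp/gutenberg-wordcounts.txt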
