Tuesday, January 5, 2010

Hadoop Streaming - running a job

Logs from running a Hadoop Streaming job in a freshly set up environment:
  1. copy input files
  2. run streaming jar (now located in contrib/streaming)
  3. hope.

1) copy input files to dfs:
hadoop dfs -copyFromLocal examples/wordcount wordcount

2) run streaming job:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
-input wordcount \
-output wordcountout \
-mapper examples/wordcount/map.pl \
-reducer examples/wordcount/reduce.pl

packageJobJar: [/home/hadoop/tmp/hadoop-hadoop/hadoop-unjar5876487782773207253/] [] /tmp/streamjob4555454909817451366.jar tmpDir=null
10/01/05 03:29:34 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/05 03:29:35 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/tmp/hadoop-hadoop/mapred/local]
10/01/05 03:29:35 INFO streaming.StreamJob: Running job: job_201001050303_0003
10/01/05 03:29:35 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:29:35 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0003
10/01/05 03:29:35 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0003
10/01/05 03:29:36 INFO streaming.StreamJob: map 0% reduce 0%
10/01/05 03:30:19 INFO streaming.StreamJob: map 100% reduce 100%
10/01/05 03:30:19 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:30:19 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0003
10/01/05 03:30:19 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0003
10/01/05 03:30:19 ERROR streaming.StreamJob: Job not Successful!
10/01/05 03:30:19 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Sigh. Failure. Time to debug the job. It definitely needs the -file flag to ship the executables to the remote machines:
-file map.pl
-file reduce.pl

Step back and run a simpler job.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar -input wordcount -output wordcountout3 -mapper /bin/cat -reducer /bin/wc


10/01/05 03:36:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0004
10/01/05 03:36:17 ERROR streaming.StreamJob: Job not Successful!
10/01/05 03:36:17 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

This is the example code. Why did it fail again? Is there a problem with my input file?
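One way to separate "is it my input?" from "is it my cluster?" is to run the same mapper and reducer locally, with sort standing in for Hadoop's shuffle. The following is only a rough sketch: the sample text is made up, and it resolves cat and wc through PATH rather than hard-coding /bin, because on some distributions wc is installed as /usr/bin/wc. If that is the case on the task nodes, -reducer /bin/wc would not be found there, so it is worth checking.

```shell
#!/bin/sh
# Local stand-in for the streaming pipeline: mapper | sort | reducer.
# The sample text is invented for illustration; point this at the real input file instead.
input=$(mktemp)
printf 'perl cpan perl\nruby haskell\nfoo bar\n' > "$input"

# Show where the tools actually live on this machine (paths differ between systems).
command -v cat
command -v wc

# Hadoop sorts mapper output by key before the reducer sees it; sort imitates that step.
result=$(cat "$input" | cat | sort | wc)
rm -f "$input"

# wc reports the line, word, and byte counts of everything it read.
echo "$result"
```

If this works locally but the job still fails, the input file is probably not the culprit, and the task logs behind the tracking URL are the next place to look.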

I'll pick this back up tomorrow (or later). A little birdie just told me it's 3:40am, well past my bedtime. I'd like to finish this up, but that's what I've been saying since I started at midnight. I have the Hadoop::Streaming::Mapper and ::Reducer modules packaged up and ready for an initial push to CPAN, but first I need to get an example running under Hadoop. I did finish writing the tests for the non-Hadoop case, and those are clean and ready. Feel free to follow along at my GitHub repository.

Update

Or maybe I'll just try a few more times...

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar -input wordcount -output wordcountout7 -mapper map.pl -reducer reduce.pl -file examples/wordcount/map.pl -file examples/wordcount/reduce.pl


packageJobJar: [examples/wordcount/map.pl, examples/wordcount/reduce.pl, /home/hadoop/tmp/hadoop-hadoop/hadoop-unjar390944251948922559/] [] /tmp/streamjob7610913425753318391.jar tmpDir=null
10/01/05 03:59:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/05 03:59:11 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/tmp/hadoop-hadoop/mapred/local]
10/01/05 03:59:11 INFO streaming.StreamJob: Running job: job_201001050303_0010
10/01/05 03:59:11 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:59:11 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0010
10/01/05 03:59:11 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0010
10/01/05 03:59:12 INFO streaming.StreamJob: map 0% reduce 0%
10/01/05 03:59:23 INFO streaming.StreamJob: map 100% reduce 0%
10/01/05 03:59:35 INFO streaming.StreamJob: map 100% reduce 100%
10/01/05 03:59:38 INFO streaming.StreamJob: Job complete: job_201001050303_0010
10/01/05 03:59:38 INFO streaming.StreamJob: Output: wordcountout7

Success!


hadoop dfs -ls wordcountout7

Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-01-05 03:59 /user/hadoop/wordcountout7/_logs
-rw-r--r-- 1 hadoop supergroup 125 2010-01-05 03:59 /user/hadoop/wordcountout7/part-00000


hadoop dfs -cat wordcountout7/part*

apple 2
bar 2
baz 1
c 1
c++ 2
cpan 9
foo 2
haskell 4
lang 1
lisp 1
ocaml 2
orange 2
perl 9
python 1
ruby 4
scheme 1
search 1
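
The map.pl and reduce.pl scripts are not shown in this post, so here is a rough shell stand-in for what a streaming word count does. It is not the Hadoop::Streaming implementation. The mapper writes one "word<TAB>1" line per word, the framework sorts those lines by key, and the reducer sums each run of identical keys. The function names and the sample text are made up for illustration.

```shell
#!/bin/sh
# Mapper: emit "word<TAB>1" for every whitespace-separated word on stdin.
mapper() {
    awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }'
}

# Reducer: expects input sorted by key; sums the counts of each run of equal keys.
reducer() {
    awk -F '\t' '
        NR > 1 && $1 != key { printf "%s\t%d\n", key, sum; sum = 0 }
        { key = $1; sum += $2 }
        END { if (NR > 0) printf "%s\t%d\n", key, sum }
    '
}

# sort plays the role of the shuffle between the map and reduce phases.
counts=$(printf 'perl cpan perl\nruby perl\n' | mapper | LC_ALL=C sort | reducer)
echo "$counts"
```

Because the reducer only compares each key with the previous one, it relies on the sort step; feeding it unsorted mapper output would split a word's count across several lines.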

Comments:

nicolas said...

you saved my day!

Andrew Grangaard said...

Nicolas,

I'm glad I was able to save your day! I'd love to hear more about what you're working on. What's your Hadoop project? Are you using my Hadoop::Streaming Perl module?

shrish said...

Hi,

Whenever I am trying to use Java class files as my mapper and/or reducer I am getting the following error:

java.io.IOException: Cannot run program "MapperTst.class": java.io.IOException: error=2, No such file or directory

I executed the following command on the terminal:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -file /home/hadoop/codes/MapperTst.class -mapper /home/hadoop/codes/MapperTst.class -file /home/hadoop/codes/ReducerTst.class -reducer /home/hadoop/codes/ReducerTst.class -input gutenberg/* -output gutenberg-outputtstch27

Please let me know where I am going wrong.

Regards

Shrish

Andrew Grangaard said...

Shrish,

When you use -file to include a file in your job jar, the file is placed in the task's working directory. Refer to it by a relative path (just the file name), not by its original absolute path.

e.g.
-file /home/hadoop/codes/MapperTst.class
-file /home/hadoop/codes/ReducerTst.class
-mapper MapperTst.class
-reducer ReducerTst.class
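
As a small illustration of that rule, here is a sketch of a helper that derives the -mapper and -reducer names from the paths passed to -file. The function name streaming_args is hypothetical and not part of Hadoop. It only builds the argument string, using the word count scripts from the post as the example.

```shell
#!/bin/sh
# Hypothetical helper: ship each script with -file (full local path) and
# invoke it by its basename, the name it has in the task's working directory.
streaming_args() {
    mapper_path=$1
    reducer_path=$2
    printf '%s\n' "-file $mapper_path -file $reducer_path -mapper $(basename "$mapper_path") -reducer $(basename "$reducer_path")"
}

args=$(streaming_args examples/wordcount/map.pl examples/wordcount/reduce.pl)
echo "$args"
```

The printed arguments match the flags used in the successful run in the post: full paths for -file, bare file names for -mapper and -reducer.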

Anonymous said...

Thanks, you saved my day too.

VijuPoonthottam said...

I installed Hadoop in /usr/local/hadoop
and copied range_mapper.py, range_reducer.py, and two required files.

My hadoop system contains these files

hadoop fs -ls /user/hduser/map
Warning: $HADOOP_HOME is deprecated.

Found 5 items
drwxr-xr-x - hduser supergroup 0 2012-10-21 16:16 /user/hduser/map/a-output
-rw-r--r-- 1 hduser supergroup 146 2012-10-21 14:08 /user/hduser/map/input
-rw-r--r-- 1 hduser supergroup 4 2012-10-21 14:08 /user/hduser/map/range
-rw-r--r-- 1 hduser supergroup 170 2012-10-21 14:08 /user/hduser/map/range_mapper.py
-rw-r--r-- 1 hduser supergroup 353 2012-10-21 14:08 /user/hduser/map/range_reducer.py

And when I run this command
hadoop jar contrib/streaming/hadoop-*streaming*.jar -input /user/hduser/map/* -output /user/hduser/map/a-output -mapper /home/hduser/range_mapreduce/range_mapper.py -reducer /home/hduser/range_mapreduce/range_reducer.py -file range_mapper.py -file range_reducer.py

It gives the following error:


Warning: $HADOOP_HOME is deprecated.

packageJobJar: [range_mapper.py, range_reducer.py, /app/hadoop/tmp/hadoop-unjar4576261370031348165/] [] /tmp/streamjob5718955918410559810.jar tmpDir=null
12/10/21 16:16:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/21 16:16:16 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/21 16:16:16 INFO mapred.FileInputFormat: Total input paths to process : 4
12/10/21 16:16:16 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/10/21 16:16:16 INFO streaming.StreamJob: Running job: job_201210211419_0020
12/10/21 16:16:16 INFO streaming.StreamJob: To kill this job, run:
12/10/21 16:16:16 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201210211419_0020
12/10/21 16:16:16 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201210211419_0020
12/10/21 16:16:17 INFO streaming.StreamJob: map 0% reduce 0%
12/10/21 16:16:53 INFO streaming.StreamJob: map 100% reduce 100%
12/10/21 16:16:53 INFO streaming.StreamJob: To kill this job, run:
12/10/21 16:16:53 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201210211419_0020
12/10/21 16:16:53 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201210211419_0020
12/10/21 16:16:53 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201210211419_0020_m_000000
12/10/21 16:16:53 INFO streaming.StreamJob: killJob...
Streaming Command Failed!


Please help me to fix it
