Monday, January 25, 2010


I'm looking for information on using and extending Perlbal. Perlbal is an HTTP load balancer / reverse proxy from Six Apart/Danga.

There is a bit of documentation with the source code:

I've found some good info in the slides from a YAPC 2009 talk: perlbal-tutorial. Viewing the slides requires Flash or a login.

OK, now that I've read all that, how do I set it up as a simple proxy/balancer between my internal service and n (small) external hosts, with connection pooling and persistent connections from Perlbal to the external hosts?

My internal service will need to quickly send a request to each of the remote servers and then aggregate/modify the collective response. My internal service won't wait very long for the responses from the n hosts, so slow responses (slower than ~200ms?) will just be dropped from the algorithm.
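
Here is a minimal sketch of the fan-out-and-drop behavior described above, in Python purely for illustration. It covers only the aggregation side inside the internal service, not Perlbal itself. The names `fan_out` and `fake_backend`, the simulated delays, and the 200 ms deadline are all assumptions made up for the example; a real version would issue HTTP requests through the proxy instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_backend(name, delay_s):
    """Hypothetical stand-in for one remote host: answers after delay_s seconds."""
    def call():
        time.sleep(delay_s)
        return f"response from {name}"
    return call


def fan_out(backends, deadline_s=0.2):
    """Query every backend in parallel; keep only answers that beat the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(backends)))
    futures = {pool.submit(call): name for name, call in backends.items()}
    done, not_done = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block on the stragglers
    answered = {futures[f]: f.result() for f in done if f.exception() is None}
    dropped = sorted(futures[f] for f in not_done)
    return answered, dropped


if __name__ == "__main__":
    hosts = {
        "host-a": fake_backend("host-a", 0.01),
        "host-b": fake_backend("host-b", 0.05),
        "host-c": fake_backend("host-c", 0.60),  # too slow, gets dropped
    }
    answered, dropped = fan_out(hosts)
    print(sorted(answered), dropped)
```

The deadline applies to the whole batch rather than to each host separately, which matches the "don't wait very long, then move on" idea. Whatever has not answered by then is simply left out of the aggregate.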

... investigation continues.

Saturday, January 23, 2010

Reading List Updates.

I can't go home and not go to Powell's. I can't go to Powell's and not buy some books. I don't want to change either of those facts.

On my trip home over Thanksgiving I had time on my way out to look at books at the airport Powell's location. There were 5 employee recommendations in the business section, and I had trouble not buying them all. I came home with 4 books that day. Today I'm moving two of them into the reading list.

See if you can spot the hidden thread that ties these items all together.

Present Like a PRO
"The field guide to mastering the art of business, professional, and public speaking."
I flipped through just a couple of pages and decided to stop flipping and just move it directly into the buy pile. As soon as I finish this post I'm going to spend 1 hour curled up on the couch with it. I hope to have a review for you soon. It touches on all the parts of a business presentation, from the mechanics of audio visuals (aka slides), to practicing, to connecting with your audience. I've heard it said that, "The fear of public speaking is the number one fear of American adults." If it's not my number one fear, that is only because I have so many big fears. And I'm planning on knocking those fears off, one (or more) at a time.
How to Connect with Anyone
"96 All-New Little Tricks for Big Success in Relationships"
This was another recommended book that fell quickly into the buy pile. Flipping through the first couple of points, I was able to tell that these were tips I could follow, as they included both the high level concept, e.g. "extend eye contact", and the specific steps to get there. It's the fake-it-till-you-make-it approach of which I am so fond. I may be a "social engineer" as the marketing ladies so sweetly refer to me, but I'm still an engineer and nerd. Make a game of it. "Look at their eyes, what color are they? Not just green, but green with little flecks of black and is that gold? What is the shape? The ratio of height to width?" Now go practice. Practice on your friends. Practice in the mirror. Get to where it feels comfortable and you won't need to play these games. At that point you can actually look into their eyes just to look at them and connect and listen. 95 more tips to go.
Moose::Manual and Moose::Manual::Roles
I've let some of my basic Moose understanding sink in enough that it's time to branch into the concept of Moose Roles. These are like helpers or mix-ins in other languages. Roles are not directly instantiated, but pulled into another class. The Role interface specifies a contract on both sides. The Role must provide certain functions and can require that certain functions be present in the other class. Once loaded, the Role functions are directly accessible in the class. There are several Roles in the Hadoop::Streaming class that I wish to extend. With my presentation around the corner (Wednesday), this will pop to the top of the reading list very soon.
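
Since Roles are compared to mix-ins above, here is a rough Python analog of the two-sided contract. This is not Moose and not code from Hadoop::Streaming; in Moose the same idea is spelled with `requires` inside the role and `with` inside the consuming class. The names `Role`, `Describable`, and `Job` are invented for the illustration.

```python
class Role:
    """Tiny role-like mixin base: checks `requires` when a class consumes a role."""
    requires = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if Role in cls.__bases__:
            return  # this is a role definition, not a consumer
        missing = [name for name in cls.requires
                   if not callable(getattr(cls, name, None))]
        if missing:
            raise TypeError(f"{cls.__name__} must implement: {', '.join(missing)}")


class Describable(Role):
    """Hypothetical role: requires name(), provides describe()."""
    requires = ("name",)

    def describe(self):
        return f"job: {self.name()}"


class Job(Describable):
    """Consumer: supplies the required method and gets describe() for free."""
    def name(self):
        return "wordcount"


if __name__ == "__main__":
    print(Job().describe())
```

A class that pulls in `Describable` without defining `name` fails with a `TypeError` as soon as the class is defined, which is roughly the early failure a role's requirement list is meant to give.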

Tuesday, January 19, 2010

hadoop: restart the cluster and run a job

Following the steps in the previous hadoop post, I have a working single instance hadoop cluster on my laptop. Here's a refresher on restarting it and using it.

Start up Hadoop

#login as hadoop user:
sudo -i -u hadoop

#start cluster:

#check cluster is running via jps:
jps
30957 SecondaryNameNode
31046 JobTracker
30792 DataNode
30638 NameNode
31533 Jps
31205 TaskTracker

Upcoming Tech Events (Jan and Feb)

After the Christmas lull everything is going full bore again! Let's find out what fun tech events are coming up. Fun for the whole nerd family!


Jan 1-17

Saturday Jan 9: LiLAX Linux Users Group. Dan Kegel, Google Chrome engineer, talking about WINE (Wine Is Not an Emulator). Really wanted to hit this one, bummed I couldn't fit it in. Would have tried a little harder if I had known Dan was speaking. Recurring on the second Saturday of the month.
Tue: Hadoop Meetup: quarterly community meeting
Bar Camp San Diego Jan 16-17. Any word on a new date for BarCampLA?

Jan 18-24

Happy MLK Day!
Tuesday: 1/19 : LA Talks Tech 7pm.
Thursday: Mars, Social cachet and flexible reality.

Jan 25-31

Mon: 1/25: SXSW Interactive L.A. Networking Event 7-9pm. RSVP by Friday.
Wed: 1/27: Perl Mongers at Rubicon Project 7-9pm. Hadoop + Perl
Thu: 1/28: Twistup LA come see the area's top startups.


Feb

Tue: Feb 9: Hadoop Meetup @mahalo.
Sat: Feb 13: LiLAX: Linux Users Group (estimated; second Saturday of the month).
Feb 22: LA WebDev meetup @Santa Monica Public Library. First event of the year. I'll be on vacation. Someone tell me how it goes, thanks.

Looking for something less established where you can get in on the ground floor? Have you considered the new public hacking space in Culver City? Sean just got that opened up in late December, and there are fewer than 100 members. Get in early. You can probably even still make your own event like "Show and Tell" Friday, "Take-Apart-Tuesday" and "Sunday Craftday."

LA Perl Mongers - January 27, 2010

The website[0] is (finally) updated with information about the next LA Perl Mongers meeting, to be held Wednesday January 27, 2010.

Our first presentation will be an introduction to using the streaming interface[1] to Hadoop[2] from Perl, using Hadoop::Streaming[3]. After a brief overview of Hadoop the talk will focus on building streaming jobs and getting the necessary infrastructure in place to support them.

Our second presentation slot is open, ready for a volunteer. Otherwise I'll do a second presentation on packaging CPAN distributions using Dist::Zilla[4]. "Dist::Zilla - distribution builder; installer not included!"


Tuesday, January 12, 2010

Hack day with Kenny: Fey::ORM, testing and screen.

After sleeping through the Linux@LAX users group meeting (sorry guys), I rolled up to Kenny's (Kenny Flegal), where he had invited me for a day of coding and authentic Salvadorian food. Win Win!

I briefly showed him the topic of my upcoming Mongers presentation, but mostly we looked at his current project. He is forking a GPL-licensed project to recreate part of the functionality and extend it in a different direction. Along the way he's rewriting the app layer in Perl from command line PHP scripts.

We discussed the various clauses of the GNU Affero GPL with regard to the hosting of the project during the initial revs. Can he have a public repository before he has finished changing all references to the old name to a new name and adding "prominent notices stating that you modified it, and giving a relevant date" as per Section 5, paragraph a? We decided that he probably could, but that it'd be easier to start with a private repo and not publish until that part is done. That seems sub-optimal from a "getting the source to the people" mindset, but it is better from a "protect the good name of the original project and publishers" mindset. For a fork that won't follow upstream patches, does one just make a single prominent notice to that effect, like "forked from project XYZ on 2010-01-02"?

Along with switching from PHP to Perl, he's pulling the hard coded SQL out of the scripts and moving to an ORM. He's picked Dave Rolsky's impressive Fey ORM. This project has a ridiculously complex set of schemas, with inconsistent table names and no explicit foreign key constraints. As such, it is extra work to get the Fey schema situated.

Kenny started to give me a run through of some of the code, but with both of us on laptops it was awkward to see the code conveniently. I made him stop and set up a screen session for sharing, as described in my previous post on screen. This was more difficult than I expected, with the problem eventually being that Ubuntu 9.04 and beyond has moved /usr/bin/screen to /usr/bin/screen.real and made screen a shell wrapper. The screen multiuser ACL system requires that the screen binary be setuid (chmod +s). With this setup we needed to make screen.real setuid rather than the wrapper. That took a while to notice.

Once we had a shared session open, it was much easier for him to give me a guided tour of the codebase and database/sql setup. Once that was clear it was time to get some code started. He showed me some of the Fey::ORM model code and how he was migrating over the individual sql statements to the ORM. He had been plugging away on the model code for a while, starting by creating a comment for every line of sql in the application including the file and line of the caller.

The next step was clear: we needed some tests. We set to work getting an initial test of the model code. First we installed Fey::ORM::Mock as a mock layer. This works at a higher level than a standard DBD::Mock interface to allow better testing of the Fey::ORM features. The test didn't pass at first due to missing data in the mock object, so we grabbed a list of the fields that mapped to DB fields and started adding values to get past constraint failures on the data. Once we had a minimal set of data, we started to see problems with the ORM schema description. The lack of well defined foreign key constraints meant we needed to explicitly define that structure for the ORM. More boilerplate code into the model. We repeated this test-update-repeat cycle a few more times, adding more data linkage descriptions.

I took a brief break from our pairing and jumped to a different screen to install some goodies. I grabbed a copy of the configuration files from the December talk and started updating his config. He didn't have a .vimrc, .vim or .perltidyrc on this brand new dev box, so I pulled those in from the repo. I showed him how much time using ":make" in vim could slice off his build/test cycle, and he was super excited. (OK, not until the third or fourth try, but he eventually got the hang of it.)

To get around some issues in code placement, I modified the .vimrc and .vim/ftplugin/compiler code to add -MFindBin::libs to the calls to perl -c and prove. This allowed the parent libs/ directory to be found for these non-installed modules. This is a bit of a hack and I'll get it removed as we move closer to an initial release and pick a packaging tool, possibly Dist::Zilla.

An open question is the speed of Fey::ORM. It takes a big startup hit while building the models from the schema and interacting with the database. This is supposed to lead to a big speed gain during runtime from aggressive caching of that information. All I know for certain is that the compile-run-test cycle was really slow. This is my first time using Fey so I don't know how this plays out normally. It could just be that the number of crosslinked tables in the db config was causing additional slowdowns.

By this point we had already had two delicious meals of Salvadoran cuisine and it was approaching midnight. The first meal was home cooked fried (skinless) chicken for lunch and the second was pupusas at an excellent local place in Van Nuys. I was all coded out, which made for a perfect transition to the party at Andy Bandit's that night, conveniently just 6 miles from Kenny's.

All in all, a fine Saturday.

Thursday, January 7, 2010

Perl Iron Man challenge -- is the cron still running?

Am I linking to the correct image files?
In Matt's announcement, he said which image files to link to. I see all of those images as having a modified date of 2009-10-03. Perhaps that is just a trick of the linked files, as that is the modified date for the live image files. All of the CSV files seem to be really old too. Is that not the path that the CSV cron job is using?

Now I'm wondering whether I really missed a day or two in there, where I pushed it to 10 days when I was slacking (for weeks on end) by doing work-work instead of blogging, or whether I really have hit IRON MAN status after 6 months of blogging. I thought today was 6 months, but it's actually 7 months! I still show up as Bronze Man status. I hope I didn't push too many 10 day windows in a row and risk not hitting the 4 posts per 32 days rolling window. (I don't think I knew that was a requirement before reading the rules just now. Well, I must have known that at some point, as I made a post about it many moons ago.)

[andrew@mini]% date -d '2009-06-06 + 7 months'
Wed Jan 6 00:00:00 PST 2010
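
To see how the two rules interact, here is a small Python sketch that checks a posting history against my reading of them: no more than 10 days between posts, and at least 4 posts in every 32 day window. This is not the real Iron Man checker, and the function names and the made-up schedules are assumptions for illustration. Only the 2009-06-06 start date comes from the command above.

```python
from datetime import date, timedelta


def longest_gap_days(posts):
    """Largest number of days between two consecutive posts."""
    ordered = sorted(posts)
    return max((later - earlier).days
               for earlier, later in zip(ordered, ordered[1:]))


def fewest_posts_in_window(posts, window_days=32):
    """Smallest post count in any window_days-long window inside the history."""
    ordered = sorted(posts)
    fewest = len(ordered)
    start = ordered[0]
    while start + timedelta(days=window_days - 1) <= ordered[-1]:
        end = start + timedelta(days=window_days)
        count = sum(1 for day in ordered if start <= day < end)
        fewest = min(fewest, count)
        start += timedelta(days=1)
    return fewest


def every(n_days, count, first=date(2009, 6, 6)):
    """Hypothetical schedule: `count` posts spaced n_days apart."""
    return [first + timedelta(days=n_days * i) for i in range(count)]


if __name__ == "__main__":
    for label, posts in (("weekly", every(7, 10)),
                         ("every 10 days", every(10, 10))):
        print(label, longest_gap_days(posts), fewest_posts_in_window(posts))
```

Under this reading, posting exactly every 10 days never breaks the gap rule but still leaves some 32 day windows with only 3 posts, which is exactly the risk I was worried about.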

Tuesday, January 5, 2010

CPAN upload, ONE!

My very first CPAN upload is happening *RIGHT NOW*. go go gadget CPAN upload! And now off to sleep, really.

2010-01-05 13:26:15 $$23346 v1048: Info: Need to get uriid[S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz] (paused:333)
2010-01-05 13:26:15 $$23346 v1048: Info: Going to fetch uriid[S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz] (paused:619)
2010-01-05 13:26:15 $$23346 v1048: Info: Requesting a GET on uri [] (paused:641)
2010-01-05 13:26:16 $$23346 v1048: Debug: ls[-rw-rw-r-- 1 root root 16457 2010-01-05 13:26 /home/ftp/tmp/S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz
]zcat[/bin/zcat]tpath[/home/ftp/tmp/S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz]ret[]stat[2057 470407906 33204 1 0 0 0 16457 1262694376 1262694375 1262694375 4096 40]: No child processes (paused:696)
2010-01-05 13:26:16 $$23346 v1048: Info: Got S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz (size 16457) (paused:492)
2010-01-05 13:26:18 $$23346 v1048: Info: Sent 'has entered' email about uriid[S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz] (paused:555)
2010-01-05 13:27:43 $$23346 v1048: Info: Verified S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz (paused:304)
2010-01-05 13:27:43 $$23346 v1048: Info: Started mldistwatch for lpath[/home/ftp/pub/PAUSE/authors/id/S/SP/SPAZM/Hadoop-Streaming-0.100050.tar.gz] with pid[24954] (paused:309)
2010-01-05 13:27:53 $$23346 v1048: Debug: Reaped child[24954] (paused:64)

I think it should show up soon on my cpan page or perhaps the Hadoop::Streaming search.cpan page?

Hadoop Streaming - running a job

Logs from running a hadoop streaming job in a freshly set up environment
  1. copy input files
  2. run streaming jar (now located in contrib/streaming)
  3. hope.

1) copy input files to dfs:
hadoop dfs -copyFromLocal examples/wordcount wordcount

2) run streaming job:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
-input wordcount \
-output wordcountout \
-mapper examples/wordcount/ \
-reducer examples/wordcount/

packageJobJar: [/home/hadoop/tmp/hadoop-hadoop/hadoop-unjar5876487782773207253/] [] /tmp/streamjob4555454909817451366.jar tmpDir=null
10/01/05 03:29:34 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/05 03:29:35 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/tmp/hadoop-hadoop/mapred/local]
10/01/05 03:29:35 INFO streaming.StreamJob: Running job: job_201001050303_0003
10/01/05 03:29:35 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:29:35 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0003
10/01/05 03:29:35 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0003
10/01/05 03:29:36 INFO streaming.StreamJob: map 0% reduce 0%
10/01/05 03:30:19 INFO streaming.StreamJob: map 100% reduce 100%
10/01/05 03:30:19 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:30:19 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0003
10/01/05 03:30:19 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0003
10/01/05 03:30:19 ERROR streaming.StreamJob: Job not Successful!
10/01/05 03:30:19 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Sigh. Failure. Time to debug the job. It definitely needs the -file flag to bundle the executables and ship them to the remote machines.

Step back and run a simpler job.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar -input wordcount -output wordcountout3 -mapper /bin/cat -reducer /bin/wc

10/01/05 03:36:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0004
10/01/05 03:36:17 ERROR streaming.StreamJob: Job not Successful!
10/01/05 03:36:17 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

This is the example code. Why did it fail again? Is there a problem with my input file?

I'll pick this back up tomorrow (or later). A little birdie just told me it was 3:40am. Well past my bedtime. I'd like to finish this up, but that's what I've been saying since I started at midnight. I have the Hadoop::Streaming::Mapper and ::Reducer modules all packaged up and ready to make an initial push to CPAN, but first I need to get an example running under hadoop. I did finish writing the tests for the non-hadoop case, and those are clean and ready. Feel free to follow along at my github repository.


Or maybe I'll just try a few more times...

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar -input wordcount -output wordcountout7 -mapper -reducer -file examples/wordcount/ -file examples/wordcount/

packageJobJar: [examples/wordcount/, examples/wordcount/, /home/hadoop/tmp/hadoop-hadoop/hadoop-unjar390944251948922559/] [] /tmp/streamjob7610913425753318391.jar tmpDir=null
10/01/05 03:59:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/05 03:59:11 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/tmp/hadoop-hadoop/mapred/local]
10/01/05 03:59:11 INFO streaming.StreamJob: Running job: job_201001050303_0010
10/01/05 03:59:11 INFO streaming.StreamJob: To kill this job, run:
10/01/05 03:59:11 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201001050303_0010
10/01/05 03:59:11 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001050303_0010
10/01/05 03:59:12 INFO streaming.StreamJob: map 0% reduce 0%
10/01/05 03:59:23 INFO streaming.StreamJob: map 100% reduce 0%
10/01/05 03:59:35 INFO streaming.StreamJob: map 100% reduce 100%
10/01/05 03:59:38 INFO streaming.StreamJob: Job complete: job_201001050303_0010
10/01/05 03:59:38 INFO streaming.StreamJob: Output: wordcountout7


hadoop dfs -ls wordcountout7;

Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-01-05 03:59 /user/hadoop/wordcountout7/_logs
-rw-r--r-- 1 hadoop supergroup 125 2010-01-05 03:59 /user/hadoop/wordcountout7/part-00000

hadoop dfs -cat wordcountout7/part*

apple 2
bar 2
baz 1
c 1
c++ 2
cpan 9
foo 2
haskell 4
lang 1
lisp 1
ocaml 2
orange 2
perl 9
python 1
ruby 4
scheme 1
search 1
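
For debugging without a cluster, the map, sort, reduce flow of a streaming job can be simulated in a single process. The sketch below is a Python stand-in, not the Perl mapper and reducer from the Hadoop::Streaming examples. The function names and the sample input are made up, and it assumes the usual streaming convention of one tab-separated key/value record per line, with the reducer seeing its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records sharing a key (input must be sorted)."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate the streaming pipeline: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["perl cpan perl", "ruby haskell", "cpan perl"]  # made-up input
    print("\n".join(run_locally(sample)))
```

In a real streaming job the mapper and reducer are separate executables that read stdin and write stdout, and Hadoop does the sort between them. If the logic works in a local simulation like this but the job still fails, the problem is more likely in how the scripts are shipped and invoked than in the counting itself.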