Sunday, February 27, 2011


Sweet, a new module popped onto CPAN this weekend: Thrift::XS, an XS version of Thrift. This is doubly nice. One, it's a faster XS version. Two, it's available directly on CPAN, whereas the current module from the Apache Thrift project requires finding and downloading the package yourself.

Thrift is a streaming serialization format. See also Protocol Buffers and Avro.


Thrift::XS provides faster versions of Thrift::BinaryProtocol and Thrift::MemoryBuffer. On average it is about 4-6 times faster.

Thrift compact protocol support is also available; just replace Thrift::XS::BinaryProtocol with Thrift::XS::CompactProtocol.

To use, simply replace your Thrift initialization code with the appropriate Thrift::XS version.

Scale9x: Take Advantage of Modern Perl

Chromatic's talk on Modern Perl at Scale9x is in about an hour -- 11:30am, Sun Feb 27, 2011. If you can't make it, at least check out the live stream.

I really shouldn't have gone to SCALE yesterday, since I'm so sick and it wiped me out. Yet here I am contemplating going again today. I do want to get my copy of Modern Perl autographed, after all.

Perl's recent renaissance has produced amazing tools that you too can use today.

This talk explains the philosophy of language design apparent in Perl 5 along the two fundamental axes of the language: lexical scoping and pervasive value and amount contexts. It also discusses several important pragmas and language extensions to improve Perl 5's defaults, to reduce the chance of errors, to allow better abstractions, and to encourage the writing of great code.

Speaker: chromatic x

Friday, February 25, 2011

Mining of Massive Datasets textbook.

I started reading Mining of Massive Datasets on vacation. I didn't get very far into it, as it isn't exactly light beach reading. The first bit is a review covering things I mostly don't know, so that was a fun start. I now have a better feeling for IDF and TF.IDF, for instance.

Infolab seems down at the moment.

Mining of Massive Datasets.

new business opportunities

[This business opportunity] is a wide open space with lots of people jumping into the pool without knowing how to swim.
We should be able to make a mint selling life preservers.

Thursday, February 10, 2011

LA Hadoop

Great attendance at the Los Angeles Hadoop Users Group (LA HUG) meetup last night on "Productizing Hadoop." Cloudera provided a great speaker to discuss the dos and don'ts of migrating Hadoop from play/development to full enterprise mode (from hunter-gatherer to modern city). The Hadoop infrastructure has come a long way since my first LA Hadoop meetup over a year ago -- better support for multi-tenancy with auth and authz, more tools built on top of Hadoop, and less need to roll your own scripts for everything.

Props to Shopzilla for hosting.

This was a much shyer crowd than we see at LA Perl Mongers (LA.PM). Only one other person asked a question at the end. At PM, we tend to pepper the speaker with questions and feedback throughout the presentation, making everything a group production.

Cpanm 1.1 -- now with mirror support!

There is a new version of cpanm (App-cpanminus) that supports --mirror and --mirror-only to allow offline usage.

Kick ass! Thanks again, miyagawa!

cpanm 1.1 is shipped, and with `--mirror-only` option, you can use it with your local minicpan mirror, or your own company's CPAN index (aka DarkPAN).

The only reason a few experienced Perl programmers who love cpanm couldn't use it offline or at work was the lack of proper mirror index querying support.

cpanm has always required an internet connection to resolve module names and dependencies, and has always relied on CPAN Meta DB to query the package index.

It's been a fair requirement for 95% of the usage, but again, for an experienced hacker who spends most of their time on airplanes hacking code on a laptop, offline support that falls back to a local minicpan would be really nice. (Even though many airlines nowadays provide in-flight Wi-Fi :))

So a while ago I opened a bug to support a `--mirror-only` option that bypasses these internet queries and parses the mirror's own 02packages.details.txt.gz file for module resolution, and a couple of people have tried implementing it in their own branches. (Thank you!)

Today I merged one of those implementations and improved it a little to make it run even faster and be more network efficient. It is really simple to use: just run cpanm with options such as:

cpanm --mirror ~/minicpan --mirror-only Plack

and it will use your local minicpan mirror as the only place to resolve module names and download tarballs from. (TIP: you can alias this as `minicpanm` to save typing.)
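For what it's worth, here is a minimal sketch of that tip, something you might drop into your shell startup file. It assumes your mirror lives in ~/minicpan as in the command above. I've written it as a shell function rather than a literal alias, which is my own choice so that it also works inside scripts; the name `minicpanm` is the one suggested in the tip.

```shell
# Wrapper for offline installs from a local minicpan mirror.
# Assumption: the mirror directory is ~/minicpan (change the path to match yours).
minicpanm() {
    # Pass every argument (module names, extra flags) straight through to cpanm.
    cpanm --mirror "$HOME/minicpan" --mirror-only "$@"
}
```

After that, `minicpanm Plack` should behave like the longer command line above.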