Tuesday, January 17, 2012

The Grail of Efficiency: premature optimization

Premature optimization is the root of all evil ... most of the time. You're only going to know that it's time to optimize after you've built something. This doesn't let us off the hook for building slow systems, or for adding nonessential complexity to our solutions.

Build it first, then measure, then improve. And P.S.: you're bad at guessing what's wrong.

I like this longer version of Knuth's quotation (though he claims he didn't coin the phrase) better than the one that shows up on the Wikipedia page on program optimization.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
-- Donald Knuth, ACM: Structured Programming with go to Statements
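
For a feel of what "measure, then improve" looks like in practice, here is a minimal sketch using Python's standard cProfile and pstats modules. The workload and its two steps (build_index, find_duplicates) are made-up stand-ins for "the system you already built", not code from any real project. The point is only that the profiler's report, not intuition, tells you which step deserves attention.

```python
import cProfile
import io
import pstats


def build_index(words):
    """Hypothetical step: map each word to the positions where it occurs."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(word, []).append(position)
    return index


def find_duplicates(words):
    """Hypothetical step: naive quadratic scan for repeated words."""
    return sorted({w for i, w in enumerate(words) if w in words[:i]})


def workload():
    """Stand-in for the program you have already built."""
    words = [f"w{i % 500}" for i in range(3000)]
    build_index(words)
    find_duplicates(words)


def profile_report(func, limit=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(workload))
```

Read the cumulative-time column before touching anything. Only the functions at the top of that list belong to Knuth's "critical 3%".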

Wednesday, January 11, 2012

"Big Data" book by Nathan Marz

Big Data: Principles and best practices of scalable realtime data systems by Nathan Marz and Samuel E. Ritchie is now available in MEAP/Roughcut edition from Manning.

The early access edition of my book "Big Data" is now available. Use code bd50 for 50% off http://www.manning.com/marz/
-- @nathanmarz on twitter
Questions about the book? Ask me here: news.ycombinator.com/item?id=3444300
-- @nathanmarz on twitter

Nathan tweeted a discount code, 'bd50', for 50% off, which is a nice bonus. I ordered my copy, an ebook+print bundle, on Monday. Looking forward to digging in and updating you all with a review. Excited to read the appendix on Storm.

Table of Contents:

  1. A new paradigm for Big Data - FREE
  2. Data model for Big Data - AVAILABLE
  3. Data storage on the batch layer
  4. MapReduce and batch processing
  5. Batch processing with Cascading
  6. Basics of the serving layer
  7. Storm and the speed layer
  8. Incremental batch processing
  9. Layered architecture in-depth
  10. Piping the system together
  11. Future of NoSQL and Big Data processing
    • Appendix A: Hadoop
    • Appendix B: Thrift
    • Appendix C: Storm

Tuesday, January 10, 2012

Building a jabber bot with Bot::Backbone

There are many ways to write a (jabber/xmpp) chat bot. A quick search for "Jabber Bot" turns up Net::Jabber::Bot, Bot::JabberBot, Bot::Backbone::Service::JabberChat, Bot::Jabbot, IM::Engine, AnyEvent::XMPP and more. You'll see even more if you search for XMPP instead.

How to choose?

I started with Net::Jabber::Bot, but ran into problems connecting to Google Talk based chat. Jabber modules built on AnyEvent::XMPP (seem to) handle Google's authentication better than those built on Net::XMPP. This is moot now that I have a normal jabber server, but it still tripped me up. The POD doc lays out funny, so I patched it and filed a change request on GitHub.

Knowing that I would eventually want an event-loop based app, I focused on AnyEvent::XMPP packages. Bot::Backbone and IM::Engine are both interesting abstractions. IM::Engine is, quote, "currently alpha quality with serious features missing and is rife with horrible bugs." Perhaps I should be happy that the author admits this up front? On to Bot::Backbone and sugary Moose!

Once I realized that the group_domain wouldn't be automatically filled in as "conference.$domain" in Bot::Backbone::Service::JabberChat, I was able to connect to a group chat room on my jabber server!

Follow along with my bot, App::Sulla, on GitHub. Thus far I have a working connection that responds to "!time" requests -- I've even fixed the part where it threw warnings on all other messages!

# this code in my dispatch table:
    also not_command not_to_me run_this {
            my ( $self, $message ) = @_;
            respond { "hello world" };
    };

# led to the following error on every other chat message:
unhandled callback exception on event (message, AnyEvent::XMPP::Ext::MUC=HASH(0x3746818), AnyEvent::XMPP::Ext::MUC::Room=HASH(0x3a6d700) ANY_STRING ): Can't call method "add_predicate_or_return" on an undefined value at perl5/lib/perl5/Bot/Backbone/DispatchSugar.pm line 124.

I decided I wanted to pass the auth and channel information into App::Sulla as parameters to the constructor. This didn't play very well with the "service" sugar: "service" executes at compile time and squirrels away each service's configuration hash in "services", before any constructor arguments exist. To get around that, I added a "before" modifier to construct_services, which is called during run just before each service is initialized.

Monday, January 2, 2012

Storm is upon us!

A storm (from Proto-Germanic *sturmaz "noise, tumult") is any disturbed state of an astronomical body's atmosphere, especially affecting its surface, and strongly implying severe weather.
-- wikipedia:storm

Storm is open-source: distributed and fault-tolerant realtime computation
-- Nathan Marz via twitter

Twitter has open-sourced Storm, a distributed real-time processing engine (Announcement, GitHub). Hadoop is to batch as Storm is to stream. This is huge. So huge that it seems to be mostly ignored?!
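
To make the batch/stream analogy a bit more concrete, here is a minimal sketch in plain Python. It uses neither Hadoop nor Storm, and the function names are invented for illustration. It only contrasts the two shapes of computation: a batch job sees the whole dataset and returns one answer at the end, while a stream job produces an updated answer after every message.

```python
from collections import Counter


def batch_word_count(documents):
    """Batch style: consume the complete dataset, then return one final result."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return dict(counts)


def stream_word_count(messages):
    """Stream style: yield a snapshot of the running counts after each message."""
    counts = Counter()
    for msg in messages:
        counts.update(msg.split())
        yield dict(counts)


if __name__ == "__main__":
    data = ["storm is coming", "storm is here"]
    print(batch_word_count(data))
    for snapshot in stream_word_count(data):
        print(snapshot)
```

The final snapshot from the streaming version matches the batch result. The difference is that the stream version has a usable answer at every step along the way, which is roughly the gap Storm is aiming at.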

Storm's primary author, Nathan Marz, is the same guy who created Cascalog (a marriage of Clojure and Datalog for running queries on Hadoop) and ElephantDB. You'll remember he was working at BackType (he was their first non-founder hire. Good choice, guys!). Twitter bought BackType, and now they are sharing Storm with us. Thanks, Twitter!

Since Storm's release seems to have flown in under the radar, and "storm" is a difficult, generic search term, I've collected links to the relevant resources.

Resources & Announcements:

• StrangeLoop/InfoQ presentation
  http://www.infoq.com/presentations/Storm
  Nathan open-sourced Storm in the middle of his StrangeLoop 2011 presentation.
  The slides below the image update as the video plays.
• Storm slides
  http://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation
• StrangeLoop 2011 video links
  https://thestrangeloop.com/news/strange-loop-2011-video-schedule
• A Storm is Coming: more details from Twitter
  http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
• Storm source on GitHub
  https://github.com/nathanmarz/storm
• Storm Wiki on GitHub
  Wiki Front Page
  Rationale
  Tutorial
• Mailing list
  http://groups.google.com/group/storm-user
• 0.6.1 released
  http://groups.google.com/group/storm-user/browse_thread/thread/72b3d4a4aebdebea
• Testing Storm Topologies (in Clojure)
  http://www.pixelmachine.org/2011/12/17/Testing-Storm-Topologies.html
• Overview of Storm presentation at BashoChats 001
  http://basho.com/blog/technical/2011/12/20/Basho-Chats-001-Talk-Videos/
• BackType tech talks
  http://tech.backtype.com/pages/presentations-8