Netflix has a Chaos Monkey: A process to randomly kill things so as to engineer for failure

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

On the face of it, it seems insane. Why would you intentionally kill parts of your site? Yet in practice, being able to handle failure consistently means you are never ever surprised.
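
The concept fits in a few lines. A minimal sketch in Ruby, where instances and terminate are hypothetical stand-ins for your real inventory and kill mechanism (this is the idea, not Netflix's actual implementation):

# Sketch only: pick a random victim from production and kill it.
def unleash_chaos_monkey(instances)
  victim = instances.sample        # any instance, any time
  puts "Chaos Monkey is terminating #{victim.id}"
  victim.terminate                 # hypothetical API call
end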

Edit: My friend Vinny Magno writes:

Automobile assembly lines started doing the same thing (I think Ford was the first, though I might be wrong).

A single line can produce multiple models back to back. So for example, a Focus may be followed by an Explorer and then an Escape. At one point, the chassis and body are aligned and welded together. Obviously welding a Focus body to an Explorer chassis would be a problem, but the automation is so good that the defect rate was incredibly low. Since such a defect happened so infrequently, operators became complacent and failed to catch it the times that it did occur. To solve the problem, Ford purposefully upped the error rate to keep the operators sharp.

Turns out there's precedent!

ActiveRecord Table Transform (or, how to write to the db 27,000 times in 24 seconds instead of 9 minutes)

I implemented my new scheme and running time went from 9 minutes to 24 SECONDS. I liked this approach so much I decided to generalize it as ActiveRecord::Base.transform. Sample usage:

# if users don't have names, give them a random one
NAMES = ['Adam', 'Ethan', 'Patrick']
User.transform(:name, :conditions => 'name is null').each do |i|
  i.name = NAMES[rand(NAMES.length)]
end

Really interesting use of temp tables here.
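
The transform source isn't shown here, but the temp-table pattern it generalizes looks roughly like this (the user_updates staging table and the updates hash are hypothetical):

# Rough sketch of the temp-table trick, not the actual transform source.
updates = { 1 => 'Adam', 2 => 'Ethan' }   # collected id => new_name pairs
conn = ActiveRecord::Base.connection
conn.execute("CREATE TEMPORARY TABLE user_updates (id INT PRIMARY KEY, name VARCHAR(255))")

# one multi-row INSERT instead of thousands of single-row statements
rows = updates.map { |id, name| "(#{id}, #{conn.quote(name)})" }.join(', ')
conn.execute("INSERT INTO user_updates (id, name) VALUES #{rows}")

# one UPDATE joins the staged values back into the real table (MySQL syntax)
conn.execute(<<-SQL)
  UPDATE users
  INNER JOIN user_updates ON users.id = user_updates.id
  SET users.name = user_updates.name
SQL

A handful of round trips instead of 27,000 individual UPDATEs is presumably where the 9-minutes-to-24-seconds win comes from.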

Internationalized Domain Names: Holy crap all of these work?

I had no idea. Posterous support for these bad boys coming soon.

Gruber's URL Regular Expression Explained

While America threw on its eating pants and combed the Thursday circulars for deals, John Gruber spent Thanksgiving preparing to unveil his regular expression for finding URLs in arbitrary text.

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

Pretty dense. Let's be that guy and break it out, /x style (ignoring whitespace, with comments):

\b                          #start with a word boundary
(                           #
    (                           #
        [\w-]+://?                  #protocol://, second slash optional
        |                           #OR
        www[.]                      #a literal www. 
    )                           #
        [^\s()<>]+                #not whitespace, parens, or angle brackets
                                #repeated any number of times
    (?:                         # (end game)
        \([\w\d]+\)                 #handles balanced parentheses in URLs (http://example.com/blah_blah_(wikipedia))
                                    #won't handle parens twice: foo_(bar)_(and)_again
        |                           #OR
        (                           #
            [^[:punct:]\s]              #NOT a single punctuation or white space
            |                           #OR
            /                           #trailing slash
        )                           #
    )                           #
)                           #
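
Ruby's regex engine supports the [[:punct:]] POSIX class, so you can kick the tires on the pattern directly (example strings are mine):

URL_PATTERN = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))}

text = "See http://example.com/blah_blah_(wikipedia) or www.example.com, ok?"
text.scan(URL_PATTERN).map(&:first)
# => ["http://example.com/blah_blah_(wikipedia)", "www.example.com"]

Note the trailing comma after www.example.com gets dropped, which is exactly what the end game group is for.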

Amazing writeup by Alan Storm on Gruber's autolinking regex.

Update: Gruber updated this with an even more awesome regex, with a breakout explanation

How #newtwitter uses hashbangs (#!) to let Google crawl AJAX content

If you're a Twitter user, logged-in, and have Javascript, you'll be able to see my profile here:

However, Googlebot will recognize that as a URL in the new format, and will instead request this URL:

Sensibly, Twitter want to maintain backward compatibility (and not have their indexed URLs look like junk) so they 301 redirect that URL to:

(And if you're a logged-in Twitter user, that last URL will actually redirect you back to the first one.)

seomoz.org, dependable as usual.
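
For reference, the #! trick rides on Google's AJAX crawling scheme: Googlebot rewrites everything after the #! into an _escaped_fragment_ query parameter that the server can answer statically. A sketch of that rewrite in Ruby (placeholder username; the full spec also percent-encodes the fragment value):

def escaped_fragment_url(url)
  # Googlebot turns "#!/foo" into "?_escaped_fragment_=/foo"
  url.sub(/#!(.*)\z/) { "?_escaped_fragment_=#{$1}" }
end

escaped_fragment_url("http://twitter.com/#!/someuser")
# => "http://twitter.com/?_escaped_fragment_=/someuser"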

God bless @scribd, they've released open source code to solve that godawful Flash wmode z-indexing bug.

The dreaded problem that most web developers come across is the z-index issue with flash elements. When the wmode param is not set, or is set to window, flash elements will always be on top of your DOM content. No matter what kind of z-index voodoo you attempt, your content will never break through the flash. This is because flash, when in window mode, is actually rendered on a layer above all web content.

There is a lot of chatter about this issue, and the simple solution is to specify the wmode parameter to opaque or transparent. This works when you control and deliver the flash content yourself. However, this is not the case for flash ads.

...

So, to solve all this, I wrote some javascript that will dynamically add the correct wmode parameter. I call it FlashHeed. You can get it now on the GitHub repo.

It works reliably in all major browsers, and has no dependencies, so feel free to drop it into your Prototype or jQuery dependent website.
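
When you do control the markup yourself, the fix from the quote is just emitting wmode up front. A hypothetical Rails-style helper, not part of FlashHeed:

# Emits an embed tag with wmode set, so the player participates in
# normal z-ordering instead of painting above all DOM content.
def flash_embed(src, width, height, wmode = 'transparent')
  %(<embed src="#{src}" type="application/x-shockwave-flash" ) +
    %(width="#{width}" height="#{height}" wmode="#{wmode}"></embed>)
end

FlashHeed is for the case where you can't do this, i.e. flash ads whose markup somebody else injects.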

Impressive tech talk on Scaling MySQL at Facebook

Facebook gave a MySQL Tech Talk where they talked about many things MySQL, but one of the more subtle and interesting points was their focus on controlling the variance of request response times and not just worrying about maximizing queries per second.

But first, the scalability porn. Facebook's OLTP performance numbers were, as usual, quite dramatic:

  • Query response times: 4ms reads, 5ms writes. 
  • Rows read per second: 450M peak
  • Network bytes per second: 38GB peak
  • Queries per second: 13M peak
  • Rows changed per second: 3.5M peak
  • InnoDB disk ops per second: 5.2M peak

This looks to be a must-see tech talk for anyone on the scaling / infrastructure side of the web.

What if Foursquare wasn't using MongoDB and was using something else?

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they're focusing on the technical problems with Foursquare's current setup. They're picking it apart and looking at what is right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak? Cassandra? HBase? MySQL?

Jeremiah Peschka fills us in.

Finally, timezones in javascript done right

The world has been made a better place via this fine javascript library. Timezones have always been a bit of a tricky thing to handle. 

The author says:

You can't have a named time zone in javascript (for example, eastern time or central time); you can only have a time zone offset, represented as universal time (UTC) minus local time in minutes, by calling dateVariable.getTimezoneOffset(). It means that if the time zone offset is -1 hour from UTC, javascript will give you 60. Why is it inverted in javascript? I have no idea.

In winter, it's always standard time. In summer, it's daylight saving time, which is standard time minus 60 minutes… but not for every country. Plus, summer and winter are inverted in the southern hemisphere. That's a lot of exceptions, and that's why I created the jsKata.timezone library.
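
The inversion is easier to see next to Ruby, whose Time#utc_offset reports seconds east of UTC rather than minutes behind it:

t = Time.now
ruby_offset_minutes = t.utc_offset / 60   # e.g. -60 when you're at UTC-1
js_style_offset = -ruby_offset_minutes    # what getTimezoneOffset() reports: 60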

Find it on github.