The Solved Mystery of the Crashing Starling

Starling is the simple queue server that backs Twitter, and Workling is the great queue framework by Rany Keddo that is built on top of Starling (and works with other queue servers, e.g. RabbitMQ). It lets you throw a job on a queue server and have something consume those jobs in a different process or on a different machine. This is great for long-running tasks like sending emails or creating denormalized data, which is what we use Starling/Workling for.
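
For context, here's roughly what that pattern looks like. The worker and method names below are hypothetical, and this assumes the asynch_ calling convention from Workling's README:

    class EmailWorker < Workling::Base
      # Runs in the worker process, with the job's options hash that
      # was pulled off the Starling queue.
      def deliver_welcome(options)
        UserMailer.deliver_welcome_email(options[:user_id])
      end
    end

    # In a Rails action, the asynch_ prefix enqueues the job on Starling
    # instead of running it inline.
    EmailWorker.asynch_deliver_welcome(:user_id => @user.id)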

But we ran into catastrophic crashes with Starling (memory use would bloat to 1 GB of RAM over the course of a day, and it would stop responding to queries). At first we were confused, and a restart of Starling would always solve the problem. This was fine, since jobs that hadn't been completed get flushed to disk and re-read on restart. Starling is absurdly simple, actually -- it's just a modified memcached server that lets you "throw a job on there": it stores a job id and a hash of parameters, and that's about it. Using the memcache 'get' command, the Workling client de-queues the job (which removes it from the queue), and using the memcache 'set' command your Rails process can put one on there. First in, first out.

After a while, we realized the problem was actually with the way we used Starling/Workling. Workling has a "return store" that allows you to pass a message back from your running queue that something is "done" -- or it can be used for progress as well.

The offending code was this line, inside each of our asynchronous workers:
    Workling.return.set(options[:uid], "DONE")   

This looked innocuous enough. You want to be able to know when a given offline job is done, right? However, the Starling return store actually ends up just queuing another "job", keyed uniquely on options[:uid]. But we weren't ever de-queueing that job -- for many of our jobs we never retrieved the status, so it was stored and never destroyed. Starling was getting absolutely filled with these one-off return jobs (tens of thousands of them) over the course of a day. After a day or so, it would just give up the ghost.

We swapped the code to this, and we've now had 100% uptime for weeks and weeks. Whenever we actually want the done status, we also pass the :set_return_store flag as true.
    Workling.return.set(options[:uid], "DONE") if options[:set_return_store]   
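
The other half of the fix is making sure every return-store write has a matching read. A sketch of how the status check can look -- hedged: this assumes the asynch_ call returns the job's uid and that the return store exposes a matching get, per the Workling README:

    # Enqueue with the flag so the worker writes "DONE" on completion.
    uid = EmailWorker.asynch_deliver_welcome(:user_id          => @user.id,
                                             :set_return_store => true)

    # Later, e.g. in a status action: reading the value de-queues it,
    # so Starling doesn't fill up with one-off return jobs.
    status = Workling.return.get(uid)   # => "DONE" once the job finishes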

Thank God for open source -- if we couldn't dig through the source, we'd never have been able to truly understand what was happening. That's one of the biggest lessons of working deeply with open source: search Google, but when nothing comes up, be sure to read the source and understand what's really happening. Nobody's going to hold your hand through this.

XSS dangers: A classic reason to use image_tag instead of IMG tags in your Rails view code.

This code seems fine in rails view code, right?

<img src="<%=@image%>" alt='<%=label%>'>

Wrong. What if someone drops this into your user generated content label?

"onmouseover=alert(document.cookie)

Bad times ensue, because users can inject their own evil JS that actually BREAKS OUT of the tag and runs cross-site scripts. Always use helpers like link_to and image_tag when dealing with user generated text. They automatically escape text that would otherwise break out of the attribute string. Special thanks to @Stephen on twitter for being a great security guru.
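
The safe version of that view code uses image_tag, which HTML-escapes its attribute values (the injected quote becomes &quot; instead of closing the attribute):

    <%= image_tag @image, :alt => label %>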

The response time fallacy: It's how many, not how fast. Duh.

So, Marissa ran an experiment where Google increased the number of search results to thirty. Traffic and revenue from Google searchers in the experimental group dropped by 20%.

Ouch. Why? Why, when users had asked for this, did they seem to hate it?

After a bit of looking, Marissa explained that they found an uncontrolled variable. The page with 10 results took .4 seconds to generate. The page with 30 results took .9 seconds.

Half a second delay caused a 20% drop in traffic. Half a second delay killed user satisfaction.
--Why Front End Performance Matters to Everyone, Not Just the High Traffic Giants via drunkenfist.com
   
This is a classic case of correlation being mistaken for causation. How about 30 results overwhelming the user? Too much scrolling = forget it, I don't care anymore. Oh God, the Internet is so big. I'll never find what I'm looking for. That is what users think when you present them with three times the number of search results they're used to. That's the real explanation, not the response-time one, which is superficial at best.

Response time is very important. Of course it is. But superior user experience design will trump negligible performance optimizations every day of the week.

RSpec's should_receive doesn't act quite the way you would guess!

For anyone more experienced with RSpec, this would be pretty obvious, but it took me some time to realize this was what was causing some crazy behavior that I just plain could not understand.

I'm writing RSpec tests for some code that lives in static (class) methods. For illustration purposes, let's say it looks like this:


module Posterous
  class Foo

    def self.do_something
      do_something_delegate
    end

    def self.do_something_delegate
      puts "Hello world!"
      Posterous::Foo.complicated_stuff
    end

    def self.complicated_stuff
      # ...
    end

  end
end
 
So then I write a simple RSpec test that does this:

 it "should call do_something and its delegate" do 
   Posterous::Foo.should_receive(:do_something_delegate)
   Posterous::Foo.should_receive(:complicated_stuff)
   Posterous::Foo.do_something
 end
 
So what do you expect will happen? I expected that the test would pass -- in theory, nothing about the term should_receive screams to me that the method invocation would change at all.

That's just not how it works, however. Because should_receive is part of the Spec::Mocks lib of RSpec, it actually DOES cause the method do_something_delegate to act like a mock. A mock is an actual stand-in for the real method, and since we don't define a return value, it simply returns nil. We could return a value here by calling Posterous::Foo.should_receive(:do_something_delegate).and_return("my value here").

So do_something_delegate ends up becoming a no-op (and that expectation passes), but the test then fails on Posterous::Foo.should_receive(:complicated_stuff). This makes sense: complicated_stuff is called INSIDE of do_something_delegate, and since do_something_delegate is mocked, its real body never runs. Hence complicated_stuff is never called and that expectation fails.

And actually, that's what you want. Unit tests at this level should be simple and should test only one layer of your call stack at any given point. Integration/controller tests can be used for end-to-end testing, but this kind of mock behavior is ideal because it forces you to test only that method call and the behavior contained within.
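
So the fix is to spec each layer on its own. A sketch along those lines:

it "should delegate to do_something_delegate" do
  Posterous::Foo.should_receive(:do_something_delegate)
  Posterous::Foo.do_something
end

it "should run complicated_stuff from the delegate" do
  Posterous::Foo.should_receive(:complicated_stuff)
  Posterous::Foo.do_something_delegate
end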

Sometimes the best lessons come from banging your head against a problem until you radically change the way you think.