The Solved Mystery of the Crashing Starling

Starling is the simple queue server that backs Twitter, and Workling is the great queue framework by Rany Keddo that is built on top of Starling (and works with other queue servers, e.g. RabbitMQ). It lets you throw a job on a queue server and have something consume those jobs in a different process or on a different machine. This is great for long-running tasks like sending emails or creating denormalized data, which is what we use Starling/Workling for.

But we ran into catastrophic crashes with Starling: over the course of a day its memory use would bloat to 1 GB of RAM, and it would stop responding to queries. At first we were confused, but a restart of Starling would always solve the problem. This was fine, since Starling flushes jobs that haven't been completed to disk and re-reads them when it comes back up. Starling is absurdly simple, actually -- it's just a queue server that speaks the memcached protocol and lets you "throw a job on there" -- it stores a job id, a hash object of parameters, and that's about it. Using the memcache 'get' command, the Workling client de-queues a job (which removes it from the queue), and using the memcache 'set' command your Rails process can put one on there. First in, first out.
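To make the queue semantics concrete, here is a minimal in-memory sketch of that behavior. `ToyQueueServer` is a hypothetical stand-in written for this post, not Starling's actual implementation, and it skips the network protocol and the disk persistence entirely.

```ruby
# Toy in-memory stand-in for a Starling-style queue server.
# Hypothetical class for illustration only; real Starling speaks the
# memcached protocol over a socket and persists queues to disk.
class ToyQueueServer
  def initialize
    # One FIFO array per queue name, created on first write.
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  # Plays the role of memcache 'set': append a job to the named queue.
  def set(queue_name, job)
    @queues[queue_name].push(job)
    true
  end

  # Plays the role of memcache 'get': remove and return the oldest job,
  # or nil when the queue is empty or has never been written to.
  def get(queue_name)
    return nil unless @queues.key?(queue_name)
    @queues[queue_name].shift
  end

  def size(queue_name)
    @queues.key?(queue_name) ? @queues[queue_name].size : 0
  end
end

server = ToyQueueServer.new
server.set("emails", { :uid => "a1", :to => "first@example.com" })
server.set("emails", { :uid => "b2", :to => "second@example.com" })

first  = server.get("emails") # the job with uid "a1", enqueued first
second = server.get("emails") # the job with uid "b2"
empty  = server.get("emails") # nil, because each read removed a job
```

The important property is that a read is destructive: once a consumer has fetched a job, the server no longer holds it.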

After a while, we realized the problem was actually with the way we used Starling/Workling. Workling has a "return store" that allows you to pass a message back from your running queue that something is "done" -- or it can be used for progress as well.

The offending code was here, inside each of our asynchronous workers:
    Workling.return.set(options[:uid], "DONE")   

This looked innocuous enough. You want to be able to know when a given offline job is done, right? However, the Starling return store actually ends up just queuing another "job", with the key being unique to options[:uid]. But we weren't ever "de-queueing" that job: for many of our jobs we never retrieved the status, so it was stored and never destroyed. Starling was getting absolutely filled with these one-off return jobs (tens of thousands of them) over the course of a day. After a day or so, it would just give up the ghost.
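The leak is easy to reproduce in a toy model. The sketch below assumes a return store that keeps one small queue per uid and only frees it when someone reads it; `ToyReturnStore` is a hypothetical class for illustration, not Workling's or Starling's code.

```ruby
# Toy return store: every uid gets its own one-item queue, and only a
# read removes it. Hypothetical class, not the real Workling return store.
class ToyReturnStore
  def initialize
    @queues = {}
  end

  # Writing a status creates (or appends to) the queue for this uid.
  def set(uid, value)
    (@queues[uid] ||= []).push(value)
  end

  # Reading a status removes it; an emptied queue is dropped entirely.
  def get(uid)
    queue = @queues[uid]
    return nil if queue.nil?
    value = queue.shift
    @queues.delete(uid) if queue.empty?
    value
  end

  # Number of per-uid queues the server is still holding on to.
  def pending
    @queues.size
  end
end

store = ToyReturnStore.new
uids = (1..10_000).map { |n| "job-#{n}" }

# Every worker reports "DONE", as in the offending line above.
uids.each { |uid| store.set(uid, "DONE") }
leaked = store.pending

# Only a quarter of the callers ever come back to ask for the status.
uids.first(2_500).each { |uid| store.get(uid) }
remaining = store.pending

puts "held after all writes: #{leaked}"
puts "still held after the reads: #{remaining}"
```

Every status that nobody reads stays resident, so the number of held entries grows with the number of fire-and-forget jobs rather than with the number of jobs in flight.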

We swapped the code to this, and now we've had 100% uptime for weeks and weeks. Whenever we do want the done status, we also pass in the option 'set_return_store', set to true.
    Workling.return.set(options[:uid], "DONE") if options[:set_return_store]   
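Here is a rough sketch of how that guard changes what gets left behind. The `finish_job` helper and the hash-backed `store` are hypothetical stand-ins for a worker method and the return store, so treat this as a model of the idea and not as our production code.

```ruby
# Hash-backed stand-in for the return store: one array of values per uid.
store = Hash.new { |hash, uid| hash[uid] = [] }

# Hypothetical tail end of a worker method: only report status when the
# caller asked for it by passing :set_return_store => true.
def finish_job(store, options)
  store[options[:uid]].push("DONE") if options[:set_return_store]
end

# Fire-and-forget jobs: nobody will ever read these statuses.
1_000.times { |n| finish_job(store, :uid => "bulk-#{n}") }

# A job whose caller does want to know when it is finished.
finish_job(store, :uid => "report-42", :set_return_store => true)

held_before_read = store.size

# The caller that asked for the status also reads it, which frees it.
status = store["report-42"].shift
store.delete("report-42") if store["report-42"].empty?

held_after_read = store.size
```

The point of the flag is that a status is only written when a reader is guaranteed to exist, so writes and reads stay paired.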

Thank God for open source -- if we couldn't dig through the source, we'd never be able to truly understand what was happening. That's one of the biggest lessons to learn when working deeply with open source: search Google, but when nothing comes up, be sure to read the source and understand what's really happening. Nobody's going to hold your hand through this.