
Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way

There are a bunch of basic functional elements to building out a popular Rails app that I've never really seen explained in one place, but we had to learn the hard way while building Posterous. Here's a rundown of what we've learned, in the hopes that some Google linkjuice may bring an intrepid wanderer to this forlorn part of the woods and help you out.

Static Storage
S3 is awesome. Yes, you can host everything off S3. Is it a little more expensive? Probably. But if you're engineer-constrained (and if you're a startup, you absolutely are), set it and forget it. If you absolutely must have local storage across app servers, then MogileFS, GFS, HDFS, or even NFS (yuck) are possibilities. For alternatives to S3, Mosso is supposed to be good too.

Images, files, whatever. Just drop it there. People say a lot of stuff about the Cloud, but it's real and a game changer for anyone doing user generated content.

HTTP Cache Control
The HTTP protocol lets you tell browsers what static content they can cache. You set this in Apache. Rails will automatically put timestamps in the IMG / JavaScript / CSS tags, assuming you're using the helpers. The Firefox plugins YSlow and Firebug are your friends here. The improvement is significant and well worth your time, especially if you add gzipping. A 100KB initial page load can be brought down to 5K (just the HTML file) on subsequent clicks around your site.
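For reference, here's a sketch of what this looks like in an Apache config, assuming a stock Apache 2.x with mod_expires and mod_deflate enabled (adjust the file extensions to taste):

```apache
# Far-future expiry for timestamped/fingerprinted assets
<FilesMatch "\.(css|js|png|jpg|gif)$">
  ExpiresActive On
  ExpiresDefault "access plus 1 year"
</FilesMatch>

# Compress text responses on the way out
AddOutputFilterByType DEFLATE text/html text/css application/javascript
```

The far-future expiry is safe precisely because Rails appends timestamps to asset URLs: when a file changes, the URL changes, and the browser fetches fresh.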

Full Text Search

You're not going to run full text search out of your DB. It's totally not worth it to roll anything custom here. Sphinx with the ThinkingSphinx plugin is probably your best bet. If you have more than one app server, you'll want to use this. Alternatively, Solr with acts_as_solr can be used if you're a Java geek / have previous Lucene/Solr experience.
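For flavor, a ThinkingSphinx index is declared right in the model. A minimal sketch, using a hypothetical Post model and the classic define_index API:

```ruby
class Post < ActiveRecord::Base
  define_index do
    indexes title, :sortable => true
    indexes body
    has user_id, created_at
  end
end

# Then, once searchd is running and the index is built:
#   Post.search "some query"
```

This is declarative configuration; the actual work happens in the Sphinx daemon, off your app servers.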

Storage engine matters, and you should probably use InnoDB

MyISAM is marginally faster for reads, but InnoDB will make you more crash-resistant and will not lock tables on writes. Read about the difference, because when your servers are on fire, you will realize MySQL feels like a pretty thin layer of goop on top of your storage engine. MyISAM is actually the default on MySQL, which makes sense for most crappy phpBB installations -- but probably not good enough for you. The default can hurt you.

Oh yeah, and if you can start with some replication in place, do it. You'll want at least one slave for backups anyway.

Fix your DB bottlenecks with query_reviewer and New Relic
This basically saves your ass completely. Everyone complains that Rails is slow. Rails is not slow, just like Java Swing is not slow. Rails makes it easy to shoot yourself in the face. If you do follow-the-textbook-example bumbling around with Rails ActiveRecord objects, you will end up with pages that drive 100 queries and take several seconds to return.

Above is a screenshot from query_reviewer. It tells you every single query being run, and alerts you to queries that use temporary tables, file sorts, and/or are just damn slow.

In a nutshell, you need indexes to avoid full table scans. The traditional way is to run EXPLAIN manually on queries coming out of your dev log. Query_reviewer lets you see it all right there in the left corner of your web browser. It's brilliant. You also need to eager load the associations you will use in your views by passing :include to your ActiveRecord find call, so that you can batch up SQL queries instead of destroying your DB server with 100 queries per dynamic page.
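Concretely, the two fixes look something like this in Rails 2.x (the Post/Comment models are hypothetical):

```ruby
# In a migration: an index so lookups by user_id stop doing full table scans
add_index :posts, :user_id

# N+1 queries: one for the posts, then one more per post
# the moment your view touches post.comments
@posts = Post.find(:all)

# Eager loading: Rails batches this into a couple of queries up front
@posts = Post.find(:all, :include => :comments)
```

Same page, same output -- but the second version keeps your query count flat as the page grows.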

New Relic is new for us, but it helps us see what is really happening on our production site. If your site is on fire, it's a freaking beautiful gift from the heavens above. You'll see exactly which controllers are slow, which servers in your cluster are struggling, what the load is on all your machines, and which queries are slow.

Memcache later
If you memcache first, you will never feel the pain and never learn how bad your database indexes and Rails queries are. What happens when scale gets so big that your memcache setup is dying? Oh, right, you're even more screwed than you would have been if you'd gotten your DB right in the first place. Also, if this is your first time scaling Rails / a db-driven site, there's only one way to learn how, and putting it off 'til later probably isn't the way. Memcache is like a bandaid for a bullet hole -- you're gonna die.

You're only as fast as your slowest query.
If you're using nginx or Apache as a load balancer in front of a pack of mongrels (or thins or whatever else is cool/new/hip), then each of those mongrels acts like a queue. The upshot is that if you EVER have a request that takes a long time to finish, you're in a world of hurt. Say you have 4 mongrels, and Request A comes in to port 8000 and takes 10 seconds. The load balancer is naive and keeps passing requests to port 8000 even though that port is busy. (Note: This might help, but we don't use it)

Then what happens? Sad town happens. 1 in 4 requests after Request A will go to port 8000, and all of those requests will wait in line as that mongrel chugs away at the slow request. The effective wait time on a quarter of your requests in that 10-second window can be as long as 10 seconds, even though each would normally take only 50 msec!
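You can see the effect with a toy simulation in plain Ruby. The assumptions here are made up for illustration: 4 mongrels, a request arriving every 50 msec, each normally served in 50 msec, and the very first request taking 10 seconds:

```ruby
# Toy model: 4 mongrels behind a naive round-robin balancer.
NUM_MONGRELS = 4
FAST = 0.05   # normal request time in seconds
SLOW = 10.0   # the one pathological request

busy_until = Array.new(NUM_MONGRELS, 0.0)
waits = []

200.times do |i|
  arrival = i * FAST
  mongrel = i % NUM_MONGRELS        # the balancer ignores whether the port is busy
  start   = [arrival, busy_until[mongrel]].max
  waits << start - arrival
  busy_until[mongrel] = start + (i == 0 ? SLOW : FAST)
end

puts waits.max.round(2)  # => 9.8 -- nearly the full 10 seconds, on a "50 msec" site
```

One slow request, and a quarter of your traffic eats almost its entire duration as queue time.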

Enter the wonderful mongrel proctitle. Now, you can see exactly what is blocking your mongrels. I keep this on a watch in a terminal at all times. It's what I look at immediately if our constant uptime tests tell us something's wrong. Super useful.

The answer is: a) run some mongrels dedicated to slow-running jobs (meh), or b) run Phusion Passenger, or c) run slow stuff offline... which leads us to...

Offline Job Queues
So you gotta send some emails. Or maybe denormalize your DB. Or resize photos, or transcode video or audio. How do you do it in the 200 msec you have to return a web request? You don't. You use Workling or Delayed Job or Nanite. It'll happen outside of your mongrels and everyone will be happier.
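The core idea -- the web request only enqueues, a separate worker does the slow part -- can be sketched in plain Ruby with the stdlib Queue (the real tools just do this across processes and machines):

```ruby
require 'thread'

jobs    = Queue.new
results = Queue.new

worker = Thread.new do
  loop do
    job = jobs.pop
    break if job == :shutdown
    # Pretend this is the slow bit: sending mail, resizing a photo, transcoding...
    results << "done: #{job[:task]}"
  end
end

# The "web request" side returns immediately after these pushes:
jobs << { :task => 'send_welcome_email' }
jobs << { :task => 'resize_photo' }

jobs << :shutdown
worker.join
puts results.size  # => 2
```

Workling and friends add the important production pieces on top: persistence, multiple worker machines, and retry semantics.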

I don't know why people don't talk about this more, because if you run a site that basically does anything, you need something like this. It *should* be a part of Rails, but isn't. It isn't a part of Rails in the same way that SwingWorker in Java wasn't a part of Java Swing core like forever, even though it absolutely had to be.

If you don't monitor it, it will probably go down, and you will never know.
Test your site uptime -- not just ping, but actual real user requests that hit the DB. Sure, you could use Pingdom if you're lazy, but it seriously takes like 10 lines of Ruby code to write an automated daemon that runs, does a user action, and checks that your site is not hosed. open-uri is your friend. You don't know if you're up if you're not checking. Do not tolerate downtime.
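A sketch of that DIY check (the URL and marker string below are hypothetical; point it at a page that actually hits your database):

```ruby
require 'open-uri'

# "Healthy" here just means the page rendered all the way to the end.
def site_healthy?(body)
  !body.nil? && body.include?('</html>')  # or any marker your page always renders
end

# In a cron'd daemon you would do something like:
#   body = open('http://yoursite.com/some_db_backed_page').read rescue nil
#   page_someone! unless site_healthy?(body)

puts site_healthy?('<html><body>all good</body></html>')  # => true
puts site_healthy?('502 Bad Gateway')                     # => false
```

The check is deliberately dumb -- a full page render exercises Rails, the DB, and the mongrels all at once, which is exactly what ping doesn't.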

Also, use god for mongrel and process monitoring. Mongrels die or go crazy. You gotta keep them in their place. (What's funny is that god leaks memory over time with Ruby 1.8.6 *sigh*). Munin, monit, and nagios are also great to have.

Keep an eye on your resources -- IO ok? Disk space? It's the worst thing ever to have your site crash because you forgot to clean the logs or you ran out of disk space. Make cronjobs for cleaning all logs and temp directories, so that you can set it and forget it. Because you will forget, until you are reminded in the worst way.

Read the source, and cut back the whining
You will learn more reading the source and debugging / fixing bugs in plugins and sometimes Rails itself than a) complaining on a mailing list or b) whining about shit on your twitter. It's Ruby open source code -- if it's broken, there's a reason. There's a bug, or you're doing it wrong. Fix it yourself, drop it into a github fork, and submit back.

Beware old plugins
They don't work well. And they sit around on Google sucking up time and effort. Acts as paranoid is one. They look legit, with famous names attached to them. Don't fall for it. Insist on using code that has been updated recently. Rails changes pretty fast, and plugins that don't get updated will waste your time, cause random bugs, and basically make your life crap.

Github is new on the scene and has totally revolutionized Rails. When in doubt, search Github. If it's not on Github, it's probably dead/not-maintained. Be wary.

Beware old anything
Actually, if this blog post is older than even 6 months or a year -- you might want to go elsewhere. Rails moves fast. What's hot and "must have" in Rails now may be a total piece of crap / barely-functioning garbage later. Same with any blog post. Be super wary of the Rails wiki. There be dragons -- I mean, really, stuff that references Rails 1.2.6 or earlier!

And that's a wrap.
There's tons more, but this is a pretty decent list of things to watch out for. If you have any suggestions for other things I missed, or questions, please do leave a comment below!

If you liked this article, please try posterous.com and/or follow me on twitter at @posterous and @garrytan!

Posterous is also hiring Rails engineers for full-time positions in San Francisco. We're a small team of all hardcore developers looking for like-minded folks to join up. Well-funded by top-tier VCs and angels. We grew over 10x in 2009!

The Solved Mystery of the Crashing Starling

Starling is the simple queue server that backs Twitter, and Workling is the great queue framework by Rany Keddo that is built on top of Starling (and works with other queue servers, e.g. RabbitMQ). It lets you throw a job on a queue server and have something consume those jobs in a different process / on a different machine. This is great for long-running tasks like sending emails or creating denormalized data, which is what we use Starling/Workling for.

But we ran into catastrophic crashes with Starling (memory use bloating over the course of a day to 1 GB of RAM, then no longer responding to queries). At first we were confused, and a restart of Starling would always solve the problem. This was fine, since jobs that hadn't been completed get flushed to disk and re-read from disk by Starling. Starling is absurdly simple, actually -- it's just a modified memcached server that lets you "throw a job on there": it stores a job id, a hash object of parameters, and that's about it. Using the memcache 'get' command, the Workling client de-queues a job (which removes it from the queue), and using the memcache 'set' command your Rails app can put one on there. First in, first out.

After a while, we realized the problem was actually with the way we used Starling/Workling. Workling has a "return store" that allows you to pass a message back from your running queue that something is "done" -- or it can be used for progress as well.

The offending code was here, inside each one of our asynchronous worker code:
    Workling.return.set(options[:uid], "DONE")   

This looked innocuous enough. You want to be able to know when a given offline job is done, right? However, the Starling return store actually ends up just queuing another "job", with a key unique to options[:uid]. But we weren't ever de-queueing that job -- for many of our jobs we never retrieved the status, so it would be stored and never destroyed. Thus Starling was getting absolutely filled with these one-off return jobs (tens of thousands of them) over the course of a day. After a day or so, it would just give up the ghost.
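The leak is easy to model in plain Ruby. FakeReturnStore below is a hypothetical stand-in for Starling's return store; the key behavior is that reading a key removes it, so a key that's written but never read lives forever:

```ruby
class FakeReturnStore
  def initialize
    @store = {}
  end

  def set(key, value)
    @store[key] = value
  end

  def get(key)
    @store.delete(key)  # like Starling's 'get': reading a key removes it
  end

  def size
    @store.size
  end
end

store = FakeReturnStore.new

# Every completed job dutifully reports DONE...
10_000.times { |i| store.set("job-#{i}", 'DONE') }

# ...but nothing ever calls get, so entries just accumulate:
puts store.size  # => 10000
```

Multiply that by days of traffic and you get Starling slowly eating a gigabyte of RAM.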

We swapped the code to the line below, and we've had 100% uptime for weeks and weeks since. Whenever we actually want the done status, we pass the :set_return_store flag as true.
    Workling.return.set(options[:uid], "DONE") if options[:set_return_store]   

Thank God for open source -- if we couldn't dig through the source, we'd never be able to truly understand what was happening. That's one of the biggest lessons to learn when working deeply with open source: search Google, but when nothing comes up, be sure to read the source and understand what's really happening. Nobody's going to hold your hand through this.

XSS dangers: A classic reason to use image_tag instead of IMG tags in your Rails view code.

This code seems fine in rails view code, right?

    <img src="<%=@image%>" alt='<%=label%>'>

Wrong. What if someone drops a string like this into your user-generated label?

    " onmouseover="alert(document.cookie)

Bad times ensue, because users can inject their own evil JS that actually BREAKS OUT of the tag and runs cross-site scripts. Always use link_to and image_tag when dealing with user-generated text. They automatically escape text that would otherwise break out of the tag string. Special thanks to @Stephen on twitter for being a great security guru.
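You can demonstrate the breakout, and the fix, with Ruby's stdlib ERB::Util (its h() is the same HTML escaping the Rails helpers apply). The label string here is a hypothetical attacker payload:

```ruby
require 'erb'
include ERB::Util  # gives us h()

# A user-supplied "label" that tries to break out of the alt attribute:
label = %q{" onmouseover="alert(document.cookie)}

unsafe = %Q{<img src="/a.png" alt="#{label}">}     # raw interpolation
safe   = %Q{<img src="/a.png" alt="#{h(label)}">}  # escaped, stays inert text

puts unsafe  # the attacker's handler becomes live markup on your page
puts safe    # quotes come out as &quot; -- no breakout
```

The unescaped version renders as `alt="" onmouseover="alert(document.cookie)"` -- a live event handler. The escaped version is just an ugly-looking alt text.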

RSpec's should_receive doesn't act quite the way you would guess!

For anyone more experienced with RSpec, this would be pretty obvious, but it took me some time to realize this was what was causing some crazy behavior that I just plain could not understand.

I'm writing RSpec tests for some code that lives in static methods. For illustration purposes, let's say it looks like this:

    module Posterous
      class Foo
        def self.do_something
          do_something_delegate
        end

        def self.do_something_delegate
          puts "Hello world!"
          complicated_stuff
        end

        def self.complicated_stuff
          # imagine something slow and complicated here
        end
      end
    end

So then I write a simple RSpec test that does this:

    it "should call do_something and its delegate" do
      Posterous::Foo.should_receive(:do_something_delegate)
      Posterous::Foo.should_receive(:complicated_stuff)
      Posterous::Foo.do_something
    end

So what do you expect will happen? I expected that the test would pass -- nothing about the name should_receive suggested to me that the method invocation would change at all.

That's just not how it works, however. Because should_receive is part of the Spec::Mocks lib of RSpec, it actually DOES cause the method do_something_delegate to act like a mock. A mock is an actual stand-in, and in this case it returns nil, because we don't define a return value. We could return one by calling Posterous::Foo.should_receive(:do_something_delegate).and_return("my value here").

So do_something_delegate ends up becoming a no-op (and that expectation passes), but then the test fails on Posterous::Foo.should_receive(:complicated_stuff). This makes sense: complicated_stuff is called INSIDE of do_something_delegate, and since do_something_delegate is mocked, its body is never actually run. Hence complicated_stuff is skipped entirely and that expectation fails.

And actually, that's what you want. Unit tests at this level should be simple, and should test only one layer of your call stack at any given point. Integration/controller tests can be used for end-to-end testing, but this kind of mock behavior is ideal because it forces you to only test that method call and the behavior contained within.

Sometimes the best lessons come from banging your head against a problem until you radically change the way you think.