Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way

There are a bunch of basic functional elements to building out a popular Rails app that I've never really seen explained in one place, but we had to learn the hard way while building Posterous. Here's a rundown of what we've learned, in the hopes that some Google linkjuice may bring an intrepid wanderer to this forlorn part of the woods and help you out.

Static Storage
S3 is awesome. Yes, you can host everything off S3. Is it a little more expensive? Probably. But if you're engineer constrained (and if you're a startup, you absolutely are) -- set it and forget it. If you absolutely must have local storage across app servers, then MogileFS, GFS, or HDFS or even NFS (yuck) are possibilities. For alternatives to S3, Mosso is supposed to be good too.

Images, files, whatever. Just drop it there. People say a lot of stuff about the Cloud, but it's real and a game changer for anyone doing user generated content.


HTTP Cache Control
HTTP lets you tell browsers what static content they can cache. You set this in Apache. Rails will automatically put timestamps in the IMG / JavaScript / CSS tags, assuming you're using the helpers. The Firefox plugin YSlow, coupled with Firebug, is your friend here. The improvement is significant and well worth your time, especially if you add gzip'ing. A 100KB initial page load can be brought down to 5KB (just the HTML file) on subsequent clicks around your site.
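
To make the timestamp trick concrete, here is a minimal Ruby sketch of the idea. It is not the Rails implementation: timestamped_asset_path is a hypothetical helper, and the file names are made up. The point is only that the URL changes whenever the file changes, so you can let browsers cache the old URL for a long time.

```ruby
require "tmpdir"

# Hypothetical helper (not the Rails implementation): append the file's
# modification time to its public path, so the URL changes when the file does.
def timestamped_asset_path(public_dir, asset_path)
  file = File.join(public_dir, asset_path)
  return asset_path unless File.exist?(file) # nothing to stamp
  "#{asset_path}?#{File.mtime(file).to_i}"
end

# Demonstration in a throwaway directory standing in for public/.
Dir.mktmpdir do |public_dir|
  File.write(File.join(public_dir, "app.css"), "body { margin: 0 }")
  puts timestamped_asset_path(public_dir, "app.css")    # app.css?<mtime in seconds>
  puts timestamped_asset_path(public_dir, "missing.js") # left unchanged
end
```

Because the query string changes on every deploy that touches the file, a far-future expiry on static files should not leave users stuck with stale CSS or JavaScript.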


Search
You're not going to run full text search out of your DB. It's totally not worth it to roll anything custom here. Sphinx with the ThinkingSphinx plugin is probably your best bet. If you have more than one app server, you'll want to use this. Alternatively, Solr with Acts as Solr can be used if you're a Java geek or already have Lucene/Solr experience.


Storage engine matters, and you should probably use InnoDB

MyISAM is marginally faster for reads, but InnoDB will make you more crash resistant and will not lock tables on writes. Read about the difference, because when your servers are on fire, you will realize MySQL feels like a pretty thin layer of goop on top of your storage engine. MyISAM is actually the default on MySQL, which makes sense for most crappy phpBB installations -- but probably not good enough for you. The default can hurt you.

Oh yeah, and if you can start with some replication in place, do it. You'll want at least one slave for backups anyway.

Fix your DB bottlenecks with query_reviewer and New Relic
This basically saves your ass completely. Everyone complains that Rails is slow. Rails is not slow, just like Java Swing is not slow. Rails makes it easy to shoot yourself in the face. If you do follow-the-textbook-example bumbling around with Rails ActiveRecord objects, you will end up with pages that drive 100 queries and take several seconds to return.

Above is a screenshot from query_reviewer. It tells you every single query being run, and alerts you to things that use temporary tables, file sorts and/or just damn slow queries.

In a nutshell, you need indexes to avoid full table scans. The traditional way is to run EXPLAIN manually on queries coming out of your dev log. Query_reviewer lets you see it all right there in the left corner of your web browser. It's brilliant. You also need to eager load the associations that you will use in your views by passing :include to your ActiveRecord find method call, so that you can batch up SQL queries instead of destroying your DB server with 100 queries per dynamic page.
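
Here's a rough sketch of why eager loading matters, using a fake in-memory "database" that just counts queries. FakeDB and its methods are made up for illustration; in a Rails app of this era the real fix would be something like Post.find(:all, :include => :comments), where Post and comments are hypothetical model names.

```ruby
# Stand-in for a database that counts how many queries it receives.
class FakeDB
  attr_reader :query_count

  def initialize(comments_by_post)
    @comments_by_post = comments_by_post
    @query_count = 0
  end

  # One query per post, like: SELECT * FROM comments WHERE post_id = ?
  def comments_for(post_id)
    @query_count += 1
    @comments_by_post.fetch(post_id, [])
  end

  # One query for all posts, like: SELECT * FROM comments WHERE post_id IN (...)
  def comments_for_all(post_ids)
    @query_count += 1
    post_ids.map { |id| [id, @comments_by_post.fetch(id, [])] }.to_h
  end
end

post_ids = (1..100).to_a
data = post_ids.map { |id| [id, ["comment on post #{id}"]] }.to_h

# What a naive view does: one query per post as it renders each row.
lazy = FakeDB.new(data)
post_ids.each { |id| lazy.comments_for(id) }
puts "without eager loading: #{lazy.query_count} queries" # 100

# What eager loading does: fetch everything the view needs in one batch.
eager = FakeDB.new(data)
eager.comments_for_all(post_ids)
puts "with eager loading: #{eager.query_count} query"     # 1
```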

New Relic is new for us, but it helps us see what is really happening on our production site. If your site is on fire, it's a freaking beautiful gift from the heavens above. You'll see exactly what controllers are slow, which servers in your cluster, how load is on all your machines, and which queries are slow.

Memcache later
If you memcache first, you will never feel the pain and never learn how bad your database indexes and Rails queries are. What happens when scale gets so big that your memcache setup is dying? Oh, right, you're even more screwed than you would have been if you got your DB right in the first place. Also, if this is your first time scaling Rails or a DB-driven site, there's only one way to learn how, and putting it off until later probably isn't the way. Memcache is like a bandaid for a bullet hole -- you're gonna die.



You're only as fast as your slowest query.
If you're using nginx or Apache as a load balancer in front of a pack of mongrels (or thins or whatever else is cool/new/hip), then each of those mongrels acts like a queue. The upshot is that if you EVER have a request that takes a long time to finish, you're in a world of hurt. So say you have 4 mongrels, and Request A comes in to port 8000 and it takes 10 seconds. The load balancer is naive and keeps passing requests to Port 8000 even though that port is busy. (Note: This might help, but we don't use it)

Then what happens? Sad town happens. 1 in 4 requests after Request A will go to port 8000, and all of those requests will wait in line as that mongrel chugs away at the slow request. Effective wait time on 1/4th of your requests in that 10 second period may be as long as 10 seconds, even if normally it should only take 50msec!
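
If you want to see the damage in numbers, here is a small simulation of a naive round-robin balancer in front of single-threaded workers. The workload is an assumption for illustration (one request every 100 msec, 50 msec each, except Request A at 10 seconds), not a measurement from our site.

```ruby
# Simulate a naive round-robin balancer in front of single-threaded workers.
# Each request is [arrival_ms, service_ms]; returns total latency per request.
def simulate_round_robin(requests, worker_count)
  free_at = Array.new(worker_count, 0) # time at which each worker is next idle
  requests.each_with_index.map do |(arrival, service), i|
    worker = i % worker_count          # naive: ignores whether the worker is busy
    start = [arrival, free_at[worker]].max
    free_at[worker] = start + service
    free_at[worker] - arrival          # time spent waiting plus time being served
  end
end

# Assumed workload: one request every 100 ms, each taking 50 ms,
# except the first one (Request A), which takes 10 seconds.
requests = Array.new(40) { |i| [i * 100, 50] }
requests[0] = [0, 10_000]

latencies = simulate_round_robin(requests, 4)
delayed = latencies.drop(1).count { |ms| ms > 50 }
puts "requests stuck behind Request A: #{delayed} of #{requests.size - 1}"
puts "worst latency after Request A: #{latencies.drop(1).max} ms"
```

With these assumed numbers, 9 of the 39 later requests land on the busy worker, and the first of them waits 9,650 msec for what should have been a 50 msec response.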

Enter the wonderful mongrel proctitle. Now, you can see exactly what is blocking your mongrels. I keep this on a watch in a terminal at all times. It's what I look at immediately if our constant uptime tests tell us something's wrong. Super useful.

The answer is: a) run some mongrels dedicated to slow running jobs (meh) or b) run Phusion Passenger, or c) run slow stuff offline... which leads us to...

Offline Job Queues
So you gotta send some emails. Or maybe denormalize your DB. Or resize photos, or transcode video or audio. But how do you do it in the 200msec that you need to return a web request? You don't. You use Workling or Delayed Job or nanite. It'll happen outside of your mongrels and everyone will be happier.
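
The shape of the idea, as a minimal sketch: the request handler only enqueues the slow work and returns. This is an in-process illustration using Ruby's built-in Queue and a thread, not how Workling, Delayed Job or nanite are implemented; those run workers in separate processes and persist the jobs. handle_signup and the email text are hypothetical.

```ruby
require "thread"

# In-process illustration only: a background thread drains a queue so the
# "request" returns without waiting. Real setups use separate worker processes.
jobs = Queue.new
results = Queue.new

worker = Thread.new do
  # Queue#pop blocks until a job arrives; :stop is our shutdown signal.
  while (job = jobs.pop) != :stop
    results << job.call
  end
end

# Hypothetical request handler: enqueue the slow work and return right away.
def handle_signup(jobs, email)
  jobs << -> { "welcome email sent to #{email}" } # stands in for slow work
  "200 OK"
end

puts handle_signup(jobs, "user@example.com") # does not wait for the job
jobs << :stop
worker.join
puts results.pop
```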

I don't know why people don't talk about this more, because if you run a site that basically does anything, you need something like this. It *should* be a part of Rails, but isn't. It isn't a part of Rails in the same way that SwingWorker in Java wasn't a part of Java Swing core like forever, even though it absolutely had to be.

If you don't monitor it, it will probably go down, and you will never know.
Test your site uptime, not just ping but actual real user requests that hit the DB. Sure, you could use pingdom if you're lazy, but it seriously takes like 10 lines of ruby code to write an automated daemon that runs, does a user action and checks that your site is not hosed. open-uri is your friend. You don't know if you're up if you're not checking. Do not tolerate downtime.
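
Something along these lines is all it takes. This is a sketch under assumptions: site_up? is a made-up name, the URL and expected text are placeholders, and the fetcher is injectable so the demo below runs without touching the network. On a current Ruby, open-uri is called as URI.open.

```ruby
require "open-uri"

# Returns true when the page loads and contains text that should only appear
# if the app and its database are healthy. `fetch` is injectable for testing.
def site_up?(url, expected_text, fetch = ->(u) { URI.open(u, read_timeout: 10).read })
  fetch.call(url).include?(expected_text)
rescue StandardError => e
  warn "check failed for #{url}: #{e.class}: #{e.message}"
  false
end

# Demonstration with stub fetchers, so the example runs without a network.
healthy = ->(_url) { "<html>Latest posts</html>" }
broken  = ->(_url) { raise IOError, "connection refused" }
puts site_up?("http://example.com/", "Latest posts", healthy) # true
puts site_up?("http://example.com/", "Latest posts", broken)  # false
```

Pick a page whose expected text comes out of the database, run the check in a loop or from cron, and alert yourself whenever it returns false.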

Also, use god for mongrel and process monitoring. Mongrels die or go crazy. You gotta keep them in their place. (What's funny is that god leaks memory over time with Ruby 1.8.6 *sigh*). Munin, monit, and nagios are also great to have.

Keep an eye on your resources -- IO ok? Disk space? It's the worst thing ever to have a site crash because you forgot to clean the logs or you ran out of disk space. Make cronjobs for cleaning all logs and temp directories, so that you can set it and forget it. Because you will forget, until you are reminded in the worst way.
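
A cleanup job can be as small as the sketch below. clean_old_files is a hypothetical helper, and the file names, glob pattern and 7-day cutoff are assumptions; point it at your own log and temp directories and run it from cron.

```ruby
require "fileutils"
require "tmpdir"

# Delete regular files matching `pattern` under `dir` whose modification time
# is more than `max_age_days` old. Returns the list of deleted paths.
def clean_old_files(dir, pattern, max_age_days, now = Time.now)
  cutoff = now - max_age_days * 24 * 60 * 60
  old = Dir.glob(File.join(dir, pattern)).select do |path|
    File.file?(path) && File.mtime(path) < cutoff
  end
  FileUtils.rm_f(old)
  old
end

# Demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  stale = File.join(dir, "production.log.1")
  fresh = File.join(dir, "production.log")
  FileUtils.touch([stale, fresh])
  File.utime(Time.now, Time.now - 30 * 86_400, stale) # pretend it is 30 days old
  deleted = clean_old_files(dir, "*.log*", 7)
  puts deleted.map { |path| File.basename(path) }.inspect # ["production.log.1"]
end
```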

Read the source, and cut back the whining
You will learn more reading the source and debugging / fixing bugs in plugins and sometimes Rails itself than a) complaining on a mailing list or b) whining about shit on your twitter. It's Ruby open source code -- if it's broken, there's a reason. There's a bug, or you're doing it wrong. Fix it yourself, drop it into a github fork, and submit back.


Beware old plugins
They don't work well. And they sit around on Google sucking up time and effort. Acts as paranoid is one. They look legit, with famous names who created them. Don't fall for it. Insist on using code that has been updated recently. Rails changes pretty fast, and plugins that don't get updated will waste your time, cause random bugs, and basically make your life crap.

Github is new on the scene and has totally revolutionized Rails. When in doubt, search Github. If it's not on Github, it's probably dead/not-maintained. Be wary.

Beware old anything
Actually, if this blog post is older than even 6 months or 1 year -- you might want to go elsewhere. Rails moves fast. What's hot and "must have" in Rails now may be totally a piece of crap / barely functioning garbage later. Same with any blog posts. Be super wary of the Rails wiki. There be dragons -- I mean, really stuff that references Rails 1.2.6 or earlier!

And that's a wrap.
There's tons more stuff, but this is a pretty decent list of stuff to watch out for. If you have any suggestions for other things I missed, or questions, please do leave a comment below!

If you liked this article, please try posterous.com and/or follow me on twitter at @posterous and @garrytan!

Posterous is also hiring Rails engineers for full time positions in San Francisco. We're a small team of all hardcore developers and looking for like minded folks to join up. Well-funded by the top-tier VC's and angels. We grew over 10x in 2009!

55 responses
Pretty impressive the wide variety of stuff that you have to keep your eye on just to *run* the site, much less make changes to it!

Makes me realize that this young grasshopper has much to learn ...

Excellent article, spot on as far as I can tell. We're not so far down the scaling rabbit-hole yet, and our architecture is subtly different, but this all rings true.

One thing about the band-aid... I consider database indexes to be similarly band-aid-y. Why? Because you almost never remove them once you've added them, you can't add them ad infinitum (eventually they'll slow your writes to a crawl) and they hide slow queries that, sometimes, should not be happening at all.

My first approach when faced with a slow query is to figure out if the query can be avoided completely (surprisingly often, it can). If it can't, then I'll consider an index... but only as a second-to-last resort, just before caching.

This was seriously the most useful blog post I've read all month.
You should consider HAproxy between your front-end web servers and application servers (Mongrel, Thin, whatever). We've been using it for a long time, and it works great.

I covered it in a screencast a few months ago:
http://www.37signals.com/svn/posts/1073-nuts-bolts-haproxy

amazing write up. also agree, this is one of the most useful tech articles i've read in a very long time.
Excellent post!

Wrt. static content that is small files you should definitely consider Amazon's CloudFront CDN for better latency than Amazon's S3.

Re: You're only as fast as your slowest query. Have you yet tried any performance testing with using the Apache 2.2 load balancing algorithm bybusyness?
Great post. HAProxy with maxconn set to 1 would be a better load balancing option (as someone earlier said).

Also, earlier this month I put out a free series of screencasts on Scaling Rails which may be useful for people reading this article:

http://railslab.newrelic.com/scaling-rails

That was a really good post. Hope to read more of this kind.
Thanks for the awesome info, Garry. Just pushed out a copy of the rails caching guide and was wondering if you had specific examples of memcached failing? I've kinda been leaning towards pushing more and more into the cache, since it turns out to be a lot easier to scale than multiple master/slave db's.
Memcache doesn't fail -- it just hides problems. =)

But it absolutely makes sense to memcache before you shard. Partitioning is a bitch. Arguably you should memcache before you even start denormalizing, but that can be impossible depending on the particular problem.

Garry, great post. We've pushed back on memcache to date but we still expect to have to shard at some point... but we're using PostgreSQL. Anyone have war stories they'd like to share about PostgreSQL?
First, I'd like to recommend beanstalk (http://code.google.com/p/beanstalk/) over workling and the others, as it is way more robust in the face of worker breakage (and your workers /will/ die, and jobs really shouldn't get lost).

And now for some PostgreSQL war stories: I've got a few. I really like the query analysis tools; "explain select ..." and "explain analyze select ..." are way better out of the box than what MySQL comes with by default. It's a bit harder to tune for performance than MySQL, but if you pay careful attention to what explain and explain analyze tell you (and http://explain.depesz.com/new is a helpful visualizer for what happens in explain analyze), you can get very comparable speeds out of it, in addition to the confidence that your data is secure. More on that in the next paragraph.

PostgreSQL has a facility for Point-In-Time Recovery (PITR). This has saved asses. Also, the manual for it is an extremely good read: http://www.postgresql.org/docs/8.3/interactive/continuous-archiving.html
What it does is this: postgres keeps a log of what it will write to the database proper, and when that is full (at a "checkpoint"), will write to the database. It then throws away the write-ahead log. However, you can archive the write-ahead log. And if you do that, you can make live "tar" backups (or if you're using EBS, instant volume snapshots) of your postgresql database that can be restored to any point in time between the last snapshot and ~5 minutes before a failure occurred, at any time. This has saved my ass at least once, and it lets me sleep soundly at night.

The third thing about postgresql I want to praise is that it can help enforce constraints on your data. Sure, you can do validates_foo and do :dependent => :destroy to emulate foreign keys, but if your rails process dies while it does one of these operations (and believe me, it will die), you are left with inconsistent data.

Oh, and don't get me started on dates like the '0000-00-00 00:00:00' we had in our final mysql before the move... (-:

Good article. We've had to address almost all of these issues in our startup and our solutions are basically the same as yours.

Here is a MUST if you are using MySQL with InnoDB. Increase the buffer_pool_size! This one simple change has increased our DB performance by over 1000%. Not kidding at all!
It is explained here:
http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimizatio...

http://www.mysqlperformanceblog.com/2007/11/03/choosing-innodb_buffer_pool_size/

And if you aren't using NewRelic, signup now! It's a simple plugin, they have a free version, and you'll wish you had done it months ago.

I'm curious if your comment about not using a database for full text search takes postgres tsearch into account. Tsearch is a well-designed full text search implementation that performs on par with solr and requires no extra setup/programming if you use acts_as_tsearch.
edraut: Tsearch is cool, and we used it for a few months, but it
turned out to be just too big a resource hog for our usage scenario.
Updating the index as soon as data changes was just too expensive in
terms of IO produced, was not necessary most of the time, and querying
the index was not as fast or flexible as we would have needed.
 
Also, sphinx offers nice things such as substring matching, filters
(to reduce the result set by non-text-matching criteria) and sorted &
paginated results, all of them things we just couldn't get tsearch to
do either at all or quickly. Sorting and pagination, for example: It
will always do a bitmap index scan on the gin or gist index, computing
all results that match the text search criteria, then filter these
results (often millions, sometimes up to 90% of our data) by the other
criteria: Often, this took up to several minutes to finish on
not-unreasonably big data sets. On the same hardware, sphinx delivers
these same results in milliseconds, with the trade-off that records
younger than 15 minutes are not found.
 
This is a trade-off that we consider worth the cost.
Thanks for sharing your experience with tsearch. What you describe is pretty grim, it makes tsearch sound unusable. I've used it with success before, so I'm wondering if there aren't some configuration/tuning steps that might have alleviated the indexing problem. I've heard that moving the tsearch columns into a separate table can solve the indexing problem, but I never needed that.

Regardless, after your praise of sphinx I read about it, and I'm definitely going to give it a try. If it works as advertised, and if thinking-sphinx is solid, it's worth switching over.

Thanks for all the great advice!

Yeah, I'm a little surprised by the grim overtones actually, I've found the built in text search in Postgres scales much better than what you're describing here. I wonder, were you storing the tsvectors and indexing those, or building functional indexes on the data sets in question?

BTW, I wrote a more complete comment (which got too wordy to fit nicely as just a comment) describing differences in scaling when you're using Postgres on my own blog @ http://www.xzilla.net/blog/2009/Feb/re-axonflux-on-building-and-scaling-a-sta...

You *were* right to warn people about the Rails Wiki -- it was awfully out of date and had long been abandoned by the community.

That's why one of the first things the Rails Activists (Matt Aimonetti in particular) did was spearhead the creation of an all-new wiki, this time highly supported and overseen by the Activists as a key tool for our community.

We have an active Google Group to help organize and the wiki is already loaded with current information. It's even being translated into several languages.

Check out the new wiki at the usual URL: http://wiki.rubyonrails.com/

There seem to be some bugs in the text processor here, so I'll try posting the link again:

http://wiki.rubyonrails.com

I like to search for rails blogs and plugins using google advanced search and setting date restriction to "in the past year" or even month
@jonathanrwallace, we use PostgreSQL on RealTravel and it has been fantastic. I love MySQL too and have used both DBs for years. I have learned that it's not so much about Postgres vs MySQL as it is about table design, indexes, and simple queries.

--

The two things that I would add are:

1) Profile
Profile your pages using the new ./script/performance scripts. Run pages at least 1000 times, if not more, to ensure that the app doesn't have any major memory leaks and that the most expensive code bubbles up in the profile data. You will surprise yourself.

2) Write your own cache strategy if needed
Rails offers many caching solutions for actions, pages, etc. Plus there are some good data cache plug-ins: cache-money, cache_model, etc. We found that there were problems with most available solutions, and we had to extend the Rails action cache to suit our needs.

I would also echo that NewRelic is a very good tool. Between monit, nagios, and NewRelic, you can give your app some real handlebars.

Thanks for the post.

Thanks for sharing! Great article, very interesting. I have only dealt with JAVA and PHP up until now, but I am intrigued by Rails.
(Typo!) Your post is linked at www.DrinkRails.com
I see your mention of Pingdom and it being a waste of time. I've heard similar things, however, I use a great service called Yup It's Up. 25 monitors, for only 10 bucks a month, plus 1-month free trial. Check it out! http://yupitsup.com/
great article, thanks
No matter if some one searches for his necessary thing, so he/she wants to bee available that in detail, so that thing is maintained over here.
I constantly emailed this web site post page to all my friends, because if like to read it next my friends will too.