Why the fair-proxy balancer for nginx is not a viable solution for Mongrel long-query queues. It's actually a disaster.

The fair-proxy balancer patch for nginx can keep new connections away from busy Mongrels, but if all the (concurrency-disabled) Mongrels in your pool are busy, you'll still get 502 errors.

We hit this just this morning when we tried to flip on our new and improved HAProxy + nginx + Mongrel cluster architecture. Throughput was fantastic, but I couldn't figure out why I was getting intermittent 502 errors.

Doesn't that swap one relatively annoying problem for a total fail? How is the fair proxy balancer (an nginx mod) even remotely useful if, when the Mongrels are all busy, nginx just returns a 502 Bad Gateway that completely hoses the user?

That's actually INFINITELY worse than just making someone wait 500msec longer than they should for a fast request.

I'll post again about our setup when I have all my ducks in a row, but unless something is wrong with my setup, nginx + fair proxy balancer is utterly unusable in production systems.

I'll be switching to serving static content from nginx on the frontend, and proxying from nginx through HAProxy to my Mongrel cluster. HAProxy will then use the maxconn 1 setting to provide the queue-busting load balancing I wanted in the first place.
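
For reference, the HAProxy side I have in mind is roughly this -- a sketch only; the listen address, ports, and server names are placeholders:

# haproxy.cfg (sketch) -- maxconn 1 tells HAProxy to queue extra requests
# itself rather than hand a second request to a Mongrel that is already busy
listen mongrels 127.0.0.1:8100
    balance roundrobin
    server mongrel_8000 127.0.0.1:8000 maxconn 1 check
    server mongrel_8001 127.0.0.1:8001 maxconn 1 check
    server mongrel_8002 127.0.0.1:8002 maxconn 1 check

With maxconn 1, a slow request ties up one Mongrel while everything else keeps flowing to the free ones, instead of requests stacking up behind a busy port.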

Now what to do about load balancing static content? nginx upstream reverse proxy to another cluster of nginx's? *wince*

Edit: The creator of nginx-ey-balancer contacted me and said hey, use this instead. I already switched to using haproxy for mongrel reverse proxy, but wanted to make sure to link to the github project. Looks to be a non-screwed-up version of what the fair proxy balancer was supposed to be. =) Thanks ry!

11 distributed key-value stores, as reviewed by last.fm engineers

Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren't suitable for low-latency data serving, but are interesting nonetheless.

Name | Language | Fault-tolerance | Persistence | Client protocol | Data model
Project Voldemort | Java | partitioned, replicated, read-repair | pluggable: BerkeleyDB, MySQL | Java API | structured / blob / text
Ringo | Erlang | partitioned, replicated, immutable | custom on-disk (append-only log) | HTTP | blob
Scalaris | Erlang | partitioned, replicated, Paxos | in-memory only | Erlang, Java, HTTP | blob
Kai | Erlang | partitioned, replicated? | on-disk Dets file | Memcached | blob
Dynomite | Erlang | partitioned, replicated | pluggable: couch, dets | custom ASCII, Thrift | blob
MemcacheDB | C | replication | BerkeleyDB | Memcached | blob
ThruDB | C++ | replication | pluggable: BerkeleyDB, custom, MySQL, S3 | Thrift | document-oriented
CouchDB | Erlang | replication, partitioning? | custom on-disk | HTTP, JSON | document-oriented (JSON)
Cassandra | Java | replication, partitioning | custom on-disk | Thrift | Bigtable meets Dynamo
HBase | Java | replication, partitioning | custom on-disk | custom API, Thrift, REST | Bigtable
Hypertable | C++ | replication, partitioning | custom on-disk | Thrift, other | Bigtable

Awesome review of a bunch of possible key-value stores. Tokyo Tyrant is conspicuously missing.

Useful query: MySQL table size, number of rows, and size of indexes

-- Top 20 tables by data + index size (rows in millions, sizes in GB)
SELECT concat(table_schema,'.',table_name),
concat(round(table_rows/1000000,2),'M') rows,
concat(round(data_length/(1024*1024*1024),2),'G') DATA,
concat(round(index_length/(1024*1024*1024),2),'G') idx,
concat(round((data_length+index_length)/(1024*1024*1024),2),'G') total_size,
round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES
ORDER BY data_length+index_length DESC LIMIT 20;


Pretty useful way of seeing how big your datasets are. Thanks to Mike at Backtype for sharing this.
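
If you'd rather have it at your fingertips in script/console, a minimal sketch (assuming ActiveRecord is already connected to the MySQL server you care about; trimmed to just the size columns):

# Run the size report from script/console and print one line per table
sql = <<-SQL
  SELECT concat(table_schema, '.', table_name) tbl,
         concat(round((data_length + index_length) / (1024*1024*1024), 2), 'G') total_size
  FROM information_schema.TABLES
  ORDER BY data_length + index_length DESC
  LIMIT 20
SQL

ActiveRecord::Base.connection.select_all(sql).each do |row|
  puts "#{row['tbl']}  #{row['total_size']}"
end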

Currently contemplating Tokyo Tyrant as a distributed key/value hash DB... or MemcacheDB?

It's all the rage to be using non-sql-based storage software these days. Memcache is great for caching, but what happens when it falls out of cache? Enter Tokyo Tyrant / MemcacheDB.

Sometimes all you really do need is key-value pairs. Is Tokyo Tyrant the answer? Looks like it's being used under heavy load for a few other production sites. Though MemcacheDB is purportedly being used by Digg.

Interestingly, I've found very little online about MemcacheDB vs Tokyo Tyrant. Anyone have some info to share?

High Scalability blog picked up my last article about Rails scaling. AWESOME.

Garry Tan, cofounder of Posterous, lists 12 lessons for scaling that apply to more than just Rails.

The highscalability.com blog is basically the gold-standard, must-read blog for startups of any kind dealing with large scale. What a great honor!

Also: mental note to blog in list format more often. Those seem to go over so incredibly well. =)

Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way

There are a bunch of basic functional elements to building out a popular Rails app that I've never really seen explained in one place, and that we had to learn the hard way while building Posterous. Here's a rundown of what we've learned, in the hopes that some Google linkjuice may bring an intrepid wanderer to this forlorn part of the woods and help you out.

Static Storage
S3 is awesome. Yes, you can host everything off S3. Is it a little more expensive? Probably. But if you're engineer-constrained (and if you're a startup, you absolutely are) -- set it and forget it. If you absolutely must have local storage across app servers, then MogileFS, GFS, HDFS, or even NFS (yuck) are possibilities. As an alternative to S3, Mosso is supposed to be good too.

Images, files, whatever. Just drop it there. People say a lot of stuff about the Cloud, but it's real and a game changer for anyone doing user generated content.
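
As one illustration (not necessarily what Posterous runs), the Paperclip plugin can push uploads straight to S3 in a few lines; the model, bucket, and credentials file here are placeholders:

# app/models/photo.rb -- hypothetical model storing uploads on S3 via Paperclip
class Photo < ActiveRecord::Base
  has_attached_file :image,
    :storage        => :s3,
    :s3_credentials => "#{RAILS_ROOT}/config/s3.yml",  # access_key_id / secret_access_key
    :bucket         => "my-app-uploads",                # placeholder bucket name
    :path           => ":attachment/:id/:style.:extension"
end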


HTTP Cache Control
The HTTP protocol lets you tell browsers what static content they can cache. You set this in Apache. Rails will automatically put timestamps in the IMG / JavaScript / CSS tags, assuming you're using the helpers. The Firefox plugin YSlow, coupled with Firebug, is your friend here. The improvement is significant and well worth your time, especially if you add gzipping: a 100KB initial page load can be brought down to 5KB (just the HTML file) on subsequent clicks around your site.
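
To see the timestamping in action, just reference assets through the standard helpers in your views; Rails appends the file's mtime as a cache-busting query string (output below is illustrative):

# In a view
image_tag "logo.png"
# => <img src="/images/logo.png?1239503510" ... />
stylesheet_link_tag "application"
# => <link href="/stylesheets/application.css?1239503510" ... rel="stylesheet" type="text/css" />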


Search
You're not going to run full-text search out of your DB, and it's totally not worth it to roll anything custom here. The smart money is on Sphinx with the Thinking Sphinx plugin. If you have more than one app server, you'll want to use this. Alternatively, Solr with acts_as_solr can be used if you're a Java geek or have prior Lucene/Solr experience.
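
A rough sketch of the Thinking Sphinx side (the model and fields are made up):

# app/models/post.rb -- hypothetical model indexed with Thinking Sphinx
class Post < ActiveRecord::Base
  define_index do
    indexes title, :sortable => true
    indexes body
    has created_at
  end
end

# Searching is then one call:
Post.search "scaling rails", :page => 1, :per_page => 20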


Storage engine matters, and you should probably use InnoDB

MyISAM is marginally faster for reads, but InnoDB will make you more crash-resistant and won't lock whole tables on writes. Read about the difference, because when your servers are on fire, you will realize MySQL feels like a pretty thin layer of goop on top of your storage engine. MyISAM is actually the default on MySQL, which makes sense for most crappy phpBB installations -- but probably not for you. The default can hurt you.
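
If you'd rather not trust the server default, you can force the engine per table in your migrations -- a sketch with a made-up table and columns:

# Hypothetical migration forcing InnoDB regardless of the MySQL default engine
class CreateComments < ActiveRecord::Migration
  def self.up
    create_table :comments, :options => "ENGINE=InnoDB DEFAULT CHARSET=utf8" do |t|
      t.integer :post_id
      t.text    :body
      t.timestamps
    end
  end

  def self.down
    drop_table :comments
  end
end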

Oh yeah, and if you can start with some replication in place, do it. You'll want at least one slave for backups anyway.

Fix your DB bottlenecks with query_reviewer and New Relic
This basically saves your ass completely. Everyone complains that Rails is slow. Rails is not slow, just like Java Swing is not slow -- Rails just makes it easy to shoot yourself in the face. If you bumble around following the textbook examples with ActiveRecord objects, you will end up with pages that drive 100 queries and take several seconds to return.

query_reviewer tells you every single query being run on a page, and alerts you to anything that uses temporary tables or filesorts, or is just a damn slow query.

In a nutshell, you need indexes to avoid full table scans. The traditional way is to run EXPLAIN manually on queries coming out of your dev log; query_reviewer lets you see it all right there in the left corner of your web browser. It's brilliant. You also need to eager load the associations that you will use in your views by passing :include to your ActiveRecord find calls, so that you batch up SQL queries instead of destroying your DB server with 100 queries per dynamic page.
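
For concreteness, here's the kind of change query_reviewer usually pushes you toward (the Post/Comment models and columns are hypothetical):

# N+1: one query for the posts, then one more per post for its author and comments
posts = Post.find(:all, :limit => 20)

# Eager-loaded: ActiveRecord batches the association queries up front
posts = Post.find(:all, :limit => 20, :include => [:author, :comments])

# And the index that keeps the comment lookup from being a full table scan
# (in a migration):
add_index :comments, :post_id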

New Relic is new for us, but it helps us see what's really happening on our production site. If your site is on fire, it's a freaking beautiful gift from the heavens above. You'll see exactly which controllers are slow, which servers in your cluster are struggling, what the load is on all your machines, and which queries are slow.

Memcache later
If you memcache first, you will never feel the pain and never learn how bad your database indexes and Rails queries are. What happens when scale gets so big that your memcache setup is dying? Oh, right, you're even more screwed than you would have been if you had gotten your DB right in the first place. Also, if this is your first time scaling Rails / a DB-driven site, there's only one way to learn how, and putting it off 'til later probably isn't the way. Memcache is like a band-aid on a bullet hole -- you're gonna die.



You're only as fast as your slowest query.
If you're using nginx or Apache as a load balancer in front of a pack of Mongrels (or Thins or whatever else is cool/new/hip), then each of those Mongrels acts like a queue. The upshot is that if you EVER have a request that takes a long time to finish, you're in a world of hurt. Say you have 4 Mongrels, and Request A comes in to port 8000 and takes 10 seconds. The load balancer is naive and keeps passing requests to port 8000 even though that Mongrel is busy. (Note: This might help, but we don't use it)

Then what happens? Sad town happens. One in four requests after Request A will go to port 8000, and all of them will wait in line while that Mongrel chugs away at the slow request. The effective wait time on a quarter of your requests in that 10-second window can be as long as 10 seconds, even if they would normally take only 50msec!

Enter the wonderful mongrel_proctitle. Now you can see exactly what is blocking your Mongrels. I keep this on a watch in a terminal at all times; it's what I look at immediately if our constant uptime tests tell us something's wrong. Super useful.

The answer: a) run some Mongrels dedicated to slow-running jobs (meh), b) run Phusion Passenger, or c) run slow stuff offline... which leads us to...

Offline Job Queues
So you gotta send some emails. Or maybe denormalize your DB. Or resize photos, or transcode video or audio. But how do you do it in the 200msec that you need to return a web request? You don't. You use Workling or Delayed Job or nanite. It'll happen outside of your mongrels and everyone will be happier.
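
For example, a minimal Delayed Job sketch, assuming the delayed_job plugin is installed and its workers are running (the job class and mailer are made up):

# A job is just an object with a perform method; Struct keeps it terse
class NotificationJob < Struct.new(:post_id)
  def perform
    post = Post.find(post_id)
    Notifier.deliver_new_post(post)   # hypothetical ActionMailer
  end
end

# In the controller: enqueue and return immediately, instead of sending
# mail inside the 200msec you have to return the web request
Delayed::Job.enqueue NotificationJob.new(@post.id)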

I don't know why people don't talk about this more, because if you run a site that does basically anything, you need something like this. It *should* be a part of Rails, but isn't -- in the same way that SwingWorker wasn't part of the Java Swing core for ages, even though it absolutely had to be.

If you don't monitor it, it will probably go down, and you will never know.
Test your site's uptime -- not just ping, but actual real user requests that hit the DB. Sure, you could use Pingdom if you're lazy, but it seriously takes like 10 lines of Ruby code to write an automated daemon that runs, does a user action, and checks that your site is not hosed. open-uri is your friend. You don't know if you're up if you're not checking. Do not tolerate downtime.
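
Here's roughly what those 10 lines look like -- a sketch with a placeholder URL, content check, and alert command; run it from cron every minute or loop it in a daemon:

#!/usr/bin/env ruby
# Dumb uptime check: hit a page that actually touches the DB,
# and alert if it errors, hangs, or is missing expected content.
require 'open-uri'
require 'timeout'

URL = "http://example.com/some/page/that/hits/the/db"   # placeholder

begin
  page = Timeout.timeout(15) { open(URL).read }
  raise "unexpected content" unless page.include?("Recent posts")   # placeholder check
rescue Exception => e
  # Swap in whatever actually wakes you up: email, SMS, IRC bot...
  system(%Q{echo "#{URL} failed: #{e.message}" | mail -s "SITE CHECK FAILED" oncall@example.com})
end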

Also, use god for Mongrel and process monitoring. Mongrels die or go crazy, and you gotta keep them in their place. (What's funny is that god itself leaks memory over time on Ruby 1.8.6 *sigh*.) Munin, Monit, and Nagios are also great to have.
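
For reference, a bare-bones god watch for a single Mongrel looks something like this (the app path and port are placeholders; you'd have one watch per Mongrel):

# mongrel.god (sketch)
rails_root = "/var/www/myapp/current"

God.watch do |w|
  w.name     = "mongrel-8000"
  w.interval = 30.seconds
  w.start    = "mongrel_rails start -c #{rails_root} -p 8000 -P #{rails_root}/tmp/pids/mongrel.8000.pid -e production -d"
  w.stop     = "mongrel_rails stop -P #{rails_root}/tmp/pids/mongrel.8000.pid"
  w.pid_file = "#{rails_root}/tmp/pids/mongrel.8000.pid"

  # Restart the Mongrel if the process disappears
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 10.seconds
      c.running  = false
    end
  end
end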

Keep an eye on your resources -- is I/O OK? Disk space? It's the worst thing ever to have a site crash because you forgot to clean the logs or ran out of disk space. Make cron jobs for cleaning all logs and temp directories, so that you can set it and forget it. Because you will forget, until you are reminded in the worst way.

Read the source, and cut back the whining
You will learn more from reading the source and debugging / fixing bugs in plugins (and sometimes Rails itself) than from a) complaining on a mailing list or b) whining about shit on your Twitter. It's open source Ruby code -- if it's broken, there's a reason: there's a bug, or you're doing it wrong. Fix it yourself, drop it into a GitHub fork, and submit it back.


Beware old plugins
They don't work well, and they sit around on Google sucking up time and effort. acts_as_paranoid is one. They look legit, with famous names attached to them. Don't fall for it. Insist on using code that has been updated recently. Rails changes pretty fast, and plugins that don't get updated will waste your time, cause random bugs, and basically make your life crap.

GitHub is new on the scene and has totally revolutionized Rails. When in doubt, search GitHub. If it's not on GitHub, it's probably dead or unmaintained. Be wary.

Beware old anything
Actually, if this blog post is more than 6 months or a year old, you might want to go elsewhere. Rails moves fast. What's hot and "must-have" in Rails now may be a total piece of crap / barely functioning garbage later. Same with any blog posts. Be super wary of the Rails wiki -- there be dragons. I mean, there's really stuff on there that references Rails 1.2.6 or earlier!

And that's a wrap.
There's tons more, but this is a pretty decent list of things to watch out for. If you have any suggestions for things I missed, or questions, please do leave a comment below!

If you liked this article, please try posterous.com and/or follow me on Twitter at @posterous and @garrytan!

Posterous is also hiring Rails engineers for full-time positions in San Francisco. We're a small team of hardcore developers looking for like-minded folks to join up. We're well funded by top-tier VCs and angels, and we grew over 10x in 2009!

MemcacheDB looks like a great place to store denormalized subscriptions.

Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day, that's 40,000 writes. Since the most active diggers are also the most followed, this becomes a huge performance bottleneck. Two problems appear.

You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.

The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers, that's 300 million diggs a day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.

With such a heavy write load, MySQL wasn't going to work for Digg. That's where MemcacheDB comes in. In initial tests on a laptop, MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge.

Posterous has this exact problem when handling subscriptions. We currently denormalize new posts for subscription lists using after_save hooks that translate new posts into new notification records on the backend. But if we could plug this into MemcacheDB instead, needless DB writes could be avoided.
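
A rough sketch of what that could look like with the memcache-client gem -- MemcacheDB speaks the memcached protocol, so the client doesn't care -- where the host, port, key scheme, and subscriber_ids method are all assumptions:

# Hypothetical: push denormalized notification lists into MemcacheDB
# instead of writing one notification row per subscriber.
require 'memcache'

SUBSCRIPTION_STORE = MemCache.new("memcachedb-host:21201")   # placeholder host, MemcacheDB's default port

class Post < ActiveRecord::Base
  after_save :fan_out_to_subscribers

  private

  def fan_out_to_subscribers
    author.subscriber_ids.each do |subscriber_id|        # hypothetical helper
      key  = "notifications:#{subscriber_id}"
      list = SUBSCRIPTION_STORE.get(key) || []
      SUBSCRIPTION_STORE.set(key, (list << id).last(100))  # keep the newest 100 post ids
    end
  end
end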