Just underscores how hard it really is to scale. Happens to the best of us. :D
Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren't suitable for low-latency data serving, but are interesting nonetheless.
Name | Language | Fault-tolerance | Persistence | Client protocol | Data model
Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text
Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append-only log) | HTTP | blob
Scalaris | Erlang | partitioned, replicated, Paxos | In-memory only | Erlang, Java, HTTP | blob
Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob
Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob
MemcacheDB | C | replication | BerkeleyDB | Memcached | blob
ThruDB | C++ | replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document-oriented
CouchDB | Erlang | replication, partitioning? | Custom on-disk | HTTP, JSON | Document-oriented (JSON)
Cassandra | Java | replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo
HBase | Java | replication, partitioning | Custom on-disk | Custom API, Thrift, REST | Bigtable
Hypertable | C++ | replication, partitioning | Custom on-disk | Thrift, other | Bigtable
Awesome review of a bunch of possible key-value stores. Tokyo Tyrant is conspicuously missing.
SELECT concat(table_schema,'.',table_name),
       concat(round(table_rows/1000000,2),'M') rows,
       concat(round(data_length/(1024*1024*1024),2),'G') DATA,
       concat(round(index_length/(1024*1024*1024),2),'G') idx,
       concat(round((data_length+index_length)/(1024*1024*1024),2),'G') total_size,
       round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES
ORDER BY data_length+index_length DESC
LIMIT 20;
Pretty useful way of seeing how big your datasets are. Thanks to Mike at Backtype for sharing this.
It's all the rage to be using non-SQL storage software these days. Memcached is great for caching, but what happens when data falls out of the cache? Enter Tokyo Tyrant / MemcacheDB.
Sometimes all you really do need is key-value pairs. Is Tokyo Tyrant the answer? Looks like it's being used under heavy load for a few other production sites. Though MemcacheDB is purportedly being used by Digg.
Interestingly, I've found very little online about MemcacheDB vs. Tokyo Tyrant. Anyone have some info to share?
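One nice property: MemcacheDB and Tokyo Tyrant both speak the memcached protocol, so trying one against the other is mostly a matter of pointing your client at a different host and port. A minimal sketch using the memcache-client gem; the host:port is just a placeholder for wherever your MemcacheDB or Tokyo Tyrant instance is listening:

require 'rubygems'
require 'memcache'  # memcache-client gem

# Placeholder host:port -- swap in your MemcacheDB or Tokyo Tyrant instance.
store = MemCache.new('localhost:21201')

store.set('user:42:last_login', Time.now.to_s)
puts store.get('user:42:last_login')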
This article explains iframe-to-iframe communication, when the iframes come from different domains.
Great article, and details a really cool hack to get cross-domain frames to work. It's one dark recess of web technology that is actually quite useful, but is rarely explained in such proper and interesting detail.
Garry Tan, cofounder of Posterous, lists 12 lessons for scaling that apply to more than just Rails.
This blog, highscalability.com, is basically the gold-standard must-read blog for startups of any kind dealing with large scale. What a great honor!
Also: mental note to blog in list format more often. Those seem to go over so incredibly well. =)
There are a bunch of basic functional elements to building out a popular Rails app that I've never really seen explained in one place, and that we had to learn the hard way while building Posterous. Here's a rundown of what we've learned, in the hopes that some Google linkjuice may bring an intrepid wanderer to this forlorn part of the woods and help them out.
S3 is awesome. Yes, you can host everything off S3. Is it a little more expensive? Probably. But if you're engineer-constrained (and if you're a startup, you absolutely are) -- set it and forget it. If you absolutely must have local storage across app servers, then MogileFS, GFS, HDFS, or even NFS (yuck) are possibilities. For alternatives to S3, Mosso is supposed to be good too.
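If you do go the S3 route, pushing uploads straight into a bucket is only a few lines with the old aws-s3 gem (attachment_fu and Paperclip can also handle this for you). A rough sketch; the bucket name and credentials here are made up:

require 'rubygems'
require 'aws/s3'  # aws-s3 gem

# Made-up credentials and bucket name -- substitute your own.
AWS::S3::Base.establish_connection!(
  :access_key_id     => 'YOUR_ACCESS_KEY',
  :secret_access_key => 'YOUR_SECRET_KEY'
)

AWS::S3::S3Object.store('avatars/42.jpg', open('42.jpg'), 'my-app-uploads',
                        :access => :public_read)

puts AWS::S3::S3Object.url_for('avatars/42.jpg', 'my-app-uploads')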
Then what happens? Sad town happens. 1 in 4 requests after Request A will go to port 8000, and all of those requests will wait in line as that mongrel chugs away at the slow request. Effective wait time on a quarter of your requests in that 10-second window may be as long as 10 seconds, even though they would normally take only 50 msec!
Enter the wonderful mongrel proctitle. Now, you can see exactly what is blocking your mongrels. I keep this on a watch in a terminal at all times. It's what I look at immediately if our constant uptime tests tell us something's wrong. Super useful.
Posterous is also hiring Rails engineers for full-time positions in San Francisco. We're a small team of hardcore developers looking for like-minded folks to join up. Well funded by top-tier VCs and angels. We grew over 10x in 2009!
Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear.
You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.
The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers, that's 300 million diggs a day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.
With such a heavy write load MySQL wasn't going to work for Digg. That's where MemcacheDB comes in. In initial tests on a laptop, MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge.
Posterous has this exact problem when handling subscriptions. We currently denormalize new posts for subscription lists using after_save hooks that translate new posts into new notification records on the backend. But if we can plug this into MemcacheDB instead, needless DB writes can be avoided.
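Roughly, the idea would look something like the sketch below, again using the memcache-client gem against a memcached-protocol store. The model, association, and key names are made up for illustration, and a real version would need an atomic append rather than this read-modify-write:

require 'rubygems'
require 'memcache'  # memcache-client gem

# Hypothetical Rails model: fan a new post out to subscriber notification
# lists in MemcacheDB / Tokyo Tyrant instead of inserting one notification
# row per subscriber. Placeholder host:port.
NOTIFICATION_STORE = MemCache.new('localhost:21201')

class Post < ActiveRecord::Base
  after_save :fan_out_to_subscribers

  private

  def fan_out_to_subscribers
    site.subscribers.each do |subscriber|
      key  = "notifications:#{subscriber.id}"
      list = NOTIFICATION_STORE.get(key) || []
      NOTIFICATION_STORE.set(key, list << id)  # not atomic -- illustration only
    end
  end
end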
libcurl3 - 7.18.2-1ubuntu4
libcurl3-gnutls - 7.18.2-1ubuntu4
libcurl4-openssl-dev - 7.18.2-1ubuntu4
Google to the rescue!
I just had to run sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev and my install of the fine Ruby curl library went through just fine.
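For the curious, the gem in question is presumably curb, the usual Ruby libcurl binding; once the libcurl dev headers above are in place it compiles cleanly, and basic usage looks something like this (the URL is just a placeholder):

require 'rubygems'
require 'curb'

# Fetch a page via libcurl; example.com is only a placeholder.
http = Curl::Easy.perform('http://www.example.com/')
puts http.response_code
puts http.body_str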