Scaling Facebook vs Scaling Digg: It's a question of disk vs RAM

Facebook takes a Pull on Demand approach. To recreate a page or a display fragment they run the complete query. To find out if one of your friends has added a new favorite band, Facebook actually queries all your friends to find what's new. They can get away with this because of their awesome infrastructure.

But if you've ever wondered why Facebook has a 5,000 user limit on the number of friends, this is why. At a certain point it's hard to make Pull on Demand scale.
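To make the cost of Pull on Demand concrete, here is a minimal sketch in Python. It is not Facebook's code: the `User` class, the `pull_feed` function, and the in-memory `users` dict are all hypothetical stand-ins for a real data store. The point is only that read-time work grows with the number of friends.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    friends: list = field(default_factory=list)  # names of this user's friends
    events: list = field(default_factory=list)   # (timestamp, text) pairs this user produced

def pull_feed(users, viewer, since):
    """Build the viewer's feed at read time by querying every friend."""
    feed = []
    for friend_name in users[viewer].friends:      # one lookup per friend, on every page view
        for ts, text in users[friend_name].events:
            if ts > since:
                feed.append((ts, friend_name, text))
    return sorted(feed, reverse=True)              # newest first

# Hypothetical sample data, for illustration only.
users = {
    "ann": User("ann", friends=["bob", "cat"]),
    "bob": User("bob", events=[(1, "likes Band A"), (5, "likes Band B")]),
    "cat": User("cat", events=[(3, "likes Band C")]),
}
print(pull_feed(users, "ann", since=2))
```

Every call walks the whole friend list, so a cap on friends is also a cap on the work a single page view can trigger.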

Another approach to find out what's new is the Push on Change model. In this model when a user makes a change it is pushed out to all the relevant users and the changes (in some form) are stored with each user. So when a user wants to view their updates all they need to access is their own account data. There's no need to poll all their friends for changes.
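A rough sketch of Push on Change, again in Python with made-up names (`followers`, `inbox`, `follow`, `publish`, `read_feed`) and plain dicts standing in for whatever storage a real site would use. The work moves to the write path, and a read touches only the viewer's own data.

```python
from collections import defaultdict

followers = defaultdict(set)   # author -> users who should see that author's changes
inbox = defaultdict(list)      # user -> updates already delivered to that user

def follow(user, author):
    followers[author].add(user)

def publish(author, ts, text):
    """Write path: copy the change into every follower's stored updates."""
    for user in followers[author]:
        inbox[user].append((ts, author, text))

def read_feed(user):
    """Read path: look only at the viewer's own data, no polling of friends."""
    return sorted(inbox[user], reverse=True)   # newest first

# Hypothetical sample data, for illustration only.
follow("ann", "bob")
follow("ann", "cat")
publish("bob", 1, "likes Band A")
publish("cat", 3, "likes Band C")
print(read_feed("ann"))
```

The trade is storage for read speed: each change is stored once per follower, but viewing updates no longer depends on how many friends the viewer has.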

Really interesting article at High Scalability on ways to approach scaling your data store.

We use push on change as well, particularly for your reading list subscriptions. To be honest, it's cheaper. You use disk space to pre-compute things that would be expensive to ask the database repeatedly. It allows you to just add disk -- even though disk is orders of magnitude slower than RAM.

The Facebook approach is *really hard* to get right. It's costly because so much info just has to live in RAM, and could be one reason why it's much harder for Facebook to reach profitability than most other sites. If you have to add RAM to keep all user data in cache, that's a lot of hardware to keep going.

But when it comes to realtime, Facebook is as realtime as they come. They are some real engineering badasses.

4 responses
Really interesting info. Never knew that.
"You use disk space to pre-compute things that would be expensive to ask the database repeatedly."

Facebook may have some badass engineering going on, but there's some simple genius in your way too.

So in "Push on Change", you have to:

* delete the reading items from their reading list section when they unsubscribe,
* then add them back if they re-subscribe?
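For what it's worth, here is one way that bookkeeping could look. This is only a guess at the shape of the problem, not how the site actually does it; every name (`posts`, `subscribers`, `reading_list`, and the three functions) is invented for the sketch.

```python
from collections import defaultdict

posts = defaultdict(list)          # author -> that author's own items
subscribers = defaultdict(set)     # author -> readers subscribed to the author
reading_list = defaultdict(list)   # reader -> pre-computed (author, item) pairs

def publish(author, item):
    posts[author].append(item)
    for reader in subscribers[author]:             # push the new item to each subscriber
        reading_list[reader].append((author, item))

def subscribe(reader, author):
    if reader in subscribers[author]:
        return                                     # already subscribed, avoid duplicates
    subscribers[author].add(reader)
    # Backfill: copy the author's existing items into the reader's list.
    reading_list[reader].extend((author, item) for item in posts[author])

def unsubscribe(reader, author):
    subscribers[author].discard(reader)
    # Remove that author's items from the reader's pre-computed list.
    reading_list[reader] = [(a, i) for a, i in reading_list[reader] if a != author]

# Hypothetical walk-through: subscribe, unsubscribe, then re-subscribe.
subscribe("ann", "bob")
publish("bob", "post 1")
unsubscribe("ann", "bob")
publish("bob", "post 2")
subscribe("ann", "bob")
print(reading_list["ann"])
```

In this sketch the re-subscribe is a backfill from the author's own items, so nothing published while unsubscribed is lost. A real system might instead hide items at read time rather than delete and re-add them.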

This seems like the popular option in alternative dbs like CouchDB and Redis. (Both have some options when saving the data from memory to disk to prevent slowdown from constant disk writing.)
