Why are we partitioning our dataset, and how does it help us scale our application?
It is difficult, if not impossible, to massively scale your data layer when all of the data resides on a single server. Whether the limiting factor is hardware cost or you have already equipped the server with the highest-performing hardware available, you eventually hit a wall: vertical scaling has inherent limits. If we instead duplicate the dataset's schema onto multiple servers (shards) and partition the original server's data into roughly equal portions distributed among them, we can parallelize our query load across the shards. Adding more shards to the existing set then gives us near-limitless scaling potential. The theory behind how this works is simple, but the execution is a fair bit more complex, thanks to a series of scale-specific issues one encounters along the way.
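To make the idea concrete, here is a minimal sketch of one common partitioning scheme, hash-based shard routing, in Python. The shard host names, the `shard_for` helper, and the example keys are all hypothetical names for illustration; a production system would layer connection pooling, replication, and rebalancing on top of something like this.

```python
import hashlib

# Hypothetical shard hosts; in practice each would be a database server
# holding one partition of the original dataset.
SHARDS = [
    "db-shard-0.example.com",
    "db-shard-1.example.com",
    "db-shard-2.example.com",
    "db-shard-3.example.com",
]

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its partition key.

    Hashing spreads keys roughly evenly, so each shard ends up holding
    about 1/N of the data and serving about 1/N of the query load.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

if __name__ == "__main__":
    for user_id in ("user:1001", "user:1002", "user:1003"):
        print(user_id, "->", shard_for(user_id))
```

This sketch also hints at one of those scale-specific execution problems: with simple modulo hashing, adding a fifth shard changes the routing for most existing keys, forcing a large data migration. Techniques such as consistent hashing or directory-based lookup exist precisely to soften that rebalancing cost.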