Considerations when Sharding

While this post is about our use of MongoDB, the considerations apply to any sharding discussion.

We currently use MongoDB at Conversocial for our main content store. We're now starting to think about how to shard, as the main store is getting pretty large (150 million documents across 300GB).

The easy answer for us is to use MongoDB's built-in sharding. MongoDB divides the documents into equally sized chunks and distributes these evenly around all the shards, so the documents end up spread in a fairly even manner.

The difficult question then is: what do we shard on? Some combination of customer account and timestamp would ensure that customer accounts are spread roughly equally around all our boxes and give us both read and write scaling.
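As a rough sketch of what that looks like (the database, collection and field names here are invented for illustration, not our actual schema), declaring a compound shard key via pymongo would be something like:

```python
from pymongo import MongoClient

# Connect through a mongos router (hypothetical hostname).
client = MongoClient("mongodb://our-mongos-router:27017")

# Enable sharding for the database, then shard the content collection
# on a compound key of customer account plus timestamp.
client.admin.command("enableSharding", "content_db")
client.admin.command(
    "shardCollection",
    "content_db.content",
    key={"account_id": 1, "created_at": 1},
)
```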

Wonderful!

Except, if a shard goes down (assuming worst case scenarios), then ALL customers are affected to some extent. Explaining the temporary disappearance of some of their data to them would be quite difficult.

Instead we could limit the shard key to just the customer account. This changes our failure scenario: a subset of customers would temporarily lose access to ALL their data, but the number of customers affected would be reduced. This scenario is also far simpler to explain to customers. The downside is that individual customers get no benefit from their reads and writes being distributed across more servers and potentially running faster as a result.
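The equivalent sketch for this option (again with hypothetical names) simply drops the timestamp from the key, so each account's documents stay together on one shard:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://our-mongos-router:27017")

# Shard on the customer account alone: all of an account's documents
# live on a single shard, so a shard failure only affects the accounts
# that happen to live on it.
client.admin.command("enableSharding", "content_db")
client.admin.command(
    "shardCollection",
    "content_db.content",
    key={"account_id": 1},
)
```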

This sounds good: all our customers' data will be evenly distributed (in terms of storage) across our shards, and a failure will only affect a subset of customers. Unfortunately, not all customers are created equal. Some customers have a lot of data but don't do much with it. Some customers have a lot of data AND do a lot with it. We can easily imagine one server with 100GB of data handling far fewer queries than another server with 100GB of data, simply because the latter has more demanding customers placed on it. In theory this shouldn't happen, since customers are distributed randomly and the heavy hitters should spread out fairly evenly. In practice, we must consider worst case scenarios.

To top this off, we also have customers with more demanding performance requirements (and, increasingly, regulatory requirements around where their data is stored and whether it is encrypted).

Simply sharding based on storage won't deal with this.

Instead, we're thinking of moving the sharding into our application and manually sharding based on how demanding a customer's usage is. The advantage is that we can then place a customer's data in different regions if needed (an added boon: we can place data closer to where the customer is and give them better performance). The disadvantage is that we have to do all the sharding work ourselves.
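A minimal sketch of what that application-level routing might look like, assuming a lookup from account to cluster (all the names, URIs and the mapping itself are made up for illustration; in practice the mapping would live in a small, highly available store and be assigned based on load, region and regulatory needs):

```python
from pymongo import MongoClient

# Hypothetical mapping from customer account to the cluster holding its data.
ACCOUNT_TO_CLUSTER = {
    "acme": "mongodb://eu-cluster.example.com:27017",
    "globex": "mongodb://us-cluster.example.com:27017",
}
DEFAULT_CLUSTER = "mongodb://us-cluster.example.com:27017"

_clients = {}

def collection_for(account_id):
    """Return the content collection on whichever cluster holds this account."""
    uri = ACCOUNT_TO_CLUSTER.get(account_id, DEFAULT_CLUSTER)
    if uri not in _clients:
        _clients[uri] = MongoClient(uri)
    return _clients[uri]["content_db"]["content"]

# All reads and writes for an account go through the router.
collection_for("acme").insert_one({"account_id": "acme", "text": "hello"})
```

The application-level router keeps the failure and placement story per customer, at the cost of us owning chunk balancing, migrations and rebalancing ourselves.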

We still haven't made our decision. I put this up here simply as a reminder to others that when deciding how to shard you must consider your business needs as well as the bigger read/write volumes you're chasing.

