MongoDB - Collection Per User Performance

Theory

On the MongoDB site there is a suggestion that collections can be used to cluster data and get better performance as a result.

The idea is to use a separate collection for each user's data. Internally, MongoDB allocates separate extents for each collection (an extent is a contiguous block of storage on disk). As a result, a user's data ends up in mostly sequential blocks on disk, which makes it far cheaper to read when a query has to go to disk.

Performance - Experiment

The theory is sound, so what about performance?

In this experiment I used two c1.medium (high CPU) instances on AWS - one to run the test and one to be the server.

The test script inserted 5,000,000 emails of ~1 KB each into the database, spread over 10,000 users (using a triangle distribution with a mean of 5,000 to simulate high/low volume users). It then ran 10,000 queries, each fetching 20 emails from a random slice of time for a random user. I'll add a link to the script tonight when I get access to my home laptop.
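The data generation described above could look roughly like the sketch below. It is not the actual test script: it assumes the triangle distribution is over user IDs (symmetric over 0 to 10,000, so the peak and mean are both 5,000), and the names `pick_user` and `make_email` and the document fields are hypothetical.

```python
import random
from collections import Counter

NUM_USERS = 10_000  # number of users in the experiment


def pick_user(rng: random.Random) -> int:
    """Pick a user ID from a symmetric triangular distribution.

    Assumption: users near ID 5,000 receive the most emails, users near
    0 or 9,999 receive the fewest. This simulates high/low volume users.
    """
    user = int(rng.triangular(0, NUM_USERS, NUM_USERS / 2))
    # triangular() may return the upper bound itself, so clamp it.
    return min(user, NUM_USERS - 1)


def make_email(rng: random.Random, timestamp: int) -> dict:
    """Build a hypothetical ~1 KB email document for a randomly chosen user."""
    return {"user": pick_user(rng), "time": timestamp, "body": "x" * 1000}


# Sample a small batch to see how unevenly emails spread over users.
rng = random.Random(42)
counts = Counter(pick_user(rng) for _ in range(100_000))
middle = sum(n for user, n in counts.items() if 4000 <= user < 6000)
edge = sum(n for user, n in counts.items() if user < 2000)
print(f"emails for users 4000-5999: {middle}, for users 0-1999: {edge}")
```

With this shape, the middle fifth of users receives several times more emails than the lowest fifth, which gives a mix of heavy and light users.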

There was an index on user/time on every collection.

There were three variants:

  • One collection for all users
  • Fifty collections with users spread equally among the collections
  • One collection for each user
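To make the three variants concrete, here is a minimal sketch of how a user's data might be routed to a collection in each case. The collection names, the modulo bucketing, and the helper functions are assumptions for illustration and are not taken from the test script. The functions only build names and filter documents; in practice these would be passed to a MongoDB driver.

```python
NUM_BUCKETS = 50  # matches the "fifty collections" variant


def collection_for(user_id: int, variant: str) -> str:
    """Return the (hypothetical) name of the collection holding a user's emails."""
    if variant == "single":
        return "emails"
    if variant == "bucketed":
        # Assumption: spread users evenly by taking the ID modulo 50.
        return f"emails_bucket_{user_id % NUM_BUCKETS}"
    if variant == "per_user":
        return f"emails_user_{user_id}"
    raise ValueError(f"unknown variant: {variant}")


def query_filter(user_id: int, start: int, end: int, variant: str) -> dict:
    """Build the filter for one user's emails in the time slice [start, end)."""
    time_range = {"time": {"$gte": start, "$lt": end}}
    if variant == "per_user":
        # The collection itself identifies the user, so no user field is needed.
        return time_range
    return {"user": user_id, **time_range}


for variant in ("single", "bucketed", "per_user"):
    print(variant, collection_for(123, variant), query_filter(123, 0, 10, variant))
```

In the per-user variant the user ID never appears in the documents or the index, which is where the storage savings reported below would come from.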

Performance - Results

Having a single collection for each user reduced the amount of storage needed (due to not needing to store/index the user ID):

  • 308 MB vs 476 MB for indices
  • 4044 MB vs 4101 MB for data

Query performance was as follows:

  • One collection for all users - 91.0ms / query
  • Fifty collections - 20.0ms / query
  • One collection per user - 13.2ms / query

Insert performance was:

  • One collection for all users - 5,167 inserts / sec
  • Fifty collections - 4,350 inserts / sec
  • One collection per user - 1,645 inserts / sec

Conclusions

Having a single collection per user was ~7x faster for reads but ~3x slower for writes.

Using fifty collections seemed to give a decent balance: ~4.5x faster for reads and only ~16% slower for writes.
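The ratios quoted in these conclusions follow directly from the numbers in the results section. The short calculation below reproduces them; the variable and function names are just for illustration.

```python
# Measured results from the experiment above.
query_ms = {"single": 91.0, "fifty": 20.0, "per_user": 13.2}
inserts_per_sec = {"single": 5167, "fifty": 4350, "per_user": 1645}


def read_speedup(variant: str) -> float:
    """How many times faster queries are than with one shared collection."""
    return query_ms["single"] / query_ms[variant]


def write_slowdown(variant: str) -> float:
    """How many times slower inserts are than with one shared collection."""
    return inserts_per_sec["single"] / inserts_per_sec[variant]


def write_drop_pct(variant: str) -> float:
    """Percentage drop in insert throughput versus one shared collection."""
    return 100 * (1 - inserts_per_sec[variant] / inserts_per_sec["single"])


for variant in ("fifty", "per_user"):
    print(
        f"{variant}: reads {read_speedup(variant):.2f}x faster, "
        f"writes {write_slowdown(variant):.2f}x slower "
        f"({write_drop_pct(variant):.1f}% fewer inserts/sec)"
    )
```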

This technique will introduce more complexity into your system and reduce the flexibility of querying. However, if performance is a concern then it is a technique worth considering - but benchmark it first as your mileage may vary :)

Colin Howe

I'm Colin. I like coding, ultimate frisbee and startups. I am VP of engineering at Conversocial