The MongoDB site suggests that collections can be used to cluster data, and that doing so improves performance.
The idea is to use a separate collection for each user's data. Internally, MongoDB allocates separate extents for each collection (an extent is a contiguous block of space in a data file). As a result, a user's data ends up in mostly sequential blocks on disk, which makes it far cheaper to read when a query has to go to disk.
Performance - Experiment
The theory is sound, so what about performance?
In this experiment I used two c1.medium (high CPU) instances on AWS - one to run the test and one to be the server.
The test script inserted 5,000,000 emails of ~1 KB each into the database, spread over 10,000 users (using a triangle distribution with a mean of 5,000 to simulate high/low volume users). It then ran 10,000 queries, each fetching 20 emails from a random slice of time for a random user. I'll add a link to the script tonight when I get access to my home laptop.
There was an index on user/time on every collection.
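The workload described above could be generated along these lines. This is only a rough sketch, not the original test script: the field names (`user`, `time`, `body`), the helper names, and the choice of a uniformly random user for queries are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta

NUM_USERS = 10_000  # from the experiment description


def pick_user(rng):
    """Pick a user ID from a symmetric triangular distribution (mean ~5,000)."""
    # random.triangular(low, high, mode); clamp so the result is a valid ID.
    return min(int(rng.triangular(0, NUM_USERS, NUM_USERS / 2)), NUM_USERS - 1)


def make_email(rng, start, span_seconds):
    """Build one ~1 KB email document (hypothetical field names)."""
    return {
        "user": pick_user(rng),
        "time": start + timedelta(seconds=rng.uniform(0, span_seconds)),
        "body": "x" * 1024,
    }


def make_query(rng, start, span_seconds):
    """Return (filter, limit): 20 emails for a random user from a random time onward."""
    since = start + timedelta(seconds=rng.uniform(0, span_seconds))
    query_filter = {"user": rng.randrange(NUM_USERS), "time": {"$gte": since}}
    return query_filter, 20
```

The filter dictionary uses MongoDB's `$gte` operator, so it could be handed to a driver's find call together with the limit; the driver calls themselves are left out so the sketch runs without a server.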
There were three variants:
- One collection for all users
- Fifty collections with users spread evenly amongst the collections
- One collection for each user
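One way to picture the three variants is as a function that maps a user ID to a collection name. The sketch below is an assumption about how that routing might look: the collection names, the variant labels, and the modulo bucketing for the fifty-collection case are hypothetical, not taken from the test script.

```python
def collection_for(user_id: int, variant: str, buckets: int = 50) -> str:
    """Return the (hypothetical) collection name holding a user's emails."""
    if variant == "single":
        # One collection for all users.
        return "emails"
    if variant == "bucketed":
        # Fifty collections; modulo spreads sequential user IDs evenly.
        return f"emails_{user_id % buckets}"
    if variant == "per_user":
        # One collection for each user.
        return f"emails_user_{user_id}"
    raise ValueError(f"unknown variant: {variant!r}")
```

In the first two variants the documents still need a user field so that queries can filter on it; in the per-user variant the collection name alone identifies the user.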
Performance - Results
Having a single collection for each user reduced the amount of storage needed (due to not needing to store/index the user ID):
- 308 MB vs 476 MB for indices
- 4044 MB vs 4101 MB for data
Query performance was as follows:
- One collection for all users - 91.0ms / query
- Fifty collections - 20.0ms / query
- One collection per user - 13.2ms / query
Insert performance was:
- One collection for all users - 5,167 inserts / sec
- Fifty collections - 4,350 inserts / sec
- One collection per user - 1,645 inserts / sec
Having a single collection per user was ~7x faster for reads but ~3x slower for writes.
Using fifty collections seemed to give a decent balance: ~4.5x faster for reads and only ~16% slower for writes.
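The ratios can be rechecked directly from the measurements listed above. The snippet below is just that arithmetic; the dictionary keys and function names are made up for illustration.

```python
# Measurements copied from the results above.
query_ms = {"single": 91.0, "fifty": 20.0, "per_user": 13.2}
inserts_per_sec = {"single": 5167, "fifty": 4350, "per_user": 1645}


def read_speedup(variant):
    """How many times faster queries are than with a single collection."""
    return query_ms["single"] / query_ms[variant]


def write_slowdown(variant):
    """How many times slower inserts are than with a single collection."""
    return inserts_per_sec["single"] / inserts_per_sec[variant]


def write_throughput_drop(variant):
    """Fraction of insert throughput lost relative to a single collection."""
    return 1 - inserts_per_sec[variant] / inserts_per_sec["single"]


for v in ("fifty", "per_user"):
    print(
        f"{v}: reads {read_speedup(v):.2f}x faster, "
        f"writes {write_slowdown(v):.2f}x slower "
        f"({write_throughput_drop(v):.1%} less throughput)"
    )
```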
This technique will introduce more complexity into your system and reduce the flexibility of querying. However, if performance is a concern then it is a technique worth considering - but benchmark it first as your mileage may vary :)