I've recently been having a play around with MongoDB and it's really cool. One of the common messages I see all over is that you should only use it if your dataset fits into memory. I've not yet seen any benchmarks on what happens when it doesn't though. So, here is a benchmark on how MongoDB performs when the data is bigger than the amount of memory available.
Mongo server: An EC2 large instance (64 bit) running a Ubuntu 10.10 image from Alestic. Has 7.5gb of memory. Data folder was on the instance and not EBS.
Mongo client: An EC2 small instance.
The test will involve inserting X documents to MongoDB with the following structure:
key: n (where n is 0, 1, ... X - 1) text: 'Mary had a little lamb. ' x 100
There will be an index on key to prevent full scans of the data.
After the insert there will be 30,000 gets with random keys.
The expectation is that when the data set gets too large to fit in memory the random gets will become very slow. This will be due to MongoDB's memory mapped files no longer fitting in memory and needing to be read from disk.
When this thrashing of the disk starts happening it will be interesting to see what happens when a subset of the dataset is read from. To investigate this a further test will be run that:
- 99% of the time - reads from a random key chosen from only Y% of the keys
- 1% of the time - reads from any key chosen from the entire dataset
The expectation here is that for small Y the performance will be similar to when the entire dataset is in memory - as the pages that contain the subset of data will be in memory already and not need to read from disk.
A result spreadsheet is available here (Google Doc).
Up to 3 million documents the reads were consistent around 17s for 30,000 reads:
Keys Average time (s) Memory usage (mb) 10,000 16.8 forgot to check 100,000 16.9 547 1,000,000 18.0 1672 3,000,000 17.2 4158 10,000,000 74.1 7469 (16.1gb inc. virtual)Once the dataset got larger than the amount of memory available the read time got slow. It wasn't as slow as it could be in extreme cases as roughly half of the dataset would still have been in memory.
It's worth noting that at this point inserts started getting slow: 178s for 3 million documents vs 1,102s for 10 million documents (~17k inserts/sec vs ~9k inserts/sec).
What about when reading a subset more often?
Focus (%) Read 1 (s) Read 2 (s) Read 3 (s) 100 73.1 75.3 73.9 10 54.3 37.0 29.5 1 21.1 18.8 18.2
Focus in the above results refers to the %age of the dataset that was chosen for 99% of reads. In this case it was the first Y% of rows to be inserted - meaning that the pages were likely now out of memory by the time we wanted to read them.
The results show that MongoDB will perform just as fast on a dataset that is too large for memory if a small subset of the data is read from more frequently than the rest.
It was interesting to see the 10% figure drop over time. I suspect that this figure will get closer to 18s as the number of reads increases - more and more of the pages will be cached by the operating system and not need to be read from disk.
From doing this it can be seen that the performance of MongoDB can drop by an order of magnitude when the dataset gets too big for memory. However, if the reads are clustered in a subset of the dataset then a large amount of the data will be able to be kept in cache and reads kept quick.
It's definitely worth noting that it's normal for the performance to drop by an order of magnitude when the database has to start hitting disk. The point of this experiment was to make sure that it was only one order of magnitude and that if reads were focussed the performance would stay high.
The code for the benchmark (for improvements and your own testing) is in github: http://github.com/colinhowe/mongo-benchmarks/blob/master/bench.py