MongoDB - Strategies when hitting disk

I gave a lightning talk on this at the London MongoDB User Group and thought I'd write it up here

MongoDB sucks when it hits disk (ignoring SSDs). The general advice is to never hit disk. But what if you have to? Conversocial's new metrics infrastructure will allow people to see statistics for their Facebook and Twitter channels going back indefinitely. In general, the data being queried and updated will be within the past month, and we can keep that in memory. But we also want to let people query data going back further than that, which means hitting disk.

We found three good strategies for making hitting the disk less painful:

1. Use Single Big Documents

The naive implementation of our metrics system stored documents like this:

{ metric: "content_count", client: 5, value: 51, date: ISODate("2012-04-01 13:00") }
{ metric: "content_count", client: 5, value: 49, date: ISODate("2012-04-02 13:00") }

An alternative implementation is:

{ metric: "content_count", client: 5, month: "2012-04", 1: 51, 2: 49, ... }

In this case we have a single document that spans an entire month with the value for each day being a field inside the document.
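Updates fit this layout too: each day is just a numbered field, so recording a day's value is a single in-place `$inc`. A small sketch of building that update (the `dailyInc` helper is our name for illustration, not from the original system):

```javascript
// Build a $inc update document targeting one day's field inside the
// per-month document. Day fields are numbered, matching the schema above.
function dailyInc(day, value) {
  const inc = {};
  inc[String(day)] = value;
  return { $inc: inc };
}

// Usage against the collection above (requires a mongo shell or driver):
// db.metrics.update(
//   { metric: "content_count", client: 5, month: "2012-04" },
//   dailyInc(2, 49),
//   { upsert: true }
// );
console.log(JSON.stringify(dailyInc(2, 49))); // {"$inc":{"2":49}}
```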

For a simple test we filled our database with ~7GB of data on an Amazon c1.medium instance (1.7GB RAM), then measured how long it took to read the data for an entire year, averaged over multiple runs:

  • Naive implementation: 1.6s for a single year
  • Single document per month: 0.3s

That's a huge difference. The reasoning behind it is fairly simple:

  • The naive implementation has a worst case where all 365 documents must be read from disk, and each read is a random seek
  • With a single document per month, the worst case is reading just 12 documents from disk
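Simple seek arithmetic shows why the worst cases differ so much. A back-of-envelope sketch, assuming ~5ms per random seek (an assumed figure for a spinning disk, not something we measured):

```javascript
// Worst-case read cost is dominated by random seeks on spinning disks.
const seekMs = 5;                 // assumed per-seek cost, not measured
const naiveMs = 365 * seekMs;     // one document, hence one seek, per day
const monthlyMs = 12 * seekMs;    // one document per month
console.log(naiveMs, monthlyMs);  // 1825 60
```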

An added benefit of this strategy is that there is less overhead per day which means the working set can contain much more data.

Foursquare do this.

2. Unusual Indices

Sometimes it pays to experiment with unusual index layouts. The naive index for our metrics system is on metric, client and then date:

db.metrics.ensureIndex({ metric: 1, client: 1, date: 1 })

A common tip with indexing is to have all new values go to one side of the index. We reasoned that although the date was at the end of our index, we would still be appending to the right-hand side of many sub-trees within the index, so performance should be OK. We were wrong. We compared the performance of the above index with a new one:

db.metrics.ensureIndex({ date: 1, metric: 1, client: 1 })

  • The naive index managed 10k inserts/sec, but after 20 million inserts performance dropped to 2.5k inserts/sec and occasionally stalled with heavy disk IO. Ouch
  • With date at the start of the index, performance stayed constant at 10k inserts/sec

What about queries? By putting the date at the front of the index we realised we'd now have to query an entire year of data using an $in query:

db.metrics.find({
    metric: 'content_count', client: 1, date: { $in: [ "2012-01", "2012-02", ... ] }
})

Testing the read performance of this query showed no noticeable impact.
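Building that list of month keys for the `$in` clause is straightforward. A sketch (the `monthKeys` helper is our name for illustration):

```javascript
// Build the "YYYY-MM" keys for one year, for use in the $in clause above.
function monthKeys(year) {
  const keys = [];
  for (let m = 1; m <= 12; m++) {
    keys.push(year + "-" + String(m).padStart(2, "0"));
  }
  return keys;
}
console.log(monthKeys(2012)[0]); // 2012-01
```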

The reasoning is that the naive index causes a lot of rebalancing of the B-tree backing the index. By switching the index around we ensured that all inserts went to one side of the index, making rebalancing a trivial operation.

3. Pre-Allocate for Locality

For most disks (not SSDs), sequential read performance is vastly better than random read performance. This means we can read our metrics really fast from disk if we read them all from the same part of the disk. With MongoDB, documents reside on disk in the order they were written, unless they are resized and need to be moved around.

If we pre-allocate zero filled documents then we can force values for nearby months for the same metric to be stored on disk in the same location and then exploit the speed of sequential reads:

db.metrics.insert([
    { metric: 'content_count', client: 3, date: '2012-01', 0: 0, 1: 0, 2: 0, ... },
    { .................................., date: '2012-02', ... },
    { .................................., date: '2012-03', ... },
    { .................................., date: '2012-04', ... },
    { .................................., date: '2012-05', ... },
    { .................................., date: '2012-06', ... },
    { .................................., date: '2012-07', ... },
    { .................................., date: '2012-08', ... },
    { .................................., date: '2012-09', ... },
    { .................................., date: '2012-10', ... },
    { .................................., date: '2012-11', ... },
    { .................................., date: '2012-12', ... }
])
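The zero-filled documents above can be generated with a small helper. A sketch (the `preallocateYear` helper is our name, and it assumes day fields numbered from 1 up to the number of days in the month):

```javascript
// Build twelve zero-filled month documents for one metric/client so
// they can be inserted together and sit adjacently on disk.
function preallocateYear(metric, client, year) {
  const docs = [];
  for (let m = 1; m <= 12; m++) {
    const doc = {
      metric: metric,
      client: client,
      date: year + "-" + String(m).padStart(2, "0")
    };
    const days = new Date(year, m, 0).getDate(); // last day of month m
    for (let d = 1; d <= days; d++) doc[d] = 0;  // one zeroed field per day
    docs.push(doc);
  }
  return docs;
}

// Usage (requires a mongo shell or driver):
// db.metrics.insert(preallocateYear('content_count', 3, 2012));
```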

Now, when client 3 wants their values for 'content_count' for the past year we can serve it using one big sequential read.

And the benchmarks?

  • Reading an entire year without pre-allocation: 62ms
  • Reading an entire year with pre-allocation: 6.6ms

Despite the performance gains from this we decided not to do this. Pre-allocation can get expensive for sparse data: you end up wasting a lot of space storing zeros that are never changed.

Conclusions

MongoDB can be made to have decent disk performance. You've just got to do some of the work yourself to ensure that reads aren't too expensive.


Colin Howe

I'm Colin. I like coding, ultimate frisbee and startups. I am VP of Engineering at Conversocial.