Solr, Solango and being IO bound

We just hit a problem where the indexing performance of our Solr instance dropped massively while re-indexing the entire database. It started out at around 100 docs/second, but after an hour or so it had dropped to 10 docs/second and was still falling.

After looking at iostat I discovered that Solr was IO bound. Specifically, the disk was maxing out on writes at 20 MB/s (this is on ephemeral storage on a large EC2 instance). However, our total data size at this point was only 350 MB, which meant we were rewriting our indexes over and over.
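To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The 20 MB/s and 350 MB figures are the ones above; the ten-minute window is purely an assumption for illustration (I didn't record how long the disk stayed pegged), and `rewrite_factor` is just a name made up for this example.

```python
def rewrite_factor(write_rate_mb_s, seconds, index_size_mb):
    """How many times over the whole index could have been written
    at the observed disk write rate during the given window."""
    return write_rate_mb_s * seconds / index_size_mb


# 20 MB/s and 350 MB are the observed figures; 600 seconds is an
# assumed window, picked only to illustrate the scale of the problem.
factor = rewrite_factor(write_rate_mb_s=20, seconds=600, index_size_mb=350)
print(round(factor, 1))  # 34.3
```

So even ten minutes of writing flat out is enough to write the whole index out more than thirty times, which is a strong hint that something is rewriting it repeatedly.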

We're using Solango (a Python library) to talk to Solr, and we've set it up to commit in batches of 500. However, by default Solango invokes the optimize command after each batch instead of a plain commit.

Optimize is a heavy-weight command that will remove deleted records from indices, compact indices and generally do all sorts of good stuff. Good stuff that smashes the disk and isn't so necessary during every little step of a re-index.

So, after changing that to do a commit instead of an optimize we're now indexing roughly 150 docs/second and maintaining that pace :)
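For anyone who wants to see the shape of the change, here is a minimal sketch of the batching pattern. This is not Solango's code or API: `add_message`, `reindex` and the `post` callable are hypothetical names for illustration, and `post` is assumed to be something that sends an XML message to Solr's update handler. The `<add>`, `<commit/>` and `<optimize/>` messages are Solr's standard XML update commands. The single optimize at the end is optional and is my assumption about a sensible place for it.

```python
from xml.sax.saxutils import escape, quoteattr


def add_message(docs):
    """Build a Solr XML <add> message from a list of dicts
    mapping field names to values."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def reindex(docs, post, batch_size=500):
    """Send docs in batches, issuing a cheap commit after each batch
    and a single optimize once everything has been sent.

    `post` is a hypothetical callable that delivers one XML message
    to Solr's update handler."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            post(add_message(batch))
            post("<commit/>")  # commit only, no optimize per batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        post(add_message(batch))
        post("<commit/>")
    post("<optimize/>")  # one heavy-weight optimize at the very end


# Usage: collect the messages in a list instead of talking to a real server.
sent = []
reindex([{"id": i} for i in range(1200)], sent.append, batch_size=500)
```

With 1200 docs and a batch size of 500, that sends three add/commit pairs followed by one optimize, rather than three optimizes.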

Colin Howe

I'm Colin. I like coding, ultimate frisbee and startups. I am VP of engineering at Conversocial.