How we migrated to AWS

A few weeks ago we (Conversocial) migrated our infrastructure from a shared Solaris host to AWS. I'm going to talk about how we did our migration; why we chose AWS is a big enough topic for another blog post.

The two goals for the migration were:

  • Migrate one customer at a time instead of doing a big-bang migration. This meant that we could get a subset of our customers on the new infrastructure and make sure we were 100% happy before moving more and more customers across.
  • Make the migration as seamless as possible for our customers. We wanted all our customers to carry on logging in at the same URL and not really be aware that they were now on different servers (in different parts of the world).

Background - Our Infrastructure

Before pressing on, it's worth saying a bit about what our infrastructure looked like. It was fairly standard as far as Python/Django setups go:

  • a MySQL database holding all our data
  • some web servers running Apache to handle requests
  • some backend servers running Celery to do polling and process background tasks

In terms of size, our database wasn't massive, but some of our customers did have several gigabytes of data.

General Approach

The general approach we took for each customer was as follows:

  • Mark the customer account as currently being migrated. This locked them out of the site but freed us from having to worry about their data changing whilst we copied it around
  • Create an SQL dump of their data
  • Import the SQL dump into the new database
  • Mark the customer account as migrated in both the old and new databases
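
All of this hinged on a per-account migration flag. As a minimal sketch (the field and state names here are illustrative rather than our exact schema), the Django model side of it looks something like:

    from django.db import models

    class Account(models.Model):
        NOT_MIGRATED = 'NOT_MIGRATED'
        MIGRATING = 'MIGRATING'    # locked out while data is being copied
        MIGRATED = 'MIGRATED'      # now served from the new infrastructure

        name = models.CharField(max_length=255)
        email = models.EmailField(unique=True)   # simplified: one login e-mail per account
        migration_state = models.CharField(max_length=20, default=NOT_MIGRATED)

        @property
        def is_migrated(self):
            return self.migration_state == self.MIGRATED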

Locking them out of the site

Locking customers out of the site was done for each customer in turn. In most cases the customer got full access again within five minutes; our largest customers were locked out for up to two hours. This wasn't as bad as it sounds: because we could do one customer at a time, we were able to identify the ideal time for each customer and migrate them at their convenience, not ours.

To make things easier for the customer, the lock-down page had an auto-refresh to push them back to the site as soon as their account was migrated.
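
The lock-out itself boils down to a middleware that short-circuits requests for accounts in the middle of a migration and serves a holding page that refreshes itself. A rough sketch (the markup is illustrative, the state flag is the one from the model sketch above, and it assumes each logged-in user hangs off an Account):

    from django.http import HttpResponse

    HOLDING_PAGE = """<html>
    <head><meta http-equiv="refresh" content="30"></head>
    <body>We're moving your account to new servers. This page will refresh
    itself and let you straight back in once we're done.</body>
    </html>"""

    class MigrationLockdownMiddleware(object):
        def process_request(self, request):
            account = getattr(request.user, 'account', None)
            if account is not None and account.migration_state == 'MIGRATING':
                # 503 tells browsers and crawlers this is temporary.
                return HttpResponse(HOLDING_PAGE, status=503)
            return None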

Dumping their data

To dump the data for each customer, we created the MySQL Partial Dump tool. This tool let us describe our schema and how all our tables relate to each other; it was then incredibly easy to create a MySQL dump for each customer. An added bonus is that this has given us an easy way to get data for testing environments (data cleansing is supported in the partial dump tool).
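
The tool is worth its own write-up, but the shape of each per-customer dump/import cycle is the same as you would get from stock mysqldump with a row filter. Something like the following, where the table, column, database and host names are made up for illustration and the schema is assumed to already exist on the new database:

    mysqldump --single-transaction --no-create-info \
        --where="account_id = 42" \
        conversocial_legacy users conversations messages > account_42.sql

    mysql -h new-db.internal conversocial_new < account_42.sql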

Handling Logins

Handling logins was probably the trickiest part of the migration. Whilst both infrastructures were live we wanted customers to be able to log in at a single URL and be taken to the appropriate infrastructure without noticing anything different.

To handle this we created an additional subdomain: app.conversocial.com (changing to this was something we wanted to do anyway). Our existing infrastructure was hosting www.conversocial.com, and we wanted customers to continue logging in there. To do this we altered our code in several ways:

  • For logins, the e-mail was first checked to see if it existed in the legacy infrastructure
    • If the e-mail was found and the account was not migrated, then the e-mail/password was checked as normal
    • If the e-mail did not exist or the account was migrated, then our old servers made a request to the new ones, passing through an encrypted token containing the e-mail/password. If a match was found, a one-time token was created and passed back to the user for use in a redirect response that pushed them on to app.conversocial.com (there's a rough sketch of this handoff after the list)
  • The forgotten password form followed a similar system
  • All new sign-ups were directed to the new infrastructure
  • The background tasks/pollers would only handle accounts that were on the same infrastructure as themselves
  • All page requests went through a Django middleware that checked if the account had become migrated. If so, the user was redirected to the same URL but on app.conversocial.com. This handled someone being migrated between requests while they were using the site (see the middleware sketch after the list)
  • Likewise, all AJAX requests went through a Django middleware. The difference here was that the response had to trigger some Javascript to do the redirection instead of relying on HTTP redirects
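
To give a flavour of the handoff, here's a rough sketch of the legacy-side login view. This isn't the real code: the internal URL, the lookup by e-mail, and the use of Django's signing framework (standing in for the encrypted token) are all stand-ins, and the new infrastructure's check-login endpoint is assumed to answer with a one-time token.

    import requests
    from django.contrib.auth import authenticate, login
    from django.core import signing
    from django.shortcuts import redirect

    from accounts.models import Account   # the model sketched earlier (path is hypothetical)

    NEW_LOGIN_CHECK = 'https://app.conversocial.com/internal/check-login/'

    def legacy_login(request):
        email = request.POST['email']
        password = request.POST['password']

        account = Account.objects.filter(email=email).first()
        if account is not None and not account.is_migrated:
            # Account still lives on the legacy infrastructure: log in as normal.
            user = authenticate(username=email, password=password)
            if user is not None:
                login(request, user)
                return redirect('/')
            return redirect('/login/?failed=1')

        # Unknown e-mail or already migrated: ask the new infrastructure to
        # check the credentials, wrapped in a signed token so they never
        # travel in the clear between the two systems.
        token = signing.dumps({'email': email, 'password': password},
                              salt='login-handoff')
        response = requests.post(NEW_LOGIN_CHECK, data={'token': token})
        one_time_token = response.json().get('one_time_token')
        if one_time_token:
            # The one-time token lets app.conversocial.com complete the login
            # when the browser arrives there.
            return redirect('https://app.conversocial.com/login/handoff/?t='
                            + one_time_token)
        return redirect('/login/?failed=1')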
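
And a minimal sketch of the redirect middleware from the last two bullets, written against Django's old-style middleware API and again assuming each logged-in user hangs off an Account as in the earlier model sketch:

    import json

    from django.http import HttpResponse, HttpResponseRedirect

    NEW_HOST = 'https://app.conversocial.com'

    class MigratedAccountRedirectMiddleware(object):
        def process_request(self, request):
            account = getattr(request.user, 'account', None)
            if account is None or not account.is_migrated:
                return None   # account still lives here, carry on as normal

            target = NEW_HOST + request.get_full_path()
            if request.is_ajax():
                # A cross-domain HTTP redirect is no use to an XHR, so return
                # a payload that the front-end Javascript turns into
                # window.location = target.
                return HttpResponse(json.dumps({'redirect': target}),
                                    content_type='application/json')
            # Regular page request: a plain HTTP redirect does the job.
            return HttpResponseRedirect(target)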

Putting it all together

All of this was put together using fabric so that we could do:

fab migrate:<account IDs>

Using fabric made it simple to handle connections to several servers for moving data around.
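
Roughly, the fab task looked something like the sketch below (Fabric 1.x style). The host names, database names and raw SQL are placeholders, and the real dump step wrapped the partial dump tool rather than plain mysqldump:

    from fabric.api import env, local, put, run

    env.hosts = ['new-web.internal']       # hypothetical host on the AWS side

    LEGACY_DB = 'conversocial_legacy'      # hypothetical database names
    NEW_DB = 'conversocial_new'

    def _set_state(account_id, state, db, runner):
        # Flip the per-account migration flag with a raw SQL update.
        runner('mysql %s -e "UPDATE account SET migration_state=\'%s\' '
               'WHERE id=%s"' % (db, state, account_id))

    def migrate(*account_ids):
        for account_id in account_ids:
            # 1. Lock the customer out of the legacy site.
            _set_state(account_id, 'MIGRATING', LEGACY_DB, local)

            # 2. Dump just this customer's rows (the partial dump tool in
            #    real life; plain mysqldump shown for illustration).
            dump_file = 'account_%s.sql' % account_id
            local('mysqldump --single-transaction --no-create-info '
                  '--where="account_id = %s" %s users conversations messages '
                  '> %s' % (account_id, LEGACY_DB, dump_file))

            # 3. Copy the dump to AWS and load it into the new database.
            put(dump_file, '/tmp/' + dump_file)
            run('mysql %s < /tmp/%s' % (NEW_DB, dump_file))

            # 4. Mark the account as migrated on both sides so logins start
            #    going to the new infrastructure.
            _set_state(account_id, 'MIGRATED', LEGACY_DB, local)
            _set_state(account_id, 'MIGRATED', NEW_DB, run)

Running fab migrate:42,43 would then move two accounts in one go.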

Doing a test-run

The entire migration went through two test-runs to ensure it would work when we did it for real. The test-runs also let us iron out a few kinks.

There were two differences with the test-run:

  • The migration flags weren't set for accounts as we migrated them. We didn't want customers suddenly using our new infrastructure before it was ready.
  • The data was cleansed of all e-mails and sensitive data before copying it. This prevented our new infrastructure from starting to send e-mails to customers (the old infrastructure would also be sending them e-mails).

Apart from that, it was all the same.

Conclusion

  • None of our customers noticed any unplanned down-time
  • The vast majority of our customers didn't even notice the migration at all
  • We didn't work crazy hours
  • Nothing went wrong that freaked us out
  • Small things did go wrong, but the slow pace of the migration meant that the problems were small and isolated rather than catastrophic

Overall, we were very happy with the migration :)

