Whenever you do a schema change in MySQL you run the risk of locking tables for a long time and making the database unavailable to anything that wants to read from the tables being altered. We (Conversocial) have been looking at how best to do MySQL schema changes whilst keeping as much uptime as possible. We found a few common themes in the solutions that applications use:
1. Take the downtime
This isn't really a solution to maximising uptime, but it's quick and easy in terms of development time. It's worth considering because many applications can tolerate occasional downtime and don't really need to worry about high availability.
The drawback is that the application must be able to afford downtime at specific times or have a database that is small enough for the downtime to be unnoticed.
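In this approach the change is just a plain ALTER run during a maintenance window. A minimal sketch, assuming a hypothetical user table and birthday column:

```sql
-- A plain in-place ALTER: simple, but it locks the table against
-- writes (and, depending on the MySQL version and storage engine,
-- reads) until the table rebuild completes.
ALTER TABLE user ADD COLUMN birthday DATE NULL;
```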
Some applications shard their database into geographical regions and serve customers from the closest region. This means that the downtime for customers from Australia can be at 2am Australia time when most Australians are asleep, without worrying that customers from Europe will be greeted with an application that doesn't work.
2. Schema-less MySQL
This is what FriendFeed do.
The general approach is to store the data as JSON blobs in the database. Index tables are then created with a schema and use MySQL's indexing features. For example, instead of a user table with columns for email address and password, the schema would be:
- a user table with an integer ID and a JSON blob containing the data
- a user_email_index table containing the integer IDs and an email column - the index would be on the email column
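The layout above might look like this in SQL (table and column names are illustrative, not FriendFeed's actual schema):

```sql
-- All user attributes live in an opaque JSON blob; only the ID
-- has a schema the database knows about.
CREATE TABLE user (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body MEDIUMBLOB NOT NULL  -- JSON blob containing the user's data
) ENGINE=InnoDB;

-- A separate index table gives MySQL something it can index.
CREATE TABLE user_email_index (
    email   VARCHAR(255) NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (email, user_id)
) ENGINE=InnoDB;

-- Look a user up by email via the index table, then fetch the blob:
SELECT u.body
FROM user_email_index i
JOIN user u ON u.id = i.user_id
WHERE i.email = 'alice@example.com';
```

Adding a new indexed field then means creating a new index table and backfilling it, rather than altering the main table.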
This is a nice solution, but a schema-free database such as MongoDB, Redis or CouchDB may be a better fit. The main reasons for choosing this approach over going NoSQL are:
- the application's team can continue using their existing knowledge of MySQL
- MySQL has transaction support - most NoSQL solutions do not offer rich transaction support
- many hosting solutions offer well-tuned MySQL support (e.g. Amazon RDS)
3. Don't change the schema - normalise it
Instead of changing the tables in the existing schema, add new tables and have the application do joins. This requires no downtime but may make coding more difficult and queries slower.
This solution can be a better fit than it appears. For example, if you are adding an email field to a user table then it may be inevitable that users will eventually want more than one email address for their account, in which case adding another table is the correct approach.
Certain types of queries may be faster with this solution. For example, if all user emails are stored in a separate table, user_emails, then all the email addresses will be stored close together on disk. A query for all email addresses for a set of users is therefore likely to be faster.
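A sketch of this normalised layout (column names are hypothetical):

```sql
-- Emails live in their own table, so supporting multiple addresses
-- per user needs no ALTER on the user table at all.
CREATE TABLE user_emails (
    user_id INT UNSIGNED NOT NULL,
    email   VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, email)
) ENGINE=InnoDB;

-- All email addresses for a set of users, read from one compact table:
SELECT e.user_id, e.email
FROM user_emails e
WHERE e.user_id IN (1, 2, 3);
```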
4. Shard the database into tiny shards
It may be possible to shard the database into shards that are small enough to upgrade with only a few seconds of downtime per shard. This small amount of downtime may go unnoticed by the application's users.
A benefit of this solution is that it can make partial roll-outs of code easier to perform. Many online applications perform slow roll-outs of code to ensure that any unexpected problems are caught before every user is affected by the problems.
The application must be designed to handle this: a new field will not exist on every shard until the rollout completes, so the application either has to cope with its absence or avoid using it until all shards are upgraded. A further downside is that the application must be able to use sharding - though this may not be a downside if the application will eventually grow large enough to make sharding a requirement, and if the sharding approach can produce consistently small shards.
5. Create upgrade tables and move data across to them in the background
For example, if the user table needed a birthday column then:
- a new table called user2 is created with the new schema
- data is copied from the user table to user2
- once the copy is complete, user2 is renamed to user
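A minimal sketch of these steps in SQL (column names are hypothetical):

```sql
-- Build the upgrade table with the new schema.
CREATE TABLE user2 LIKE user;
ALTER TABLE user2 ADD COLUMN birthday DATE NULL;

-- Copy existing rows across. In practice this is done in batches,
-- and rows changed during the copy must be tracked (e.g. with triggers).
INSERT INTO user2 (id, name, email)
SELECT id, name, email FROM user;

-- Atomically swap the tables; MySQL performs both renames as one step.
RENAME TABLE user TO user_old, user2 TO user;
```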
The hard part here is copying the data - whilst the data is being copied it will be changing as the application continues running. There are several solutions to this and the most common and successful approaches use triggers to track changes during the upgrade. Facebook do this and have released a tool to help. The Openark kit also contains a tool for doing this.
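A hypothetical sketch of the trigger-based tracking, assuming the user2 upgrade table described above (real tools such as Facebook's OSC and the openark kit generate equivalents of these automatically; column names are illustrative):

```sql
-- Keep user2 in sync while the background copy runs: every write to
-- the live table is replayed against the upgrade table.
CREATE TRIGGER user_osc_ins AFTER INSERT ON user FOR EACH ROW
  REPLACE INTO user2 (id, name, email) VALUES (NEW.id, NEW.name, NEW.email);

CREATE TRIGGER user_osc_upd AFTER UPDATE ON user FOR EACH ROW
  REPLACE INTO user2 (id, name, email) VALUES (NEW.id, NEW.name, NEW.email);

CREATE TRIGGER user_osc_del AFTER DELETE ON user FOR EACH ROW
  DELETE FROM user2 WHERE id = OLD.id;
```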
This is a complex and tricky approach, but the available tools make it easy to do. The downside is that the triggers and the data copy put extra load on the database, which can degrade performance significantly. The tools also carry some restrictions (such as no foreign keys for Facebook's tool).
6. Perform the upgrade on a slave and promote
The solution here is to perform the upgrade on a slave, wait for the slave to catch up once the locks are released and then promote the slave to master. Afterwards, the upgrade can be performed on the old master and that can then become a new slave.
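One possible sequence of statements for the slave side (a sketch; the table and column are hypothetical, and the promotion mechanics vary by setup):

```sql
-- On the slave: apply the schema change. Replication stalls while
-- the ALTER holds its locks, then catches up from the relay log.
STOP SLAVE;
ALTER TABLE user ADD COLUMN birthday DATE NULL;
START SLAVE;

-- Watch SHOW SLAVE STATUS until Seconds_Behind_Master reaches 0,
-- then promote the slave:
STOP SLAVE;
RESET SLAVE;
SET GLOBAL read_only = OFF;
-- ...and point the application (or a floating IP) at the new master.
```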
This is a fairly simple solution for most cases, but needs care around column deletions. The drawback is that the application must see similar latencies from the slave as from the master. Typically this means the slave must be in the same data centre as the master, which reduces its usefulness as a standby server if the master fails (e.g. it cannot help if the data centre has a power outage).
If the application already has read replication on the same physical site then this can be really cheap (assuming slaves can be promoted to master without loss of data). A common way of doing the promotion is to use floating IPs; however, certain hosts (e.g. Amazon EC2) won't let you do that (they allow public floating IPs, but going via a public IP gets slow and expensive for MySQL).
This solution isn't talked about much - I suspect because most applications that use it get it working quite easily and don't feel a need to talk about it (it's not a very sexy solution).
There are many variations on these solutions, but I've not yet seen any other methods that are drastically different. Each approach has its strengths and weaknesses, and you have to find a solution that works for you and plays to your team's strengths.