I'm writing up everything I think someone running engineering at a startup needs to know. Every post will have a checklist at the end that can be used as a quick sanity check.
This is my first post for the series. Let me know what you think!
If Your Database Server Fails, What Happens?
Losing all your data is generally not an option. To prevent total data loss, most people look to backups. There is far more to backups than just copying the files somewhere with a cron job - the realm of backups is big enough that you can get a job as a backup engineer if you wished. In some cases, companies run quite happily without any traditional backups and instead rely on replication of data to enough locations that their data would survive anything short of a world war.
That said, there are some basic techniques and also some more advanced techniques you should be aware of to safeguard your data.
Techniques you Should be Aware of
I want to talk about recovery before backups - there is no point in backups if you cannot recover.
A DR (disaster recovery) test is a full or partial test of your recovery process. This can be made as close to a real disaster as you like, and the closer you get to reality the more confidence you can have in your recovery process.
The most important part to test is other people. You might know exactly how to handle database recovery but what if you aren't around the day that it all goes wrong?
Partial DR testing can also be performed automatically. A script could be written to regularly create a new server from recent backups and check that the backups work.
Examples of things to test include:
- Configuration of a database server from a fresh machine
- Recovery of your database from backup files to a new server
- Switching your application to use the new server
- Automatic failover of your database - take care with this, if it goes wrong you've just broken your production site
- Communication processes - who needs to know about the disaster? Do customers need to know? Do you know what you would tell them?
- Failure in recovery - what if the most recent backup is corrupt?
Whilst you are thinking about it, you shouldn't just limit DR testing to your servers. What about your office? What about people?
A DR test will prove that you can actually recover in the event of a disaster. Until then, you're just hoping. The first time you do it, you will encounter flaws in your process. Testing it gives you the opportunity to improve it before you're doing it for real.
Doing regular DR tests will also ensure that other people know how to cope in a disaster.
DR tests can be time consuming to perform and can lead to real downtime if you are testing the failover of real components. The best way to mitigate this is to do such testing out of busy times to minimise impact.
The hardest part of doing DR testing is remembering to do it regularly. As your company evolves your recovery processes might need to evolve too, regular DR tests will highlight this need.
Full Database Backups
A full database backup is a useful but surprisingly hard thing to get right. Ensure you thoroughly read everything available on backups for your chosen database technology.
Aside from the obvious, full database backups can also serve as good sources of data for testing or analysis (after being suitably cleansed of any confidential data).
Full database backups are hard to get right.
- Consistency - is the backup consistent? Have you performed whatever your database requires to ensure the data is consistent? E.g. do you prevent writes to the files whilst creating the backup
- Restorability - can the files be restored? This sounds silly but it's relatively easy to create backup files that simply do not work (e.g. your backup device could have run out of space)
- Time to restore - if backups are large they can take a long time to restore from (especially if copying from a different data centre). Backups are nice. Knowing it will take 24 hours to copy the backups isn't nice.
- Security - are your backups secure? Many leaks of confidential data have been where someone has gained access to backups
- Regularity - how regular are the backups? If they're your only option then you could lose a lot of data
- Load - creating a full backup can place a lot of load on the server chosen for the backup. It can also slow down network performance as large files are copied
Some database technologies give you the option of maintaining logs of operations (often called journals or binary logs). These can then be combined with a full backup to restore data to a single point-in-time - providing your log goes back as far as the time when the full backup was taken. In general, the database will start with the full backup and then replay the logs to get the database to the desired point in time.
Journals are generally smaller than the entire database and so can be backed more regularly than the entire database. This gives smaller windows in which data could be lost.
Journals can also be used in the event of an undesirable operation being performed on the database (e.g. deleting a lot of data accidentally). This is such a common use case that there are tools for most databases that will assist you in replaying entire journals except the specific commands that weren't wanted.
Replaying journals can take a long time. Combine that time with the time it takes to restore a full backup and you could have a long wait on your hands.
If your database fails you are probably going to be stressed. When you are stressed you are more prone to making mistakes. When recovering your database the last thing you want to do is make a mistake. Creating a script that performs recovery can make life much easier. Even better is a script that can be run with one click and zero configuration.
Aside from reducing the chances of mistakes, a recovery script will allow others to do the recovery with minimal learning on their part.
Recovery scripts also make automatic DR testing much easier.
Recovery scripts need to be kept up to date. It is also tempting to make them more complicated than they need be - which can lead to bugs in your recovery process.
Automated recovery includes any process whereby a server failure is automatically detected and the server replaced by a new server that has been configured automatically.
Downtime is kept to an absolute minimum. No more getting up in the middle of the night to restore a server.
Creating auto-recovery systems is hard. Detecting when a server has failed is hard - there are lots of ways a server could appear failed without it being a failure, e.g. a network issue could be making it appear down to your monitoring systems. Further, the failure might not be fixed by creating a new server, e.g. disk space might have run out.
A hot standby is a secondary database server that can be switched to if the primary database fails. Typically, the standby is kept up-to-date with the primary database by replicating writes to the standby (a variety of methods for doing this exist).
When the primary fails the standby is switched to and database functionality is restored, this is called failover.
Most databases provide basic functionality for hot standbys but require manual failover. Some databases go further and provide functionality for automatic failover.
Hot standbys are primarily useful when uptime is a concern. In general, a hot backup will allow you to restore database functionality within a few minutes (if not seconds).
A further benefit is that very little data is lost - typically, only the data that was written to the primary but hadn't yet been written to the standby. In some configurations it is possible to have zero data loss - some databases allow for writes to be classed as successful only when they have been replicated to a standby.
Most of the problems of hot standbys are related to how they have been configured.
One of the biggest problems with a hot standby is that they do little to protect against malicious (or accidental) data corruption. If someone deletes all the data on the primary server then it will likely be deleted from the standby too. Time-delayed standbys can be used to solve this problem.
The placement of a hot standby is important. If the standby is in the same data centre as the primary then it is useless in the event of power outages, network outages, etc.
It is tempting to have standby servers running on cheaper hardware than the primary servers. This may save money but has the downside that performance may be degraded when the primary server fails. It also increases the risk that the standby cannot keep up-to-date with the primary server.
A time-delayed replica is the same as a hot standby except it is intentionally behind the primary server. Typically, a time-delayed replica is kept an hour or so behind the primary. Generally, the time-delayed standby is kept out of the set of servers that can be failed over to automatically.
A time-delayed replica is useful when the worst should happen: someone accidentally deletes your data (or maliciously during a security breach). Having a time-delayed replica makes it quicker to recover in these situations. It can either by failed over to directly (losing a little bit of data) or allowed to catch up until just before the point of disaster.
Whilst useful, time-delayed replicas add extra costs that might not be needed. They also add additional overheads in configuration and maintenance.
No Master Server
Not having a master server means that writes can go to any of a number of servers. If one server should go down then writes can continue being performed to the other servers.
There are many different ways to do this and each database technology has its own of doing it. Needless to say, many SQL implementations support this (e.g. Postgres, MySQL) as well as other technologies that do not use SQL (e.g. Cassandra, Riak).
Having multiple masters makes it easier to keep writes going even in the event of a server failure.
If enough servers go down then writes become impossible. Ensuring that servers are in different racks (at the least) or in different data centres reduces the chance of multiple servers going down at the same time.
In general, the SQL implementations are significantly more complex than normal master-slave replication. The non-SQL technologies that have no master are unfamiliar to most developers and this could slow down your development (once the technologies are learnt you might find you develop faster as downtime issues occupy less time).
As I said earlier, there is a lot more to backups than you might have thought. As with all things in technology there are a lot of trade-offs you will need to consider when designing your recovery processes. To make your decisions a bit easier I've put together a checklist for your backups in rough order of importance.
The Bare Minimum
- Backups are made automatically
- There is a written down restoration process that can be followed by any technical person in the company
- At least one DR test when you think you have the rest of this list done
- Periodic DR tests
- There is a one-click restoration script that can be run by any technical person in the company
- Backups are manually checked regularly by restoring to a server
- Backups are automatically checked regularly by restoring to a server
- Hot standbys are available that can be manually switched to
- Restoration can be made to a single point in time
Nice to haves
- Hot standbys are available that are automatically switched to
- Hot standbys are in a different data centre
- At least one time-delayed replica
- There is no master database and loss of one server has no impact on write/read availabilty