EC2, Confluence, S3 and PostgreSQL
Update: EC2 will have persistent storage!
My side projects at the moment are both work related — that is, they’re related to Confluence, our enterprise Wiki. Even though I don’t work on that team any more the product still has a lot of mindshare with me.
My latest spare time project is running Confluence on EC2.
Amazon announced SimpleDB recently. I haven’t decided what it’s good for yet: its lack of transactions means that it isn’t a drop-in replacement for a relational database, but rather a component of massively scalable distributed systems.
While looking at SimpleDB I had another look at EC2. EC2 provides virtual machines, running Linux (or other operating systems virtualised under Linux, as far as I know) which can be created and destroyed at will, and cost 10 cents an hour to run.
Getting Confluence running on EC2 was very easy — I picked one of Amazon’s pre-built Fedora Core 4 images, installed PostgreSQL and a JDK, and that was that. After you’ve customised an image you save it to S3, and start it (or many copies of it) again when you wish.
Confluence feels as though it’s running on a machine with the specs Amazon specify, with the stated 1.7GB of physical memory. I haven’t yet tried the high end instances — quad 2GHz Xeon equivalent.
The catch with EC2 is that while a virtual machine has a 150GB disk, this data isn’t persisted when you shut the machine down — or when it dies unexpectedly due to either a software problem in the hosted operating system, or because Amazon shuts down your instance without you asking. It isn’t clear how often instances die unexpectedly, but Amazon do say “We recommend you should not rely on a single instance to provide reliability for your data.”
So simply saving the image to S3 once a day isn’t the way to go:
- You aren’t guarding against unexpected crashes — you may lose a day’s data.
- You have to shut down your application and DB to get a consistent image.
- You are storing your entire OS image in each backup. This is particularly wasteful if you have many instances of the same application, as these should all be able to start from identical virtual machine images, simply being passed a parameter to tell them where to get their data from.
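One way to pass that parameter is EC2 user data, which an instance can read back from the metadata service once it has booted. This is a minimal sketch only: the `backup_bucket=` line format and the function name are my own assumptions, not anything EC2 or Confluence defines.

```shell
#!/bin/sh
# Sketch: pick the backup bucket out of the instance's user data.
# Assumes (hypothetically) that the instance was launched with user data
# containing a line such as "backup_bucket=confluence".

# Reads user data on standard input and prints the value of the first
# backup_bucket= line, or nothing if there is no such line.
backup_bucket_from_user_data() {
    sed -n 's/^backup_bucket=//p' | head -n 1
}

# On an instance, user data is served by the EC2 metadata service:
#   curl -s http://169.254.169.254/latest/user-data | backup_bucket_from_user_data
```

Keeping the parsing separate from the HTTP fetch means the same image can be tested off EC2 by piping in a local file.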
Confluence can be configured to keep essentially all its dynamic data in the database, so what we need to do is to keep an ‘up to date’ backup of our PostgreSQL database on S3 at ‘all times’: that is, a backup which contains our current data minus at most n minutes of transactions. (OK, there’s also a Lucene index. If you deliberately shut down your machine every day, e.g. if you save money by keeping the instance up only during business hours, you would need to copy that index to S3 too, but if you are only recovering from rare unexpected restarts you probably don’t mind reindexing.)
Fortunately Amazon provides ample free (as in beer) bandwidth between EC2 and S3, and PostgreSQL provides good on-line backup facilities, so this task is relatively easy.
PostgreSQL’s on-line backup facility lets you specify a command to archive each log segment as it is filled. To restore, you roll these segments forward from a full backup.
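A segment is only archived once it fills, so on a quiet wiki the ‘at most n minutes’ bound needs some help. PostgreSQL 8.2 adds an archive_timeout setting that forces a switch to a new segment after a given number of seconds. Here is a minimal sketch of setting it; the function name and the 300 second value are illustrative assumptions of mine.

```shell
#!/bin/sh
# Sketch: append a WAL-switch timeout to postgresql.conf.
# archive_timeout is a PostgreSQL 8.2 setting, measured in seconds.
# The function name and the example value are illustrative assumptions.

# $1 = path to postgresql.conf, $2 = timeout in seconds
set_archive_timeout() {
    # Refuse to add a second, conflicting entry (commented-out lines
    # starting with '#' do not count).
    if grep -q '^[[:space:]]*archive_timeout' "$1"; then
        echo "archive_timeout already set in $1" >&2
        return 1
    fi
    printf 'archive_timeout = %s\n' "$2" >>"$1"
}

# Usage: set_archive_timeout "$PGDATA/postgresql.conf" 300
```

Each forced switch still hands a whole segment file to the archive command, so a very small timeout costs bandwidth and S3 storage.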
To summarise:
- Use PostgreSQL 8.2 — on-line backups have some significant improvements over 8.1 and 8.0.
- Read the excellent documentation
PostgreSQL must be configured to use Write Ahead Log (WAL) archiving, so postgresql.conf contains an archive command like:
archive_command = '/Users/tomd/bin/archivewal %p %f'
where archivewal is:
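#!/bin/sh
# $1 is %p (segment path, relative to PGDATA); $2 is %f (segment file name)
tmp=$(mktemp /tmp/archivewal.XXXXXX) || exit 1
trap 'rm -f "$tmp"' EXIT
# Any failure exits non-zero, so PostgreSQL keeps the segment and retries
bzip2 -c "$PGDATA/$1" >"$tmp" &&
s3put "confluence/local-db-wal-$2.bz2" "$tmp"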
s3put is a command from Tim Kay’s aws utility.
The backup procedure is:
- Tell PostgreSQL that you are starting a backup
- Copy your PGDATA directory to S3. A simple tar archive is fine: because we are also copying the logs, the backup doesn’t have to be consistent.
- Tell PostgreSQL that you have finished the backup. At this point the log files used during the backup will also be archived.
- Check that those logs have made it to S3 and mark your backup as OK to restore from
- Optionally delete older backups and older log file archives.
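The first three steps above can be sketched as a single shell function. pg_start_backup and pg_stop_backup are PostgreSQL’s on-line backup functions and s3put is the same command used by archivewal; the function name, the object naming, and connecting as the postgres user are assumptions of mine. Checking that the logs have arrived and pruning old backups are left out.

```shell
#!/bin/sh
# Sketch of a base backup to S3. Assumes PGDATA is set, psql can connect as
# the "postgres" superuser, and s3put (Tim Kay's aws utility) is on the PATH.
# The function name and the base-<timestamp> object naming are hypothetical.

# $1 = S3 bucket to upload into
base_backup() {
    bb_bucket=$1
    bb_label=base-$(date +%Y%m%d%H%M%S)
    bb_tar=$(mktemp /tmp/basebackup.XXXXXX) || return 1
    bb_status=0

    # 1. Tell PostgreSQL that a backup is starting.
    if ! psql -U postgres -c "SELECT pg_start_backup('$bb_label');"; then
        rm -f "$bb_tar"
        return 1
    fi

    # 2. Copy PGDATA while the server keeps running. Files may change while
    #    tar reads them; replaying the archived logs repairs that on restore.
    tar -cf - -C "$PGDATA" . | bzip2 -c >"$bb_tar" || bb_status=1
    if [ "$bb_status" -eq 0 ]; then
        s3put "$bb_bucket/$bb_label.tar.bz2" "$bb_tar" || bb_status=1
    fi

    # 3. Always end backup mode, even if the copy or upload failed.
    psql -U postgres -c "SELECT pg_stop_backup();" || bb_status=1

    rm -f "$bb_tar"
    return "$bb_status"
}

# Usage: PGDATA=/var/lib/pgsql/data base_backup confluence
```

Staging the archive in a temporary file costs disk space, but it means a failed tar never leaves a half-written object in the bucket.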
The restore procedure (which you would use when restarting an instance) is:
- Provide the EC2 virtual machine with a startup parameter which tells it which S3 bucket to look in for backups.
- Unpack PGDATA from your tar file. (Because we are starting a fresh copy of our EC2 image, PGDATA won’t exist when we start up, and postgres won’t be running.)
- Create a recovery.conf file in PGDATA which tells postgres which command to use to copy log files back from S3.
- Start PostgreSQL. PostgreSQL will request all the log files it needs, that is, all the log files which were archived since the backup.
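The recovery.conf step needs the inverse of archivewal. This is a rough sketch, not tested against a real bucket: recovery.conf and restore_command are PostgreSQL’s, while the restorewal name, its install path, and the assumption that s3get (also from Tim Kay’s aws utility) writes the object to standard output and exits non-zero when it is missing are all mine.

```shell
#!/bin/sh
# Sketch only. The function names and helper path are hypothetical; s3get is
# assumed to print the object to standard output and fail if it is missing.

# Body of a hypothetical restorewal helper, the inverse of archivewal:
# $1 = bucket, $2 = segment file name (%f), $3 = destination path (%p).
restorewal() {
    rw_tmp=$(mktemp /tmp/restorewal.XXXXXX) || return 1
    rw_status=1
    # Download first, then decompress, so a failed download is not hidden
    # behind the exit status of bzip2.
    if s3get "$1/local-db-wal-$2.bz2" >"$rw_tmp" && bzip2 -dc <"$rw_tmp" >"$3"
    then
        rw_status=0
    else
        rm -f "$3"    # never leave a partial segment behind
    fi
    rm -f "$rw_tmp"
    # Non-zero tells PostgreSQL the segment is not in the archive.
    return "$rw_status"
}

# Write recovery.conf into a freshly unpacked PGDATA.
# $1 = PGDATA, $2 = bucket, $3 = path where restorewal is installed.
write_recovery_conf() {
    printf "restore_command = '%s %s %%f %%p'\n" "$3" "$2" >"$1/recovery.conf"
}

# Usage on a new instance, after unpacking the base backup into $PGDATA:
#   write_recovery_conf "$PGDATA" confluence /usr/local/bin/restorewal
#   pg_ctl -D "$PGDATA" start
```

PostgreSQL keeps calling the restore command until it asks for a segment that is not there, so returning a failure status for a missing object is what ends recovery.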
Once I have all this working and tested I’ll post a follow-up with the details, but I can tell you now that EC2 is the most fun you can have for 10 cents an hour!