Leading Disaster Recovery: Backup and Restore
What is the DR strategy “Backup/Restore”?
The backup/restore approach to DR is the most in line with traditional approaches to disaster recovery, but in the serverless space it offers very little value. In this approach your data stores, such as DynamoDB, S3, and Aurora/RDS, have their data backed up to a different region, either through scheduled backups or through data replication. That data is not brought online until your system has deployed all new resources in the recovery region, at which point the data is restored to those resources. The benefit of this approach is that backing up the data is fairly cost- and time-effective. Services such as DynamoDB can use global tables to send data to a new region, and S3 can replicate data as well, but this strategy does not cover services that support the system but are not databases. Regional services such as Elasticsearch clusters and machine learning datasets will need to be regenerated.
What are the problems with this approach?
The problem is that restoring your data can take anywhere from hours to weeks and can be costly, as your AWS service usage will spike dramatically during the restore. If you have partner integrations that involve whitelisting IP addresses, recovery can become a drawn-out process of coordinating with service providers who might not share your sense of urgency. In that case you may want to keep at least enough infrastructure in place to maintain your security foothold. For AWS outages, which typically range from six to eight hours, recovering the backups and validating the recovered system can easily take longer than the outage itself. In the event of a true region-wide disaster, however, this approach does allow a company to bring its systems back online in the worst-case scenarios.
What are you signing up for?
Minimum Leadership Level: Standard development team with an operations team
Team AWS Skill
Risk Tolerance Level
Average AWS Outage: 6–8 hours
Regional Recovery Time: Multiple hours to weeks
Recovery requiring design changes
Up-front development costs: 2–4 weeks
In this approach the team is signing up to copy its data to another region. The fastest way to set up a backup and restore disaster recovery strategy is to use AWS Backup. This is a newer service from AWS that provides a simple way to back up Amazon RDS, Amazon DynamoDB, Amazon S3, and various other resources. AWS Backup also has the advantage of a single point of configuration for where to back up all of your resources, and it will automatically discover and expand coverage as new resources are deployed. Another advantage is configurable schedules and retention periods, which simplify the upfront effort while ensuring costs don't spin out of control. The downside is that the recovery point objective, or how much data you will lose if you have to restore, can be quite significant. The other downside of this approach is that after a disaster is resolved, restoring your data back to your primary region is a significant task and could require downtime to get all the data synced correctly.
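As a rough illustration, an AWS Backup plan with a cross-region copy action and tag-based resource selection might look like the following. This is a minimal sketch: the vault names, regions, account ID, IAM role, and retention periods are all assumptions, not values from the text.

```python
# Sketch of an AWS Backup plan that copies recovery points to a second
# region. Vault names, regions, account IDs, and retention periods are
# illustrative assumptions -- adjust for your account.
BACKUP_PLAN = {
    "BackupPlanName": "dr-backup-restore",
    "Rules": [
        {
            "RuleName": "daily-with-cross-region-copy",
            "TargetBackupVaultName": "primary-vault",   # primary-region vault (assumed)
            "ScheduleExpression": "cron(0 5 * * ? *)",  # daily at 05:00 UTC
            "StartWindowMinutes": 60,
            "CompletionWindowMinutes": 360,
            "Lifecycle": {"DeleteAfterDays": 35},       # retention controls storage cost
            "CopyActions": [
                {
                    # Copy each recovery point into the DR region's vault.
                    "DestinationBackupVaultArn": (
                        "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault"
                    ),
                    "Lifecycle": {"DeleteAfterDays": 35},
                }
            ],
        }
    ],
}

# Tag-based selection is what lets AWS Backup automatically pick up newly
# deployed resources -- the "discover and expand" behavior described above.
BACKUP_SELECTION = {
    "SelectionName": "tagged-resources",
    "IamRoleArn": "arn:aws:iam::123456789012:role/BackupServiceRole",  # assumed role
    "ListOfTags": [
        {"ConditionType": "STRINGEQUALS", "ConditionKey": "backup", "ConditionValue": "true"}
    ],
}

# With boto3 these payloads would be applied roughly as:
#   client = boto3.client("backup")
#   plan_id = client.create_backup_plan(BackupPlan=BACKUP_PLAN)["BackupPlanId"]
#   client.create_backup_selection(BackupPlanId=plan_id, BackupSelection=BACKUP_SELECTION)
```

The `DeleteAfterDays` lifecycle setting is what keeps retention, and therefore cost, bounded; the cross-region `CopyActions` entry is what makes the backups usable for regional DR rather than just point-in-time recovery.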
Another approach is to leverage the native capabilities of your existing components. With this approach your systems start seeing recovery point objectives of seconds or minutes rather than a day. Say your team is using Amazon DynamoDB: its global tables are fairly simple to set up and stream data in real time to multiple regions, providing an RPO of seconds. The other advantage in this scenario is that the data syncing is bidirectional, meaning that after a disaster is resolved, the data will automatically sync back to the region that was affected without your team having to put extra effort into the recovery process. If you're using S3, you'll want to create buckets in your disaster region and set up data replication from your source bucket(s).
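The two native mechanisms can be sketched as API payloads. The table, bucket, region, and role names below are illustrative assumptions; note the asymmetry the text describes: DynamoDB global tables replicate bidirectionally, while S3 replication runs one way from source to destination.

```python
# Sketch of native cross-region replication for DynamoDB and S3.
# All names, regions, and ARNs here are illustrative assumptions.

# DynamoDB: adding a replica converts the table into a global table and
# starts bidirectional streaming, giving an RPO measured in seconds.
DYNAMODB_REPLICA_UPDATE = {
    "TableName": "orders",
    "ReplicaUpdates": [{"Create": {"RegionName": "us-west-2"}}],
}
# Applied roughly as:
#   boto3.client("dynamodb", region_name="us-east-1").update_table(**DYNAMODB_REPLICA_UPDATE)

# S3: replication is one-way from the source bucket to the DR bucket,
# and requires versioning to be enabled on both buckets.
S3_REPLICATION_CONFIG = {
    "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",  # assumed role
    "Rules": [
        {
            "ID": "replicate-to-dr",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                                  # empty filter = whole bucket
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {"Bucket": "arn:aws:s3:::my-app-dr-bucket"},
        }
    ],
}
# Applied roughly as:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-app-primary-bucket",
#       ReplicationConfiguration=S3_REPLICATION_CONFIG)
```

Because the S3 side is unidirectional, failing back after a disaster means setting up a second replication rule in the reverse direction, whereas the DynamoDB replicas resynchronize on their own.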
Next, you'll need to develop tools to load tertiary data stores, such as Amazon OpenSearch, Amazon Timestream, Amazon ElastiCache, and various others, from the data you've backed up. The tricky part is that your data models will probably change over time, and your team's code will have to take that into account. This area has a high probability of failure, as teams typically treat disaster recovery as a final checkbox and focus more on features.
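One way to handle drifting data models is to version each record and replay every backed-up record through an upgrade chain before reindexing it. This is a hypothetical sketch: the field names, version tags, and upgrade steps are invented for illustration, not taken from any real schema.

```python
# Sketch: rebuilding a tertiary store (e.g. an OpenSearch index) from
# backed-up records whose schema has changed over time. Field names and
# version numbers are hypothetical.

def to_current_model(record: dict) -> dict:
    """Upgrade a backed-up record, version by version, to the current model."""
    record = dict(record)
    version = record.get("schema_version", 1)
    if version == 1:
        # Hypothetical change: v1 stored one "name" field; v2 split it.
        first, _, last = record.pop("name").partition(" ")
        record.update(first_name=first, last_name=last, schema_version=2)
    if record["schema_version"] == 2:
        # Hypothetical change: v3 added a search_text field the index expects.
        record.update(
            search_text=f"{record['first_name']} {record['last_name']}",
            schema_version=3,
        )
    return record

def rebuild_index(backup_records, index_fn):
    """Replay every backed-up record through the upgrades, then index it."""
    for record in backup_records:
        index_fn(to_current_model(record))
```

The point of the chained `if` blocks is that a record backed up years ago under v1 and one backed up yesterday under v2 both come out as valid v3 documents, which is exactly the concern a feature-focused team tends to skip.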
After setting those components up, you'll want to write a run-book that provides step-by-step instructions on what the team needs to do to recover the system. This allows the team to review the steps and make sure no steps or capabilities are missing. On a scheduled basis, at least annually, you'll want to exercise this run-book and attempt to recover the system to verify that your backups actually work.
After verifying your backup process you'll also need to put together a run-book for moving your system back to your primary region. This will likely reuse some of the steps from the original recovery, but it can differ because your original system may still hold data up to the moment the disaster occurred. In that case, you'll need to restore the missing data and all subsequent updates back to the original system and verify it is operational before transitioning your customers back to the original region. This area can be especially tricky, as AWS restore operations typically create new resources, such as Amazon RDS instances or Amazon DynamoDB tables, to restore the data onto. Since your team will most likely use infrastructure as code, this can mean that parts of the system that are normally deployed and configured automatically need to refer to the restored resources instead.
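One common way to let infrastructure-as-code absorb a restored resource is a conditional create-or-reference pattern. The sketch below shows the idea as a CloudFormation-style template built in Python; the parameter, resource, and attribute names are hypothetical, and your IaC tool of choice will have its own idiom for the same thing.

```python
# Sketch: an IaC template that either creates a fresh DynamoDB table or
# points the stack at a table that AWS Backup restored. Names are
# hypothetical; the pattern is a CloudFormation condition plus Fn::If.
TEMPLATE = {
    "Parameters": {
        # Leave blank for a normal deploy; after a restore, set this to the
        # restored table's name so the stack wires itself to that table.
        "RestoredTableName": {"Type": "String", "Default": ""},
    },
    "Conditions": {
        "CreateTable": {"Fn::Equals": [{"Ref": "RestoredTableName"}, ""]},
    },
    "Resources": {
        "OrdersTable": {
            "Type": "AWS::DynamoDB::Table",
            "Condition": "CreateTable",  # skipped when reusing a restored table
            "Properties": {
                "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
                "AttributeDefinitions": [{"AttributeName": "pk", "AttributeType": "S"}],
                "BillingMode": "PAY_PER_REQUEST",
            },
        },
    },
    "Outputs": {
        # Downstream resources read this output rather than Ref'ing the
        # table directly, so they work in both the fresh and restored cases.
        "TableName": {
            "Value": {
                "Fn::If": ["CreateTable", {"Ref": "OrdersTable"}, {"Ref": "RestoredTableName"}]
            }
        },
    },
}
```

Building this indirection in before a disaster is far cheaper than discovering, mid-recovery, that every consumer of the table hard-codes a resource the restore just replaced.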
How does this impact the team’s mentality?
Teams that use this approach tend to be more focused on business value and less technical, since in theory the approach could be set up by a cloud operations team. These teams tend to focus less on how the services work, even if they were involved in the backup process. The team will often assume there are mechanisms that will help them get back online, but in practice they will have either a very manual process built on assumptions or vague descriptions that must be interpreted in real time during a disaster.
The team will often not focus on release pipelines that accelerate development, or on more advanced release approaches such as blue/green deployments or CI/CD. Because the team is not pushed to leverage more advanced concepts within the cloud computing space, working on monoliths becomes an inherent tendency.
Long term, this approach can lead to systems that, like those built under the "single region with high availability" approach, can't be deployed in another region without significant rework. In traditional legacy approaches to disaster recovery this is addressed by annual testing and validation processes, which can leave your team spending less time on capabilities that drive business revenue and competitive advantage.