Leading Disaster Recovery: Warm Secondary
Warm secondary is the fourth disaster recovery strategy covered in Leading Disaster Recovery on AWS Serverless. This post dives into what it takes to lead a warm secondary disaster recovery implementation and how it will affect your teams.
What is the DR strategy “Warm Secondary”?
Warm secondary in serverless terms means that the system is deployed across multiple regions, with data continually synced from the primary region to the DR region. At this stage the data also tracks the region where it was changed, so that a change is not processed a second time in a region where that isn't applicable. This can also be called "sticky processing". When the primary region suffers a disaster, traffic can be switched over to the secondary region in anywhere from seconds to 15 minutes, depending on your DNS routing configuration. The odds of the fail-over going poorly are low, as the resources will already be present and ready to take on traffic.
Typically, in these implementations, the switch from the primary region to the secondary region can be automatic, but the switch back to the primary region might not be. This is because the secondary region may hold data which must be synchronized before processing is switched back to the primary region.
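To make the DNS side of the fail-over more concrete, here is a minimal sketch of a pair of Amazon Route 53 failover records, assuming Route 53 is the DNS service in use. The function name `failover_change_batch`, the host names, and the health check ID are all hypothetical; the dictionary follows the `ChangeBatch` shape that Route 53 accepts. A lower TTL generally shortens how long clients keep resolving to the failed region.

```python
from typing import Any, Dict, Optional


def failover_change_batch(
    record_name: str,
    primary_target: str,
    secondary_target: str,
    primary_health_check_id: str,
    ttl_seconds: int = 60,
) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover records.

    All names passed in are illustrative; nothing here calls AWS.
    """

    def record(role: str, target: str, health_check_id: Optional[str]) -> Dict[str, Any]:
        record_set: Dict[str, Any] = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl_seconds,
            "ResourceRecords": [{"Value": target}],
        }
        # Only the primary record is tied to a health check in this sketch;
        # when that check fails, DNS answers fall through to the secondary.
        if health_check_id is not None:
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Warm secondary fail-over routing (illustrative)",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target, None),
        ],
    }
```

The resulting dictionary could be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets` call on a Route 53 client, along with your own hosted zone ID. Failing back would remain a deliberate step, taken only after the data has been synchronized.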
What are the problems with this approach?
The biggest problem with this approach is around investment versus value. Your team will still need to verify the disaster recovery region at least annually, and there is still some risk that configurations are not fully verified in the disaster region.
After a disaster, there will also be an effort to synchronize the information from the disaster region back to your primary region. Content in Amazon S3, Amazon RDS, and other resource types that won't automatically sync is an extra step in full recovery. Tertiary resources must also be re-populated before the primary region comes back online.
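One way to picture the synchronization effort is as a comparison of what each region holds before switching back. The sketch below is an assumption-laden illustration, not a prescribed tool: it assumes you can list each region's objects or rows as a mapping of key to a last-modified version number, and the function name `keys_to_sync_back` is hypothetical.

```python
from typing import Dict, List


def keys_to_sync_back(primary: Dict[str, int], secondary: Dict[str, int]) -> List[str]:
    """Return keys that are newer in the DR region than in the primary region.

    Each mapping is key -> version (for example a last-modified epoch).
    A key missing from the primary is treated as needing to be copied back.
    """
    return sorted(
        key
        for key, version in secondary.items()
        if primary.get(key, -1) < version
    )


# Example: the DR region took writes while the primary was down.
primary_listing = {"a.txt": 100, "b.txt": 200}
secondary_listing = {"a.txt": 100, "b.txt": 250, "c.txt": 300}
pending = keys_to_sync_back(primary_listing, secondary_listing)
```

An empty result is a reasonable signal that this particular resource is ready for the switch back; anything else is work for the run-book.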
The team will still need to be conscious of circular dependencies in their AWS CloudFormation stacks. Since all deployments are incremental, the team runs into fewer of the issues that a clean deployment would likely hit if one was needed. These types of dependencies won't affect your team in a disaster, which is a big positive, but will affect you if a new environment is being set up.
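Because incremental deployments can hide circular dependencies until a clean environment is needed, it may help to check for them directly. The following is a small sketch using a depth-first search, assuming you maintain (or can export) a map of which stack depends on which. The stack names and the function `find_cycle` are hypothetical and not part of any AWS tooling.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of stack names, or None."""
    path: List[str] = []  # stacks on the current depth-first path
    finished = set()      # stacks already proven cycle-free

    def visit(stack: str) -> Optional[List[str]]:
        if stack in finished:
            return None
        if stack in path:
            # Found a stack that is already on the path: report the loop.
            return path[path.index(stack):] + [stack]
        path.append(stack)
        for dependency in depends_on.get(stack, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        path.pop()
        finished.add(stack)
        return None

    for stack in depends_on:
        cycle = visit(stack)
        if cycle:
            return cycle
    return None


# Example: three stacks that can only be created incrementally.
stacks = {"api": ["data"], "data": ["events"], "events": ["api"]}
cycle = find_cycle(stacks)
```

Running a check like this in a lower environment gives the team a chance to break the loop before a new environment has to be built from scratch.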
With that said, this isn’t a horrible spot to be in.
What are you signing up for?
Minimum Leadership Level
Semi-strong technical leadership
Standard development team with solution architect support
Team AWS Skill
Risk Tolerance Level
Average AWS Outage
6 – 8 hours
Regional Recovery Time
5 – 30 minutes
Recovery requiring design changes
Average up front development costs
3 – 6 months
At this point, the team will be building stacks which won't need to be segmented for disaster recovery to work. Amazon DynamoDB Global Tables sync data continually to the secondary region. Amazon S3 replicates data to the secondary region, and events are used in each region to continually replicate data to tertiary systems such as Amazon OpenSearch, Amazon ElastiCache, etc. The compute services will need to understand where the data was written and where it should be processed. The simplest approach here is that data written from the primary region should be processed in the primary region. Functions in the secondary region will automatically pick up the work once traffic is diverted to that region, which makes the system automatically responsive to the fail-over with minimal effort. There will also be a need to have a plan in place for region-specific configurations, such as which endpoints to use when failing over, how the customer is granted access to the region-specific resources, etc.
The deployment pipeline will be more complicated than for a single region, but since the stack will always be deployed in the two mirrored regions, there are fewer things that can go wrong. The team will need to stay aware of the deployment pipeline, as a deployment could succeed in the primary region and still fail in the DR region. These failures will often show up in lower environments, which gives the development team an opportunity to fix the issues before running the deployment in production.
This also means that you will double all your resources except for your primary routing resources, so any issues with caching may need to be fixed in both regions.
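The "process where it was written" rule can be sketched in a few lines. This is only an illustration under stated assumptions: the application is assumed to stamp each item with a hypothetical `origin_region` attribute when writing it, and the function reads the current region from the `AWS_REGION` environment variable that AWS Lambda sets. The names `should_process` and `handle_records` are made up for this example.

```python
import os
from typing import Any, Dict, List, Optional


def should_process(item: Dict[str, Any], current_region: Optional[str] = None) -> bool:
    """Process an item only in the region where it was originally written."""
    region = current_region or os.environ.get("AWS_REGION", "")
    # "origin_region" is a hypothetical attribute written by the application.
    return item.get("origin_region") == region


def handle_records(records: List[Dict[str, Any]], current_region: Optional[str] = None) -> List[str]:
    """Return the ids of the records this region chose to process."""
    processed = []
    for item in records:
        if should_process(item, current_region):
            processed.append(item["id"])  # real work would happen here
    return processed


# Example: the same replicated records as seen by each region.
replicated = [
    {"id": "order-1", "origin_region": "us-east-1"},
    {"id": "order-2", "origin_region": "us-west-2"},
]
handled_in_primary = handle_records(replicated, "us-east-1")
handled_in_secondary = handle_records(replicated, "us-west-2")
```

With this rule, nothing special has to happen at fail-over: once traffic is diverted, new writes carry the secondary region's name and its functions start picking them up.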
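A pipeline step that treats the two regions as one unit might look like the sketch below. It assumes the actual deployment is wrapped in a callable you supply (for example a wrapper around your own tooling); `deploy_to_regions` and `all_succeeded` are hypothetical helper names, and no AWS API is called here.

```python
from typing import Callable, Dict, List


def deploy_to_regions(regions: List[str], deploy: Callable[[str], None]) -> Dict[str, str]:
    """Run the supplied deploy step in every region and record each outcome."""
    results: Dict[str, str] = {}
    for region in regions:
        try:
            deploy(region)
            results[region] = "ok"
        except Exception as error:  # keep going so every region is reported
            results[region] = f"failed: {error}"
    return results


def all_succeeded(results: Dict[str, str]) -> bool:
    """The release counts as good only if every region deployed cleanly."""
    return all(status == "ok" for status in results.values())
```

Reporting each region separately makes the "primary succeeded, DR failed" case visible instead of letting the primary's success mask it.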
How does this impact the team’s mentality?
The team will have learned how to create and maintain global resources with region-specific names. They’ll also have a solid understanding of how to manage multiple systems operating on the same data, which helps them understand how services interact and what those services are capable of. That should also translate into more business value, especially around collaboration. The team will also be well versed in AWS CloudFormation and the variety of capabilities it has, using AWS CloudFormation and other services to drive down the amount of code written and maintained by the team.
This team is reaching a point where they’ll probably be ready to take on that next step of building an Active/Active system.
You’ve already spent a significant budget on AWS disaster recovery without taking the final step to Active/Active, the final disaster recovery approach with the most benefit. What you’re left with is a manual effort for recovery and the potential that an individual has to be on call in case a disaster happens. You’ll still need a run-book for failing over, which needs to be tested at least annually, and you’re still going to pull people away from creating business-related value.
This should be considered a stepping stone on the way to Active/Active.