Pilot light is the third of the five disaster recovery strategies covered in Leading Disaster Recovery on AWS Serverless. This post dives into what it takes to lead a pilot light disaster recovery implementation, and how it will affect your teams.
What is the DR strategy "Pilot Light"?
A pilot light DR strategy focuses on keeping your data in a ready state while reducing or eliminating your compute capacity. In an AWS serverless world, where a backup and restore strategy might use AWS Backup, the pilot light approach uses Amazon DynamoDB global tables and Amazon S3 replication, along with data propagation tasks to services such as Amazon OpenSearch Service, Amazon ElastiCache, or other related regional systems. This keeps the data in your disaster recovery region continually at a production-ready state, and makes your disaster recovery process much faster. Most of the compute resources, such as Lambda functions, are not deployed in a pilot light strategy, except those used to load tertiary resources. In the event of a disaster, the compute resources are deployed to make the environment ready for use.
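As a rough illustration, a minimal pilot light data stack might look like the following CloudFormation sketch. The table name, key schema, and regions are hypothetical placeholders, not a prescription.

```yaml
# Minimal sketch of a pilot light data stack: the table is replicated to the
# DR region while no compute is deployed there. All names are hypothetical.
Resources:
  OrdersTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: orders
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      Replicas:
        - Region: us-east-1   # primary region, where this stack is deployed
        - Region: us-west-2   # pilot light region, receives data only
```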
What are the problems with this approach?
Pilot light approaches allow for faster recovery, but they are an awkward middle ground for serverless systems. For this to work, the overall complexity of the system increases significantly, because the team needs to separate the functionality of the solution based on what gets deployed in the pilot light region. The team now has to handle more complex deployments where resources may or may not already exist. An example of this is Amazon DynamoDB, where the global table creates the disaster region's table from the primary region, and the disaster region deployment has to reference this existing table rather than create it. This complexity increases the amount of code your team has to write, or the number of vendors you partner with.
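As a sketch of what that referencing can look like, the DR-region fragment below assumes the replica table already exists and builds its ARN from a passed-in name instead of declaring the table. The parameter default, role, and actions are hypothetical.

```yaml
# Sketch of a DR-region stack fragment: the replica table was created by the
# global table, so it is referenced by name rather than defined here.
Parameters:
  OrdersTableName:
    Type: String
    Default: orders            # hypothetical table name
Resources:
  OrderHandlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: orders-table-access
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:GetItem
                  - dynamodb:Query
                Resource: !Sub arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${OrdersTableName}
```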
In this type of system, it is common to run into circular dependencies between stacks over the course of iterative releases, which cause deployment failures in secondary regions due to missing resources. A circular dependency is where two AWS CloudFormation stacks, over multiple releases, end up referencing each other's resources in both directions. The result is that neither stack can be deployed from scratch because of the missing references (see the sketch below). This problem happens more often than you'd think, as it's easy for the development team to need one resource from another stack and simply reference an exported value. It's less common to run into services which exist in your primary region but are not available in your DR region. It can still happen, but it tends to be limited to compute-style processing, such as transcription services or data format services.
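To make the failure mode concrete, here is a hedged sketch of how two releases can back into a circular dependency. The file, stack, and export names are hypothetical fragments, not complete templates.

```yaml
# Release 1 – data-stack.yaml exports the table name; compute-stack.yaml
# imports it (for example, TABLE_NAME: !ImportValue data-OrdersTableName).
Outputs:
  OrdersTableName:
    Value: !Ref OrdersTable
    Export:
      Name: data-OrdersTableName

# Release 2 – compute-stack.yaml now exports a dead letter queue ARN, and a new
# alarm added to data-stack.yaml imports it. Each stack now imports from the
# other, so neither can be created first in a fresh DR region.
Outputs:
  DeadLetterQueueArn:
    Value: !GetAtt DeadLetterQueue.Arn
    Export:
      Name: compute-DeadLetterQueueArn
```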
The time it takes for a pilot light deployment to become operational ranges from roughly 20 minutes, if everything is in order and regularly tested, up to weeks if it isn't, and it can require significant support from the development team.
What are you signing up for?
| Consideration | Pilot light |
| --- | --- |
| Minimum Leadership Level | Basic technical leadership |
| Team Type | Standard development team with operational or solution architect support |
| Team AWS Skill | Medium |
| Risk Tolerance Level | Medium High |
| Average AWS Outage | 6 – 8 hours |
| Regional Recovery Time | 20 minutes to weeks |
| Recovery requiring design changes | Likely |
| Opportunity Level | None |
| Average up front development costs | 1 – 3 months |
To make a pilot light strategy work, you'll need at least two stacks: a data resource stack and a compute stack. If your data resources need data propagated to another resource, such as Amazon OpenSearch Service, then you'll need a third stack. The data resource stack should export the information needed by the other stacks. This ensures resources which are in use don't get accidentally removed without the other stacks first being updated to stop using them.
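As a rough sketch of the export side of that arrangement (the export and resource names are hypothetical), the data stack exports the values the compute stack imports. CloudFormation refuses to delete or change an export while another stack still imports it, which is what provides the safety described above.

```yaml
# data-stack.yaml – owns the durable resources and exports what the other
# stacks need. Export names are hypothetical.
Outputs:
  OrdersTableName:
    Value: !Ref OrdersTable
    Export:
      Name: pilot-data-OrdersTableName
  AssetsBucketName:
    Value: !Ref AssetsBucket
    Export:
      Name: pilot-data-AssetsBucketName

# compute-stack.yaml – only deployed during recovery; consumes the exports,
# for example: TABLE_NAME: !ImportValue pilot-data-OrdersTableName
```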
While it's typically good to split your compute apart from your data sources, this adds to the burden the team takes on during development. As a side note, the reason to split compute from data storage is to handle situations where your compute stack gets into a state where it can neither roll back nor update, without your data resources being caught in that stack. This also means the number of release pipelines increases, or the complexity of your release pipeline increases. With the more complex release process, including more stacks with cross-stack dependencies, you can easily end up with exports between stacks that create circular dependencies, making the system no longer deployable from scratch without any overt signs or symptoms.
You'll also have more complexity because some resources are generated for you in your disaster region, such as Amazon DynamoDB global table replicas, while others need to be created explicitly, such as Amazon S3 buckets. For the resources that already exist, typical AWS CloudFormation approaches are less effective at getting details such as stream resource names, since the replica isn't defined in your templates. This means either more custom code which needs to be maintained, or options provided by third parties which need to be researched and validated.
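One hedged way to bridge that gap, sketched below, is to have custom tooling write the replica table's stream ARN into Parameter Store in the DR region and resolve it at deploy time. The parameter path and function name are hypothetical.

```yaml
# Sketch only: assumes something else (a script or custom resource) has already
# written the replica's stream ARN to this Parameter Store path in this region.
Parameters:
  OrdersStreamArn:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /dr/orders-table/stream-arn   # hypothetical parameter path
Resources:
  OrdersProjectionMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      EventSourceArn: !Ref OrdersStreamArn
      FunctionName: order-projection       # hypothetical function name
      StartingPosition: TRIM_HORIZON
```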
You will also need to consider your vendor integration security measures, such as whitelisting of IP addresses. In these cases, you should set up enough resources, such as VPCs, to keep your integrations working while still leaving your compute resources un-deployed. Having resources such as VPCs adds a new complexity, as your system starts to have configuration which only applies to the specific region you're operating in.
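A hedged sketch of that region-scoped configuration follows, with hypothetical allocation IDs: the vendor has whitelisted a different egress IP per region, so the template looks up the right one for whichever region it is deployed into.

```yaml
# Sketch only: region-specific values kept in a mapping so the same template
# deploys in either region. Allocation IDs and the subnet are hypothetical.
Mappings:
  RegionConfig:
    us-east-1:
      VendorEgressEipAllocation: eipalloc-0aaa1111bbbb22223
    us-west-2:
      VendorEgressEipAllocation: eipalloc-0ccc3333dddd44445
Parameters:
  PublicSubnetId:
    Type: AWS::EC2::Subnet::Id
Resources:
  VendorNatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !FindInMap [RegionConfig, !Ref "AWS::Region", VendorEgressEipAllocation]
      SubnetId: !Ref PublicSubnetId
```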
After setting those components up, you'll want to write a runbook providing step-by-step instructions on what the team needs to do to recover the system. This is the same process as described for the backup and restore strategy, and it needs to include the plan for recovering the system back to the primary region you want it to run in.
How does this impact the team’s mentality?
At this point the team has experience with disaster recovery and includes disaster recovery as part of their design process. Your team has also experienced some of the initial issues of how your solution operates across multiple regions. This leads to the team leveraging some of the more complex aspects of AWS CloudFormation and AWS, although a lot of it is used in the effort to avoid deploying services. At this stage, it's likely that the team still creates issues with globally unique resource names which can impact disaster recovery, but they're starting to develop the skills to avoid that.
This team will also start using more aspects of AWS and AWS CloudFormation as part of their regular feature work, cutting down on the amount of code being written and leveraging more services to eliminate work. Configurations will start finding a home for each region, whether in AWS CloudFormation, source control, or AWS Systems Manager Parameter Store.
For a new team, this mentality will be something they need to grow into, which will happen over the course of their first six months with the right leadership in place.
Conclusion
This approach to disaster recovery is not a destination for most teams, but more of a pit stop on the way to a much better disaster recovery posture.