Service Disruption in the Public Cloud
12/08/2021 By Escala 24x7 Engineering Team
AWS Disaster Response Competency
This week’s AWS disruption is yet another reminder of an important principle from Werner Vogels, CTO at Amazon: “Everything fails, all the time.” The cloud is often thought of as a savior that resolves all your problems without any downtime. While cloud services generally beat on-premises data centers on reliability and stability, that does not mean they never fail*.
The key to maintaining a 24/7 operation is not only choosing the right cloud provider or the right software; it is also making architectural decisions that anticipate failure and handle it gracefully. It is equally important to measure the cost of downtime by carrying out a business impact analysis (BIA), so you can determine which strategy will best support your needs at a manageable cost.
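The arithmetic behind a basic BIA can be sketched in a few lines. This is a minimal illustration, not a full methodology: the class name, field names, and all figures are hypothetical inputs that each business would replace with its own numbers.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """Hypothetical BIA inputs; every value is an assumption supplied by the business."""
    revenue_per_hour: float       # revenue lost for each hour the service is down
    staff_cost_per_hour: float    # cost of idle or diverted staff per hour of outage
    expected_outage_hours: float  # hours of outage expected per year


def hourly_loss(estimate: DowntimeEstimate) -> float:
    """Total loss for one hour of downtime."""
    return estimate.revenue_per_hour + estimate.staff_cost_per_hour


def annual_downtime_cost(estimate: DowntimeEstimate) -> float:
    """Expected yearly cost of downtime: hourly loss times expected outage hours."""
    return hourly_loss(estimate) * estimate.expected_outage_hours


def strategy_is_justified(
    estimate: DowntimeEstimate,
    strategy_cost_per_year: float,
    outage_hours_avoided: float,
) -> bool:
    """A strategy pays for itself when the downtime cost it avoids covers its yearly cost."""
    avoided_cost = hourly_loss(estimate) * outage_hours_avoided
    return avoided_cost >= strategy_cost_per_year


if __name__ == "__main__":
    # Illustrative figures only.
    example = DowntimeEstimate(
        revenue_per_hour=2000.0,
        staff_cost_per_hour=500.0,
        expected_outage_hours=10.0,
    )
    print(annual_downtime_cost(example))
    print(strategy_is_justified(example, strategy_cost_per_year=15000.0, outage_hours_avoided=8.0))
```

A real analysis would also weigh harder-to-quantify losses such as reputation and contractual penalties, which this sketch leaves out.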
Four points help in choosing the right strategy for each situation:
1. One size does not fit all: The right fit for your business
There are essentially four primary strategies for disaster recovery: Backup & Restore, Pilot Light, Warm Standby, and Multi-site active/active (if you are not familiar with the details, this whitepaper explains them well). The one you choose will depend on how quickly you need to recover (your recovery time objective, or RTO) and how much you are willing to pay for it. Ideally, a modern serverless or container-based workload can achieve a very low RTO, whereas with larger and more traditional architectures we typically see RTOs on the order of hours, even when the strategy is Pilot Light.
Understanding this recovery time and the cost of downtime for your business will help you determine the right strategy. We increasingly find that many small and medium businesses can tolerate the typical cloud outage (6-12 hours) and opt for a minimal strategy of backups, primarily to protect against catastrophic data loss. Large enterprises, on the other hand, particularly in the financial sector, are moving towards an active/active or active/passive multi-region architecture. This strategy is closer to high availability than to disaster recovery, and in general it can be built so that no human intervention is required in the event of an outage.
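The trade-off between recovery time and cost can be expressed as a simple lookup: pick the cheapest strategy whose achievable RTO still meets your target. The sketch below assumes the four strategies are ordered from least to most expensive, and the RTO figures attached to them are illustrative placeholders, not measured or guaranteed values; real numbers depend heavily on the workload.

```python
from datetime import timedelta

# Strategies ordered from cheapest to most expensive.
# The achievable RTOs are illustrative assumptions only.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=12)),
    ("Pilot Light", timedelta(hours=2)),
    ("Warm Standby", timedelta(minutes=15)),
    ("Multi-site active/active", timedelta(0)),
]


def cheapest_strategy_for(target_rto: timedelta) -> str:
    """Return the first (cheapest) strategy whose assumed RTO meets the target."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= target_rto:
            return name
    # Only reached for a negative target; fall back to the strongest option.
    return STRATEGIES[-1][0]


if __name__ == "__main__":
    # A business that can tolerate a full day of downtime versus one that cannot.
    print(cheapest_strategy_for(timedelta(hours=24)))
    print(cheapest_strategy_for(timedelta(minutes=1)))
```

Walking the list in cost order reflects the point above: a business that can tolerate a long outage stops at backups, while a very low target RTO pushes the choice towards an active/active design.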
2. Modernize your application
In addition to a good disaster recovery strategy, sometimes the best first step is to modernize your application so that it better supports high availability and multi-region architectures. Migrating your database to Aurora** or DynamoDB***, for example, lets you use Aurora Global Database or DynamoDB global tables to create a multi-region, fault-tolerant architecture. Serverless is another good option, since serverless architectures more easily support a multi-region strategy****.
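Once data is replicated across regions, routing traffic without human intervention comes down to a health check and a priority list. The following is a minimal sketch of that idea, assuming a hypothetical `is_healthy` callback and an example region order; in practice this decision is usually delegated to managed DNS failover or a global load balancer rather than application code.

```python
from typing import Callable, Sequence


def pick_healthy_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the first region, in priority order, whose health check passes."""
    for region in regions:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    # Hypothetical priority order and a simulated outage in the primary region.
    priority = ["us-east-1", "us-west-2"]
    regions_down = {"us-east-1"}
    active = pick_healthy_region(priority, lambda region: region not in regions_down)
    print(active)
```

The health check is passed in as a function so the same selection logic can be exercised against a simulated outage, as here, or wired to a real probe.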
3. Have a Plan
All this planning and strategy is important, but how you respond to the event is equally important. You need an IT team that is well prepared and qualified to react quickly to an outage and execute the recovery strategy that is in place.
4. Have the right Allies and Team
Pick your providers not only for their technology but, more importantly, for their knowledge and experience. Having a partner like Escala 24x7 on your side in these events will provide more peace of mind and allow you to focus on the continued operation of your business while we recover your IT infrastructure in another region.
A lot to take in? Not sure what to do next? Contact us at Escala 24x7 and we can help you find the strategy that best suits your needs, supporting you through the evaluation, implementation, and operation of your cloud disaster recovery strategy.
We can also assist you in Spanish, with a presence in every Spanish-speaking country in Latin America.
Set up an appointment with a specialist at Info@escala24x7.com
*** DynamoDB