How To Achieve Reliability In The Cloud

04 Jun 2020

By Paul Riddle


With more mission-critical services running in the cloud, expectations for cloud reliability are higher than ever. Providing highly available, reliable cloud services is essential for maintaining customer confidence and satisfaction and for preventing revenue loss.

If we accept that cloud failures will occur, the outcomes an organisation should consider for its cloud services fall into four main categories:

  • Maximise service availability to customers: Make sure the service does what the customer wants, when they want it, as much of the time as possible.

  • Minimise the impact of any failure on customers: Assume something will go wrong and design the service so that non-critical components fail first while critical components keep working. Isolate the failure as much as possible so that the minimum number of customers is affected. If the service goes down completely, focus on reducing the amount of time any one customer cannot use the service at all.

  • Maximise service performance: Reduce the impact on customers at times when performance may suffer, such as during an unexpected spike in traffic.

  • Maximise business continuity: Focus on how your organisation and the service respond when a failure occurs. Automate recovery where possible, and carry out disaster recovery drills to ensure your organisation is fully prepared to deal with the inevitable failure.
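The second outcome, letting non-critical components fail while critical ones keep working, can be illustrated with a small sketch. The names here (`with_fallback`, `fetch_recommendations`, `render_order_page`) are hypothetical and stand in for whatever critical and non-critical parts your own service has.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Run a non-critical call and return `fallback` if it raises."""
    try:
        return call()
    except Exception:
        # Assumption: any error in a non-critical call is safe to swallow.
        return fallback


def fetch_recommendations() -> List[str]:
    # Hypothetical non-critical dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def render_order_page(order_id: str) -> Dict[str, object]:
    # Critical part: the order status must always be served.
    page: Dict[str, object] = {"order_id": order_id, "status": "confirmed"}
    # Non-critical part: degrade to an empty list instead of failing the page.
    page["recommendations"] = with_fallback(fetch_recommendations, [])
    return page
```

In this sketch the customer still sees their order even though the recommendation feature is down, which keeps the failure isolated to the part of the page that matters least.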

Best practices for reliability in the cloud fall into three areas. Following them can mitigate some or all of the failure scenarios described above:

  • Foundations
  • Change Management
  • Failure Management

To achieve reliability, a system must have a well-planned foundation and monitoring in place, with mechanisms for handling changes in demand or requirements. The system should be designed to detect failure and automatically heal itself.
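As a rough illustration of "detect failure and automatically heal", the sketch below runs a health check over a set of workers and restarts any that fail it. The `Worker` class and `heal` function are hypothetical; in a real cloud environment this role is usually played by the platform's own health checks and auto-recovery features.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """Hypothetical component with a health check and a restart hook."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def health_check(self) -> bool:
        return self.healthy

    def restart(self) -> None:
        # Assumption: a restart is enough to bring the worker back.
        self.healthy = True
        self.restarts += 1


def heal(workers: List[Worker]) -> List[str]:
    """Restart every worker that fails its health check; return their names."""
    healed = []
    for worker in workers:
        if not worker.health_check():
            worker.restart()
            healed.append(worker.name)
    return healed
```

Running `heal` on a schedule gives a simple detect-and-recover loop; the list it returns is a natural thing to log or alert on.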


Foundations

Before architecting any system, foundational requirements that influence reliability should be in place. For example, you must have sufficient network bandwidth to your data centre. These requirements are sometimes neglected because they fall outside a single project's scope, and that neglect can significantly affect the ability to deliver a reliable system. In an on-premises environment, these requirements can cause long lead times due to dependencies and therefore must be incorporated during initial planning.

Change Management

Being aware of how change affects a system allows you to plan proactively, and monitoring allows you to quickly identify trends that could lead to capacity issues or SLA breaches. In traditional environments, change-control processes are often manual and must be carefully coordinated with auditing to effectively control who makes changes and when they are made.
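One way to spot a trend before it becomes a capacity issue is to fit a straight line to recent usage samples and estimate when it will cross a limit. The sketch below assumes equally spaced samples and a roughly linear trend; `periods_until_limit` is a hypothetical helper, not part of any monitoring product.

```python
from typing import Optional, Sequence


def periods_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many sampling periods after the last sample usage hits `limit`.

    Uses a least-squares line through equally spaced samples.
    Returns 0.0 if the last sample is already at or over the limit,
    and None if usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    if samples[-1] >= limit:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Index at which the fitted line reaches the limit, relative to the last sample.
    return max(0.0, (limit - intercept) / slope - (n - 1))
```

For example, with daily disk usage readings of 50, 55, 60, 65 and 70 percent and a limit of 90 percent, the fitted line grows by 5 points per day and the helper estimates 4 days of headroom.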

Failure Management

In any system of reasonable complexity, failures are expected to occur. You need to know how to detect these failures, respond to them, and prevent them from happening again.

Regularly back up your data and test your backup files to ensure you can recover from both logical and physical errors. A key to managing failure is frequent, automated testing that deliberately causes systems to fail and then observes how they recover. Do this on a regular schedule, and ensure that such testing is also triggered after significant system changes.
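Testing a backup means restoring it, not just checking that the file exists. The following minimal sketch copies a file to a backup location, restores it to a scratch directory, and compares checksums. The function names and the `.bak` suffix are assumptions for illustration; a real backup would normally go to separate storage rather than the same machine.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> bool:
    """Back up `source`, restore it to a scratch location, and compare checksums."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / (source.name + ".bak")  # hypothetical naming scheme
    shutil.copy2(source, backup)
    with tempfile.TemporaryDirectory() as scratch:
        # Restore step: prove the backup can actually be read back.
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)
        return sha256(restored) == sha256(source)
```

A check like this can run automatically after every backup job so that a corrupt or unreadable backup is discovered long before it is needed.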

The objective is to thoroughly test your system-recovery processes so that you are confident that you can recover all your data and continue to serve your customers, even in the face of sustained problems. Your recovery processes should be as well exercised as your normal production processes.
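A recovery drill of this kind can be sketched as a small fault-injection test: take one replica down on purpose and check that the service still answers. `Replica`, `serve`, and `failure_drill` are hypothetical names, and the in-memory replicas stand in for real instances.

```python
import random
from typing import List


class Replica:
    """Hypothetical service replica that can be switched off for a drill."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True

    def handle(self, request: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:{request}"


def serve(replicas: List[Replica], request: str) -> str:
    """Try each replica in turn, failing over when one is down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas down")


def failure_drill(replicas: List[Replica], rng: random.Random) -> bool:
    """Take one random replica down and report whether requests still succeed."""
    victim = rng.choice(replicas)
    victim.up = False
    try:
        serve(replicas, "ping")
        return True
    except RuntimeError:
        return False
    finally:
        victim.up = True  # always restore the replica after the drill
```

A drill that returns False, as it would for a service with a single replica, points to a single point of failure worth fixing before a real outage finds it.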


A highly reliable, and therefore highly available, environment is worth pursuing for the benefits it offers and the improved user experience it delivers. The benefits include:

  • Improved platform availability
  • Reduced mean time to recovery from component failure
  • Operational peace of mind
  • Fewer on-call incidents

With the design principles and guidelines described in this post, you can establish the pillar of reliability for your cloud environment. If you would like AltoStack to help you on your path to cloud reliability, please contact us here.
