Many people believe that reliable systems are immune to service disruptions or infrastructure issues. In fact, reliable systems encounter all of these problems. What makes a system reliable is that it is designed to recover quickly from failure. Designing for reliability can be difficult because it requires a good understanding of the system’s priorities and the availability requirements of each component.

The AWS Well-Architected Framework offers best practices and guidelines to help you understand reliability. Amazon’s Reliability Pillar Whitepaper states:

Traditional on-premises environments can make it difficult to achieve reliability due to single points of failure, insufficient automation, and lack of flexibility. You can build architectures with solid foundations, consistent change management, and proven failure recovery methods by following the guidelines in this paper.

Understanding Availability

Before diving into AWS’ best practices, it is important to understand what availability means.

Availability = Normal Operation Time / Total Time

It is important to remember that not all applications and systems require the same level of availability to be reliable. Each application and system will have its own priorities and requirements.
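To make the ratio concrete, here is a minimal Python sketch that turns an availability target into a yearly downtime budget. The function names and the 365-day year are assumptions made for illustration; they do not come from the AWS whitepaper.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8760 hours


def availability(normal_operation_hours: float, total_hours: float) -> float:
    """Availability = normal operation time / total time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return normal_operation_hours / total_hours


def allowed_downtime_hours(target: float, total_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime that still meet the availability target."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * total_hours


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        hours = allowed_downtime_hours(target)
        print(f"{target:.2%} availability allows {hours:.2f} hours of downtime per year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the right target differs so much from one application to the next.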

Calculating the Total System Availability

When calculating availability, you must also account for any downstream system that is a “hard” dependency, meaning that a disruption to the downstream system directly results in a disruption of the upstream service. Every hard dependency you add downstream lowers the theoretical availability of your application.
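The effect of hard dependencies can be sketched with a small calculation. This example assumes that failures are independent, in which case the theoretical availability is the product of the service's own availability and that of each hard dependency. The function name and the sample figures are hypothetical.

```python
from math import prod
from typing import Iterable


def availability_with_hard_dependencies(own: float, dependencies: Iterable[float]) -> float:
    """Theoretical availability of a service that fails whenever it, or any
    hard dependency, fails. Assumes all failures are independent."""
    return own * prod(dependencies)


if __name__ == "__main__":
    # Hypothetical service at 99.99% with two hard dependencies at 99.99% each.
    total = availability_with_hard_dependencies(0.9999, [0.9999, 0.9999])
    print(f"Theoretical availability: {total:.4%}")
```

Under these assumptions the combined figure is always lower than that of the weakest component, so a service can never be more available than its hard dependencies.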

Availability is not free

It is easy to misjudge the actual availability requirements of a system, and the results can be disastrous. The cost of providing a service increases as availability requirements rise: additional redundant systems are required and hard dependencies must be limited. As you strive for greater availability, software development costs can rise quickly. Very high levels of availability will also force your teams to work more slowly, which can reduce your ability to innovate. For these reasons, it is important to determine the availability requirements of a system accurately.
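The trade-off between redundancy and cost can be illustrated with a rough sketch. It assumes that redundant copies fail independently and that the system stays up as long as at least one copy is up. Real systems rarely meet both assumptions fully, and the function name is hypothetical.

```python
def redundant_availability(component: float, copies: int) -> float:
    """Availability of `copies` independent, redundant components where
    one healthy copy is enough to serve traffic."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component) ** copies


if __name__ == "__main__":
    # Hypothetical component with 99% availability.
    for copies in range(1, 4):
        result = redundant_availability(0.99, copies)
        print(f"{copies} copies (about {copies}x the infrastructure): {result:.6f}")
```

In this simplified model each extra copy adds roughly the same infrastructure cost but a smaller absolute gain in availability, which is one reason the cost of each additional "nine" tends to rise.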

Networking and availability

It takes a careful approach to networking to build reliable cloud systems. There are many things to consider, such as:

  • Network topology
  • Future growth
  • Networking across Availability Zones and Regions
  • Resilience to failures and misconfiguration
  • Traffic patterns
  • DDoS mitigation
  • Private connectivity

AWS offers a wide range of services and tools to help you create highly available networks. These include Amazon VPC, AWS Direct Connect, Amazon Route 53, Elastic Load Balancing and AWS Shield. It’s also important to know the best practices for designing your network.

  1. Amazon VPC offers many connectivity options, including options that use the internet and options that use AWS Direct Connect. VPC Peering allows you to connect VPCs within and across Regions. There are also many VPN options that offer private connectivity. AWS provides an excellent whitepaper entitled Amazon Virtual Private Cloud Connectivity Options to help guide your networking decisions.
  2. When designing your network topology, make sure you consider security and protection. Follow existing standards for private address ranges. Use subnets to shield your applications from the internet. Also consider using AWS Web Application Firewall and AWS Shield Advanced to mitigate DoS attacks such as SYN floods.

Recovery Oriented Computing

Building resilient and reliable systems requires a specific mindset. Researchers coined the term “Recovery Oriented Computing” (ROC) for a systematic approach to improving recovery in the event of failure.

ROC identifies the following main characteristics that improve recovery:

  • Redundancy and isolation
  • System-wide ability to roll back changes
  • Monitoring and determining health
  • Diagnostics
  • Automated recovery
  • Modular design
  • Capacity to communicate effectively

ROC recognizes that all systems can fail and that failures vary in type and extent. They can include hardware and software malfunctions as well as communication and operational errors. ROC emphasizes rapid detection of failures and automation of well-tested recovery paths.

Rather than creating many special cases, ROC maps many failure types to a small, tested set of recovery paths. A common mistake when designing reliable systems is to rely on recovery paths that have never been tested.
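The idea of mapping many failure types onto a few tested recovery paths can be sketched as a simple lookup table. All failure types, recovery paths and names below are hypothetical examples chosen for illustration; they are not taken from the ROC research or from AWS.

```python
from enum import Enum, auto


class FailureType(Enum):
    PROCESS_CRASH = auto()
    MEMORY_LEAK = auto()
    DISK_FAILURE = auto()
    HOST_UNREACHABLE = auto()
    BAD_DEPLOYMENT = auto()
    BAD_CONFIG_CHANGE = auto()


class RecoveryPath(Enum):
    RESTART_PROCESS = auto()
    REPLACE_INSTANCE = auto()
    ROLL_BACK_CHANGE = auto()


# Six failure types share only three recovery paths, so each path is
# exercised often and can be tested thoroughly.
RECOVERY_PATHS = {
    FailureType.PROCESS_CRASH: RecoveryPath.RESTART_PROCESS,
    FailureType.MEMORY_LEAK: RecoveryPath.RESTART_PROCESS,
    FailureType.DISK_FAILURE: RecoveryPath.REPLACE_INSTANCE,
    FailureType.HOST_UNREACHABLE: RecoveryPath.REPLACE_INSTANCE,
    FailureType.BAD_DEPLOYMENT: RecoveryPath.ROLL_BACK_CHANGE,
    FailureType.BAD_CONFIG_CHANGE: RecoveryPath.ROLL_BACK_CHANGE,
}


def recovery_path_for(failure: FailureType) -> RecoveryPath:
    """Return the recovery path for a failure. Unmapped failures fall back
    to replacing the instance instead of getting a new, untested path."""
    return RECOVERY_PATHS.get(failure, RecoveryPath.REPLACE_INSTANCE)


if __name__ == "__main__":
    for failure in FailureType:
        print(f"{failure.name} -> {recovery_path_for(failure).name}")
```

Keeping the set of recovery paths small means every path runs regularly, so none of them is an untested route that only gets exercised during a real incident.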

Understanding your availability needs

AWS is one of many organizations that divide their services into two categories: “Data plane” and “Control plane”. The data plane delivers service in real-time, and the control plane handles less critical configuration tasks. Data plane operations in AWS include DynamoDB read/write operations and RDS connectivity. 

Application Design for Availability

AWS has extensive experience operating its own services and has worked with thousands of customers to help them design their applications for availability. This experience has led to five common practices that can be applied to increase availability.

Fault Isolation Zones

Because most applications are made up of multiple components that have varying availability and depend on each other, it is necessary to find ways to make the overall system more available than its individual components. AWS offers several “Fault Isolation Zone” constructs to assist you in this endeavor, including Availability Zones and Regions.

Availability Zones are a great choice when you need fault isolation together with low latency, such as for active/active configurations that need synchronous replication. Regions are more isolated from one another, but because of their geographical separation, cross-Region operations are not suitable for low-latency applications.

Micro-Service Architecture

Microservices are a hallmark architectural development of the cloud era and a well-known way to achieve the ROC characteristic of modular design. Their greatest benefit is that each service is small and simple. By dividing your application into well-defined microservices, you can concentrate your investment and attention on the microservices with the highest availability requirements.


It is difficult to design reliable and highly available applications. The AWS Well-Architected Framework has many best practices that will assist you in your endeavor.
