The excerpt gives me a headache too. A couple of days ago, I decided to go for the AWS Certified Solutions Architect certification. Why? 1) I thought I was a pro when it came to AWS, 2) Check under #3. I quickly realized that I actually don’t know as much as I thought, but a couple of online CSA-prep courses and several all-nighters cramming through hands-on labs later, I think I’ve got this.
The certification is about leveraging AWS’s service portfolio in a way that brings failure rates and downtime down to a negligible margin, keeps solutions cost-effective, matches the best components to each use case, and so on. By default, most cloud services are not inherently fault-tolerant, so systems must be designed to ensure fault tolerance and high availability. Luckily, AWS has a number of great services which, when put together, help system admins sleep with ease.
As implied in the excerpt, I will be discussing an architectural practice for producing cost-effective solutions, while achieving high availability and fault tolerance in system design, utilizing AWS’s Elastic Load Balancer (ELB) and Auto Scaling (AS).
Fault Tolerance & Availability
These are two (among many) concerns that pop up when people hear the word “cloud”. They are often a deal breaker when migration to the cloud is in question, because, well, frankly, we will never have full control over or visibility into infrastructure that resides in different places around the world. It sucks to feel “out of control”. BUT! Not all hope is lost just yet. With cloud services becoming increasingly advanced, we can now architect solutions that diminish such concerns.
Fault Tolerance is simply the ability to stay operational regardless of components failing. The more fault tolerant we can design our system to be – the better. As mentioned, not everything in the cloud is inherently fault tolerant, but a system can be designed to achieve acceptable fault tolerance. AWS provides an architect with tools to fulfill a vast array of use cases. Some of the more commonly used ones are (by no means an exhaustive list):
- ELB (traffic balancing)
- Auto Scaling
- EBS (backup data)
- Route 53 (DNS failover)
- SQS (de-coupled system design)
Availability is the proportion of time during which a service is expected to be operational. For instance, the SLA for EC2 states the following:
“AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage (defined below) of at least 99.95%”
99.95% comes down to around 4.38 hours of downtime per year, meaning that AWS guarantees EC2 to be available the rest of the time. That may not seem like a lot, but depending on what type of service we offer to our audience, four hours can cause chaos. Let’s say we are a payment-processing service and, just as a user initiates a transfer of funds from A to B, our availability zone goes down. Awkward. Logs will probably fail to record the completed transaction, the payment will fail, and the funds will be lost somewhere in the air. All hypothetical, but you get the idea.
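As a quick sanity check on that figure, the SLA arithmetic is easy to reproduce:

```python
# Convert the SLA's uptime percentage into allowed downtime.
HOURS_PER_YEAR = 24 * 365  # ignoring leap years
uptime_percent = 99.95     # EC2's SLA threshold (Monthly Uptime Percentage)

downtime_fraction = (100 - uptime_percent) / 100
downtime_hours_per_year = downtime_fraction * HOURS_PER_YEAR
downtime_minutes_per_month = downtime_hours_per_year / 12 * 60

print(round(downtime_hours_per_year, 2))     # ~4.38 hours per year
print(round(downtime_minutes_per_month, 1))  # ~21.9 minutes per month
```

Note that since the SLA is defined per month, the budget is really about 22 minutes of downtime in any given month, not one contiguous 4.38-hour block per year.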
For a system to be considered Highly Available and Fault Tolerant, it must, at a minimum, have one ELB serving traffic to an Auto Scaling Group (ASG) spanning across two Availability Zones (AZ), with at least two EC2 instances, in separate subnets, within each AZ. Say that 5 times fast.
Elastic Load Balancing
A Load Balancer (LB) is responsible for evenly distributing incoming traffic across associated instances. It is AZ agnostic, which is great, because we can make it work across a whole VPC, as opposed to having to define a new one for every AZ. Another great thing is that it keeps its own DNS record, meaning it can be accessed directly. Furthermore, this can be used to do cool things such as terminating SSL directly at the LB, consequently reducing the computing power required by an EC2 instance (save $$).
There are two types of LB:
- Classic: operates on application- or network-level data. Works fine with private clouds; cheap.
- Application: operates on advanced application-level data (including data from requests). More expensive; used for things like HTTPS and custom apps.
Classic is great for internal use (e.g. for balancing private subnets) or when data privacy doesn’t matter (LOL), while the Application LB has more use-case-specific features.
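To make the distinction concrete, here is a rough sketch of the request parameters each type takes; the Classic LB uses the older `elb` API with the listener defined up front, while the Application LB uses the newer `elbv2` API. All subnet/SG IDs and names are placeholders, and the actual boto3 calls are left as comments since they require AWS credentials:

```python
# Placeholder network IDs -- substitute your own.
SUBNETS = ["subnet-aaaa1111", "subnet-bbbb2222"]  # one per AZ
SECURITY_GROUPS = ["sg-cccc3333"]

# Classic LB: listener (port mapping) is part of the create call.
# With boto3: boto3.client("elb").create_load_balancer(**classic_params)
classic_params = {
    "LoadBalancerName": "my-classic-lb",
    "Listeners": [{"Protocol": "HTTP", "LoadBalancerPort": 80,
                   "InstanceProtocol": "HTTP", "InstancePort": 80}],
    "Subnets": SUBNETS,
    "SecurityGroups": SECURITY_GROUPS,
}

# Application LB: listeners, target groups, and routing rules are
# created separately after the LB itself exists.
# With boto3: boto3.client("elbv2").create_load_balancer(**alb_params)
alb_params = {
    "Name": "my-application-lb",
    "Type": "application",
    "Scheme": "internet-facing",
    "Subnets": SUBNETS,
    "SecurityGroups": SECURITY_GROUPS,
}
```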
Auto Scaling
This is AWS’s super cool service that practically embodies the term elasticity. It is powerful when combined with other services (such as ELB) to further improve fault tolerance and availability.
It has the power to increase or decrease the number of provisioned instances within an ASG. For that to be possible, it requires specifying multiple components according to the use case it is being designed for:
- ASG – logical unit for scaling and management
  - Min/max instances
  - Default number of instances
- Launch Configuration – an instance template
  - AMI ID
  - Instance type
  - Security Groups
- Scaling Plan – when and how to scale
  - Metric (CloudWatch)
A ton of different options and specifications, as you can tell, to help enable a fully-customizable plan.
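Those components map fairly directly onto the Auto Scaling API. A minimal sketch of the two parameter sets (all IDs and names below are placeholders; with boto3 the actual calls would be `boto3.client("autoscaling").create_launch_configuration(**launch_config)` and `...create_auto_scaling_group(**asg_params)`):

```python
# Launch Configuration: the instance template.
launch_config = {
    "LaunchConfigurationName": "web-launch-config",
    "ImageId": "ami-12345678",          # AMI ID (placeholder)
    "InstanceType": "t2.micro",
    "SecurityGroups": ["sg-cccc3333"],  # placeholder SG
}

# ASG: the logical unit for scaling and management.
asg_params = {
    "AutoScalingGroupName": "web-asg",
    "LaunchConfigurationName": launch_config["LaunchConfigurationName"],
    "MinSize": 2,              # never fewer than 2, per the FT & HA definition
    "MaxSize": 6,
    "DesiredCapacity": 2,      # default number of instances
    # Comma-separated subnet IDs spanning two AZs (placeholders):
    "VPCZoneIdentifier": "subnet-aaaa1111,subnet-bbbb2222",
    # Attach the Classic ELB so new instances receive traffic:
    "LoadBalancerNames": ["my-classic-lb"],
}
```

The scaling plan (when and how to scale) is attached afterwards via scaling policies and CloudWatch alarms, sketched further below in the walkthrough.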
How to Set It Up?
I will include a step-by-step process for enabling ELB and making it work in conjunction with Auto Scaling to improve fault tolerance and ensure high availability.
Requirements (to fulfill FT & HA definition):
- VPC spanning across multiple AZs
- ASG within the VPC (also spanning across AZs)
- One subnet in each AZ
- One EC2 instance (with public IP) in each subnet
To make things easier, I made a visual diagram:
Alright, now let’s go.
- Navigate to Load Balancing->Load Balancers
- Define Load Balancer
  - Select subnets (advanced VPC)
  - Listener configuration (:80 by default)
- Assign a SG or create a new one
- Health check configuration
  - Decide on the baseline (what gets pinged/where)
  - Decide on the intervals at which health checks are performed
- Add EC2 instances
  - If you added subnets with existing instances, you may add those (remember: min 1 per subnet per Availability Zone!)
  - Enable/disable Cross-Zone LB (should be enabled for the purpose of FT and HA)
- Add tags (useful for organizing purposes)
- Review and launch

You should see something like this:
- Navigate to the Auto Scaling Groups tab (Console->EC2->Auto Scaling->Auto Scaling Groups)
- Click “Create Auto Scaling Group”
- Create a new launch configuration
  - Choose an AMI
    - Can be a new one, a snapshot of your existing template, or even one imported from the Marketplace
  - Choose an instance type
  - Configure details of the launch configuration
    - You can apply IAM roles here – necessary if used in the real world
    - Apply custom scripts that will run upon provisioning – again, a necessity if this is used in practice
    - IP type (public/private)
  - Configure storage details
  - Configure Security Groups
    - Add HTTP, Port 80, Source 0.0.0.0/0
- Configure the ASG
  - Give it a name
  - Number of instances to begin with
    - Must start with a min of 2 for the purpose of this example
  - Select the VPC which has the ELB configured
  - Add subnets
    - Again, minimum two, in different AZs
  - Advanced settings
    - Check the box to receive traffic from the LB
    - Select the LB
- Configure scaling policies
  - Check the box to adjust the capacity
  - Select the minimum and the maximum number of instances
    - Min will be 2
  - Increase group size
    - Give the rule a name
    - Create a new alarm
      - Uncheck the “Send notification” box (this is used when SNS is configured)
      - Apply: when CPU Utilization >= 80%
  - Decrease group size
    - Give it a name
    - Create a new alarm
      - Uncheck the “Send notification” box
      - Apply: when CPU Utilization <= 40%
- Configure notifications
  - Used when SNS is enabled; skip
- Configure tags
  - Preference; can skip
- Create the ASG
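For reference, the two scaling policies from the wizard can also be expressed as API parameters. A hedged boto3-style sketch (all names are placeholders; the actual calls, left as comments, would be `put_scaling_policy` on the Auto Scaling client, which returns a policy ARN to wire into each alarm’s `AlarmActions` via CloudWatch’s `put_metric_alarm`):

```python
# Scale-out policy: add one instance when the high-CPU alarm fires.
scale_out_policy = {
    "AutoScalingGroupName": "web-asg",       # placeholder ASG name
    "PolicyName": "increase-group-size",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 1,                  # +1 instance
}
high_cpu_alarm = {
    "AlarmName": "web-asg-high-cpu",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Threshold": 80.0,                       # >= 80% -> scale out
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
}

# Scale-in policy: remove one instance at/below 40% CPU. The ASG's
# MinSize of 2 keeps it from shrinking below the FT & HA floor.
scale_in_policy = {**scale_out_policy,
                   "PolicyName": "decrease-group-size",
                   "ScalingAdjustment": -1}
low_cpu_alarm = {**high_cpu_alarm,
                 "AlarmName": "web-asg-low-cpu",
                 "Threshold": 40.0,
                 "ComparisonOperator": "LessThanOrEqualToThreshold"}

# With boto3 (requires credentials):
# arn = boto3.client("autoscaling").put_scaling_policy(**scale_out_policy)["PolicyARN"]
# boto3.client("cloudwatch").put_metric_alarm(**high_cpu_alarm, AlarmActions=[arn])
```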
If all went well, you will have an ASG looking something along the lines of:
If you are looking at this console the moment you create the ASG, the number of instances will be 0 (that is, if you didn’t select any existing ones). This simply means they are being initialized.
When (if) an instance in this ASG fails a health check, a new one will appear to fulfill the minimum-instances requirement (whatever you specified). If you do not have an actual running application with which to trigger this type of behavior, you can test whether it works by visiting EC2 Dashboard->Instances and terminating an instance placed within the ASG. Based on the health check interval, you will see a new one initializing on its own!
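If you’d rather script that failure test than click through the console, the Auto Scaling API has a call made for exactly this. A small sketch (the instance ID is a placeholder; the real calls are commented out since they need AWS credentials):

```python
terminate_params = {
    "InstanceId": "i-0123456789abcdef0",  # placeholder: an instance inside the ASG
    # Keep desired capacity unchanged, so Auto Scaling is forced
    # to launch a replacement instead of just shrinking the group.
    "ShouldDecrementDesiredCapacity": False,
}

# With boto3:
# asc = boto3.client("autoscaling")
# asc.terminate_instance_in_auto_scaling_group(**terminate_params)
# Then watch the replacement launch appear in the activity history:
# asc.describe_scaling_activities(AutoScalingGroupName="web-asg")
```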
We have just created a Fault-Tolerant and Highly-Available system design on AWS!
Caution: Don’t be a n00b like me; I made an error when specifying health checks during early testing, so I had a billion instances spinning up and terminating for a while. GG.
But hey, failure is how we learn, right?
There we have it – a guide to architecting a cost-friendly solution for achieving fault tolerance and high availability. As I prepare for the CSA exam I plan to post a couple more AWS-flavored discussions to strengthen my knowledge and get comfortable with what I am going up against.
As always; thank you for your time!