April 11th, 2017
Automation is Hard
By Justin Iravani

Here at 1Strategy, part of what we do is education. To that end, we build labs to help customers get hands-on experience across various AWS services.

For example, creating an EC2 instance step-by-step via the console is a very instructive process and helps you understand the various pieces of machinery involved. What is an AMI? What are all the different instance types? How does an EC2 instance fit into the greater context of VPC, subnets, security groups, and Elastic IPs? Our lab environment provides a safe space for customers to learn and explore.

Typically, our labs run 30 minutes to an hour and involve around 20 people, each spinning up some resource in our AWS lab account. Because these resources are in our account, we are obviously incurring the related costs and are fiscally obligated to minimize them as best we can.

The best set of cloud maxims I’ve ever heard was from a 2016 AWS re:Invent talk by Soofi Safavi:

  1. If it moves, measure it
  2. If it’s not monitored, it doesn’t exist
  3. If it’s not automated, it’s not finished

I absolutely agree with all of these, but especially number three. While we could—after the boot camps—have someone go in and manually shut down those resources, that is a horrendously inefficient use of time and effort.

Enter Automation

Part of my role in our trainings has been to do some fairly simple automated resource teardown and automated user provisioning in our training environment. This automation does 3 things:

  1. Saves money: There are no child resources left running
  2. Saves time: We don’t need a team member to sit around for 30 minutes manually turning things off
  3. Provides a much better experience for our attendees: It takes about 20 seconds for us to create an arbitrary number of user accounts (with appropriate permissions), send out credentials, and have people start logging in. I’ve heard stories of other APN partners spending a significant portion of their lab time just provisioning user accounts for attendees; not a great experience if you’ve paid hard-earned cash to wait for some “expert” to create an AWS account for you.

Now that I’ve been at it for a little while, I’ve had some concepts reinforced and I wanted to reiterate them for you, dear reader. While perhaps obvious to some, I have seen customers struggle with how to get rid of orphaned resources.

Automated teardown is hard. Oftentimes, if you look in the console, you can be led to believe it’s easy to tear down resources. Take a VPC, for example: in the console it’s pretty much two clicks to delete one. Click “I acknowledge,” press “Yes, Delete,” and you’re done.

It should be easy to do this programmatically as well right? WRONG.

The console hides quite a bit of complexity.

If you wanted to tear down a VPC from a Lambda function, your Lambda function has to check for any related resources. That is to say, you need to check:

  • aws ec2 describe-instances
  • aws ec2 describe-subnets
  • aws ec2 describe-route-tables
  • aws ec2 describe-nat-gateways
  • aws ec2 describe-internet-gateways
  • aws ec2 describe-security-groups
  • aws ec2 describe-network-acls
  • aws ec2 describe-vpcs

Then once you’ve listed them, you have to figure out how to delete each type of resource and its dependencies. This is non-trivial, as each resource type has its own data format.
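To make this concrete, here’s a sketch of that inventory step in Python with boto3. The helper operates on already-fetched describe payloads (so the shapes below match the real EC2 API responses), while the boto3 calls themselves are shown in comments; the function name and usage are illustrative, not from our teardown repo.

```python
# Sketch: collect the IDs of resources attached to a given VPC, from
# the payloads returned by the describe_* calls listed above. The
# function is illustrative; the payload shapes match the EC2 API.

def vpc_dependencies(vpc_id, subnets, route_tables, security_groups):
    """Return {resource_type: [ids]} for resources tied to vpc_id."""
    return {
        "subnets": [s["SubnetId"]
                    for s in subnets["Subnets"]
                    if s["VpcId"] == vpc_id],
        "route_tables": [r["RouteTableId"]
                         for r in route_tables["RouteTables"]
                         if r["VpcId"] == vpc_id],
        "security_groups": [g["GroupId"]
                            for g in security_groups["SecurityGroups"]
                            if g["VpcId"] == vpc_id],
    }

# Real use:
#   import boto3
#   ec2 = boto3.client("ec2")
#   deps = vpc_dependencies("vpc-12345678",
#                           ec2.describe_subnets(),
#                           ec2.describe_route_tables(),
#                           ec2.describe_security_groups())
```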

For example, if you want to find the instance ID of an EC2 instance, it’s a few layers deep in the describe call (note that both Reservations and Instances are lists):
describe_payload['Reservations'][0]['Instances'][0]['InstanceId']

Whereas an RDS instance identifier is located somewhere else entirely:
describe_payload['DBInstances'][0]['DBInstanceIdentifier']

To make it more obnoxious, there is no universal resource.destroy() method (Do you need to delete, terminate, or disassociate a resource? Better go and check the documentation.).
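To illustrate the point, here are the actual boto3 client method names for a handful of these resource types; notice there is no common verb anywhere:

```python
# The boto3 client method that removes each resource type. These are
# real boto3 EC2/RDS client method names; the dict itself is just an
# illustration of how inconsistent the "destroy" verbs are.
DELETE_VERB = {
    "ec2_instance": "terminate_instances",          # terminate
    "subnet": "delete_subnet",                      # delete
    "nat_gateway": "delete_nat_gateway",
    "internet_gateway": "delete_internet_gateway",  # must detach first!
    "route_table": "delete_route_table",
    "security_group": "delete_security_group",
    "elastic_ip": "release_address",                # release
    "rds_instance": "delete_db_instance",           # lives on the rds client
}
```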

Ordering here is also very important, and while it makes sense logically, you’ll probably try to delete something out of order at least once (Is it route tables that are deleted first? Or is it subnets?).
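For the curious, one workable order for a VPC’s resources is sketched below. This reflects the dependency chain I’ve run into, not an official AWS prescription, so verify it against your own environment:

```python
# Sketch: one teardown order that respects the dependencies. Subnets
# can't be deleted while instances or NAT gateways occupy them, route
# tables can't go while associated with subnets, and the VPC goes last.
TEARDOWN_ORDER = [
    "instances",          # terminate first; most things hang off these
    "nat_gateways",       # live in subnets, hold Elastic IPs
    "subnets",            # only after their occupants are gone
    "route_tables",       # the non-main ones
    "internet_gateways",  # detach from the VPC, then delete
    "security_groups",    # the non-default ones
    "network_acls",       # the non-default ones
    "vpc",                # last: everything inside must already be gone
]
```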

Furthermore, in our use case, we’re taking a wrecking ball to everything, which makes it somewhat easier. However, if you wanted to be more intelligent about which resources need to be destroyed and when (when is a whole different discussion), a great deal of effort is required.

My point is that to do this manner of teardown requires a lot of custom coding. Surely, there has got to be a better way!

Infrastructure as Code

For our use case, there really isn’t, but good news for you: there is a better way!

Anyone familiar with AWS will have heard of CloudFormation. This powerful AWS offering not only allows you to represent your infrastructure as code, but also handles a significant amount of complexity for you. Specifically, it creates the resources in your environment (grouped into units called Stacks) in the proper order. CloudFormation also keeps track of the resources it provisioned and provides an easy way to delete all of them at once when they are no longer needed, no obnoxious custom code required!
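With boto3, the whole teardown collapses into one call per stack. The sketch below assumes our lab stacks share a “lab-” name prefix; that naming convention is my own illustration, not a CloudFormation requirement:

```python
# Sketch: select only the stacks the automation is allowed to delete.
# Deleting a stack removes every resource it created, in the right
# order, with no custom per-resource code. The "lab-" prefix is an
# assumed naming convention, not a CloudFormation feature.

def lab_stack_names(describe_stacks_payload, prefix="lab-"):
    return [s["StackName"]
            for s in describe_stacks_payload["Stacks"]
            if s["StackName"].startswith(prefix)]

# Real use:
#   import boto3
#   cfn = boto3.client("cloudformation")
#   for name in lab_stack_names(cfn.describe_stacks()):
#       cfn.delete_stack(StackName=name)
```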

While this seems obvious to many, I’ve had customers who bump up against this only after they have servers and databases all over the place, most of which (of course) aren’t properly tagged or identified with what they do, who owns them, whether they are needed by other resources, etc.

Be it CloudFormation, Terraform, Ansible, Puppet, Chef, or SaltStack, I suggest finding an Infrastructure as Code (IaC) tool as soon as possible.

For those whose emphasis is more around serverless, there are IaC tools in that space as well
(E.g. Serverless Framework https://serverless.com/).

A small warning, though: be thoughtful about which tool you choose. Take AWS Labs’ Chalice, for example (https://github.com/awslabs/chalice).

This is an awesome starting place for developing and deploying Lambda functions and API Gateways. Simply navigate to your Lambda code and run chalice deploy. It will upload your Lambda function, create and configure an API Gateway, and even create an IAM policy to make sure everything runs properly. There is zero learning curve.

That said, there is no chalice undeploy, so all of those resources you just created have to be removed manually. Easy if you have 5 IAM policies in your account, not as easy if you have hundreds (or even just a hundred).
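One way to hunt down those leftovers by hand is to filter IAM’s listings on your app name. This assumes chalice embeds the app name in the roles it creates (true in my experience, but verify in your own account); the helper is a sketch of that idea, not a chalice feature:

```python
# Sketch: find IAM roles whose names mention the chalice app, so they
# can be cleaned up manually. Assumes (unverified for every chalice
# version) that the app name appears in the generated role names. The
# payload shape matches the IAM list_roles API.

def roles_mentioning(list_roles_payload, app_name):
    return [r["RoleName"]
            for r in list_roles_payload["Roles"]
            if app_name in r["RoleName"]]

# Real use:
#   import boto3
#   iam = boto3.client("iam")
#   leftovers = roles_mentioning(iam.list_roles(), "myapp")
```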


In Summary

You can save yourself a lot of time, money, and headaches by having a strategy to manage resources and automation.


Code Sample

Below is a repository with v1 of my teardown scripts for a lab in which attendees create a security group, an EC2 instance, and an RDS instance that the EC2 instance talks to.

https://github.com/1Strategy/automated_teardown

Initially, these scripts were run as three separate cron jobs (scheduled via CloudWatch Events). The ordering is: first the EC2 instances are terminated, then the RDS instances, and lastly the security groups are deleted. As a side note, security groups don’t have any kind of creation-time property (when would you ever need that?), so there needed to be a delay between the EC2/RDS teardown jobs and the security group job.
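Because there is no creation timestamp to filter on, one alternative to the fixed delay is to lean on a naming convention instead. The sketch below assumes lab security groups share a “lab-” name prefix (my assumption, not something the API enforces), and skips the default group, which can never be deleted:

```python
# Sketch: select security groups eligible for teardown by name prefix,
# since the EC2 API exposes no creation time for them. The "lab-"
# prefix is an assumed convention; the payload shape matches the EC2
# describe_security_groups API.

def deletable_group_ids(describe_security_groups_payload, prefix="lab-"):
    return [g["GroupId"]
            for g in describe_security_groups_payload["SecurityGroups"]
            if g["GroupName"].startswith(prefix)
            and g["GroupName"] != "default"]  # default can't be deleted
```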

Admittedly, this initial approach is very ham-handed. There are certainly other ways to accomplish this (say AWS Config), but it worked at the time.

At AWS re:Invent 2016, AWS released a Lambda orchestration tool called Step Functions (https://aws.amazon.com/step-functions/). In March 2017, AWS added a feature to CloudWatch Events that allows it to invoke Step Functions. As such, the v2 implementation of the teardown can be a lot more sophisticated. For example, ordering, error handling, and retries can all be handled by Step Functions. This will probably be covered in a future blog post.
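To give a flavor of v2, here is a minimal Amazon States Language definition (expressed as a Python dict) for that ordering, with a retry on the security group step. The Lambda ARNs are placeholders, and this is a sketch of the idea rather than the eventual implementation:

```python
# Sketch: a minimal Step Functions state machine for the teardown
# ordering (EC2 -> RDS -> security groups). The ARNs are placeholders;
# the state fields (Type, Resource, Next, Retry, End) are standard
# Amazon States Language.
TEARDOWN_STATE_MACHINE = {
    "StartAt": "TerminateEC2",
    "States": {
        "TerminateEC2": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:terminate_ec2",
            "Next": "TerminateRDS",
        },
        "TerminateRDS": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:terminate_rds",
            "Next": "DeleteSecurityGroups",
        },
        "DeleteSecurityGroups": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:delete_sgs",
            # Retry instead of a blind delay: SGs can't go until the
            # instances using them are fully terminated.
            "Retry": [{"ErrorEquals": ["States.ALL"],
                       "IntervalSeconds": 30,
                       "MaxAttempts": 3}],
            "End": True,
        },
    },
}
```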