CloudWatch is an AWS monitoring service that lets you keep an eye on infrastructure and applications. It allows you to collect and make use of metrics, set alarms, store your logs in a central location, and take automated actions when certain events take place. Additionally, you can create dashboards to visualize this information.
In my experience, most organizations significantly underutilize CloudWatch and its rich set of features. Many opt to use third-party solutions for monitoring and logging. While this may be the right choice for some, I still think there’s a significant amount of out-of-the-box functionality provided by AWS that is being left on the table.
With all the data being automatically generated by AWS services and stored in CloudWatch, there’s an abundance of information just waiting to be visualized and made useful for teams.
CloudWatch dashboards give you a customizable “home page” in the CloudWatch console to monitor resources in a single place, even if they are spread across regions. Dashboards can contain metrics, alarms, and even static text that you may want to see in the same place.
Today I want to take you through the process of creating a dashboard. As we move through each step, you’ll see a bit of what’s possible and also become aware of some of the nuances, such as statistics, AWS namespaces, and axes. To get started, log in to the AWS console and go to “Services” -> “CloudWatch.” Initially, you’ll see the main summary page:
From here, click on “Dashboards” on the left-hand side of the page. You’ll then see the list of dashboards in your account, if any:
Click “Create Dashboard” at the top to create a new one and give it a name:
You’re then asked to add your first widget to the dashboard and to choose whether the metric will be graphed as a line, stacked area, or number. You also have the option to add static text, if desired. Note that this process lets you add a metric to a dashboard; adding an alarm is also possible, and we’ll get to that in a minute. Let’s go ahead and choose “Stacked area” for now and click “Configure.”
What we see next is a graph on top with metrics on the bottom:
We can browse the different metrics by clicking on a “namespace,” which is typically the AWS service related to the metric. Let’s go to “EC2” and start looking at what’s available. We may be asked to choose between “Per-Instance Metrics” and “By Auto Scaling Group,” which conveniently groups the data under auto scaling groups. Assuming we want to monitor an individual EC2 instance, we’ll click on “Per-Instance Metrics” and see a long list of EC2 instances (the name and ID are shown) and the available metrics for these instances.
These metrics include CPUUtilization, CPUCreditBalance (if it’s a burstable instance type), NetworkIn, and DiskReadOps, to name a few. Let’s choose “CPUUtilization” by clicking the checkbox next to it:
Doing so graphs it above and we get a good indication of what the widget will look like on our dashboard. If we wanted to, we could add multiple metrics to the same graph. This may help us to see how two or more are related to each other. To get a little more granular with the metric we have chosen, we can click on the “Graphed metrics” tab at the bottom and see a list of what has been added to the graph. From here, we can change the color, the label, the statistic, the period, and where we’d like the Y axis. We can also remove and duplicate what is graphed in order to create variations.
Statistic and Period are probably the most important to pay attention to. For Statistic, we can choose whether we’d like to see, in this case, the average CPU utilization, the minimum, or the maximum. The AWS documentation for each service typically spells out which statistics are most helpful for each metric. Period sets the granularity of the graphed metric: whether it shows a data point every 5 minutes, every hour, every second, or at some other interval (assuming we have enabled that level of reporting).
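The same choices show up when a widget is written as JSON rather than clicked together in the console. The following is a minimal sketch in Python, assuming the dashboard body format that CloudWatch accepts for dashboards; the helper name `cpu_widget`, the instance ID, and the region are placeholders of mine, not values from this walkthrough.

```python
import json


def cpu_widget(instance_id, region="us-east-1", stat="Average", period=300):
    """Build a stacked-area CPUUtilization widget as a plain dict."""
    return {
        "type": "metric",
        "properties": {
            # Each metric entry: namespace, metric name, dimension name, dimension value
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "view": "timeSeries",
            "stacked": True,    # "Stacked area" rather than "Line"
            "stat": stat,       # the Statistic column in "Graphed metrics"
            "period": period,   # the Period column, in seconds
            "region": region,
            "title": "CPU Utilization",
        },
    }


# Hypothetical instance ID, used only for illustration
widget = cpu_widget("i-0123456789abcdef0")
print(json.dumps(widget, indent=2))
```

Switching `stat` to "Maximum" or `period` to 60 gives the same variations the “Graphed metrics” tab offers.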
If we’d like to adjust the graph’s axes, we can do so in the “Graph Options” tab:
I find that this can be very helpful if we know the general range within which a metric should fall. For example, if we’re expecting to see a consistent 25% CPU utilization on average, adjusting the Y axis to Min: 0 and Max: 50 will put the average squarely in the middle and we’ll easily notice visually when things stray away from that. In this example, we’ll just leave things set to Auto.
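For anyone keeping dashboards as JSON, the fixed range described above can be sketched as a `yAxis` setting on the widget. This is an illustration only: the helper `with_y_axis` and the sample widget are hypothetical, and the 0 to 50 range is just the example from the paragraph above.

```python
def with_y_axis(widget, y_min, y_max):
    """Return a copy of a metric widget with a fixed left Y axis range."""
    props = dict(widget["properties"])
    props["yAxis"] = {"left": {"min": y_min, "max": y_max}}
    return {**widget, "properties": props}


# Hypothetical widget with a placeholder instance ID
base = {
    "type": "metric",
    "properties": {
        "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
        "stat": "Average",
        "period": 300,
        "region": "us-east-1",
    },
}

# An expected average of 25% sits in the middle of a 0 to 50 range
fixed = with_y_axis(base, 0, 50)
print(fixed["properties"]["yAxis"])
```

Leaving `yAxis` out altogether corresponds to the “Auto” setting used in this walkthrough.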
At the top left of the page you’ll see a pencil icon you can use to give a name to this graph, such as “CPU Utilization.” At the top right, you’ll see different timeframes for the graph. These are only for exploring the data as you create the graph; the dashboard will have them too and adjusting them will change the timeframe of all graphs. When we’re all done, we can click on “Create widget” to add it to our dashboard. While making changes to the dashboard, be sure to frequently click “Save dashboard” at the top. You should now have something that looks like the following:
Now let’s add a number widget to show the number of requests being made to our application load balancer (ALB). Go to “Add widget” -> “Number” -> “Configure.” Choose “ApplicationELB” under the AWS namespaces and then “Per AppELB Metrics.” Scroll down to the load balancer and check the box next to “RequestCount.” You may see a number that doesn’t seem right. If so, head over to the “Graphed metrics” tab and check the statistic. Make sure it’s set to “Sum” to display the total number of requests made within a particular time frame. A statistic like “Average” typically shows a value close to 1, which isn’t meaningful for a request count.
Adjust the period as desired. Let’s choose 5 minutes. Add a title to the graph such as “Requests,” click “Create widget,” and then save the dashboard. Note that regardless of the time range chosen on the dashboard, this widget always shows the value for the most recent 5-minute period. If you choose an absolute range from the past, however, it will show the most recent 5-minute period within that range.
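A number widget of this kind could be sketched in dashboard JSON as follows. This is a hedged example: the helper `number_widget`, the load balancer identifier, and the region are placeholders I made up, and the `singleValue` view is what I understand the console's “Number” widget to correspond to.

```python
def number_widget(title, metric_name, load_balancer, stat,
                  period=300, region="us-east-1"):
    """Build a single-value ("Number") widget for an ALB metric."""
    return {
        "type": "metric",
        "properties": {
            "metrics": [
                ["AWS/ApplicationELB", metric_name, "LoadBalancer", load_balancer]
            ],
            "view": "singleValue",
            "stat": stat,
            "period": period,  # seconds; 300 = the 5 minutes chosen above
            "region": region,
            "title": title,
        },
    }


# Placeholder load balancer dimension value, not a real resource
ALB = "app/my-load-balancer/1234567890abcdef"

# "Sum" gives the total request count for the period
requests_widget = number_widget("Requests", "RequestCount", ALB, stat="Sum")
print(requests_widget["properties"]["stat"])
```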
It looks like the widget added was a little smaller than the first one:
Let’s add another number widget by hovering over the Requests one, clicking the three dots that appear at the top right of the tile, and choosing “Duplicate.” A new widget is added, which we can update by hovering over it, clicking the three dots again, and choosing “Edit.” Let’s remove the metric from the “Graphed metrics” tab and add one for “TargetResponseTime,” which gives us the latency between the load balancer and the backend instances. We’ll keep the period at 5 minutes and use “Average” for the statistic. I’m going to label the widget as “Latency,” hit “Update widget,” and save the dashboard again.
This neatly fills in the space beneath the Requests widget. Note that you can hover over each widget and drag things around to get the right size and placement. The widgets lock into place along grid lines that make it easy to space things out proportionately. Here’s one way to arrange things:
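In dashboard JSON, placement is expressed with `x`, `y`, `width`, and `height` on a 24-column grid. The sketch below shows one hypothetical arrangement (not necessarily the one used in this walkthrough) along with a small check that nothing overlaps or runs off the grid; the sizes and the `validate` helper are my own assumptions.

```python
GRID_COLUMNS = 24  # CloudWatch dashboards use a 24-column grid

# Hypothetical layout: a wide graph with two number tiles stacked beside it
layout = {
    "CPU Utilization": {"x": 0, "y": 0, "width": 18, "height": 6},
    "Requests": {"x": 18, "y": 0, "width": 6, "height": 3},
    "Latency": {"x": 18, "y": 3, "width": 6, "height": 3},
}


def overlaps(a, b):
    """True if two grid rectangles share any cell."""
    return (
        a["x"] < b["x"] + b["width"]
        and b["x"] < a["x"] + a["width"]
        and a["y"] < b["y"] + b["height"]
        and b["y"] < a["y"] + a["height"]
    )


def validate(layout):
    """Raise ValueError if a widget leaves the grid or overlaps another."""
    boxes = list(layout.items())
    for name, box in boxes:
        if box["x"] < 0 or box["x"] + box["width"] > GRID_COLUMNS:
            raise ValueError(f"{name} does not fit on the grid")
    for i, (name_a, a) in enumerate(boxes):
        for name_b, b in boxes[i + 1:]:
            if overlaps(a, b):
                raise ValueError(f"{name_a} overlaps {name_b}")
    return True


print(validate(layout))
```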
Next, let’s add an alarm to the dashboard. To do this, go to “Alarms,” check the box next to one, and then hit “Add to Dashboard” at the top of the page. You’ll be asked to choose a dashboard and the widget type. After you’ve added it, save the dashboard. You can see that alarms are similar to metrics in that they are graphed on a widget, but they also have a red line indicating the threshold at which the alarm is triggered. There’s also a checkmark at the top right of the widget indicating that everything is currently OK. If the alarm is triggered, the widget turns red and stands out among the others. I’ve chosen an alarm that goes off when the number of 5XX error codes returned from backend instances exceeds 100 in a 5-minute period:
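For reference, an alarm widget in dashboard JSON points at the alarm's ARN through an `annotations` entry instead of listing metrics. The following is a rough sketch under that assumption; the account ID, region, alarm name, and the `alarm_widget` helper are all placeholders, not the alarm used here.

```python
def alarm_arn(region, account_id, alarm_name):
    """Assemble a CloudWatch alarm ARN from its parts."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def alarm_widget(title, arn, region):
    """Build a line-graph widget that displays an existing alarm."""
    return {
        "type": "metric",
        "properties": {
            # The alarm supplies the metric and the threshold line
            "annotations": {"alarms": [arn]},
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "title": title,
        },
    }


# Hypothetical account ID and alarm name, for illustration only
arn = alarm_arn("us-east-1", "123456789012", "backend-5xx-errors")
errors_widget = alarm_widget("Application Errors", arn, "us-east-1")
print(errors_widget["properties"]["annotations"]["alarms"][0])
```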
Lastly, let’s add some static text that might be helpful. Amazon describes this functionality as allowing you to create operational playbooks for team members to know what to do during operational events or incidents. I’ll keep things simple here and add some instructions and a phone number to call should something begin to fail. Go to “Add widget” -> “Text” -> “Configure.” You’ll see examples of text being formatted with markdown. We’re going to add a simple paragraph like this:
We now have a dashboard showing CPU utilization, application errors, load balancer requests, latency, and some general information about who to call in an emergency:
Let’s adjust the timeframe at the top right to 1 day (1d), click the down arrow next to the refresh button to turn on “Auto refresh” with 1 minute refresh intervals, and blow it up to full screen so we can put it on a TV by going to “Actions” -> “Enter full screen.”
And that’s it! We now have a helpful dashboard that shows us various aspects of our systems and applications with continuous updating. When incidents occur and alarms are triggered, going to a dashboard is often the quickest way of identifying what is going on. Dashboards also serve as a tool for monitoring trends over time and preventing incidents before it’s too late.
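If you would rather keep a dashboard like this in version control than rebuild it by hand, the whole thing can be published as one JSON document. The sketch below assumes boto3 and its `put_dashboard` call; the dashboard name, the widget contents, and the markdown text are placeholders rather than the exact dashboard built above.

```python
import json


def build_dashboard_body(widgets):
    """Serialize a list of widget dicts into a dashboard body string."""
    return json.dumps({"widgets": widgets})


def publish(name, widgets):
    """Create or overwrite a dashboard. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the rest runs without it

    client = boto3.client("cloudwatch")
    return client.put_dashboard(
        DashboardName=name,
        DashboardBody=build_dashboard_body(widgets),
    )


# Placeholder text widget; the markdown content is illustrative only
text_widget = {
    "type": "text",
    "x": 0, "y": 6, "width": 24, "height": 3,
    "properties": {"markdown": "## On call\nPlaceholder instructions go here."},
}

body = build_dashboard_body([text_widget])
print(body)

# To actually publish (hypothetical dashboard name):
# publish("my-dashboard", [text_widget])
```

Calling `put_dashboard` with an existing name replaces that dashboard's contents, so it is worth testing against a scratch dashboard first.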