August 7th, 2018
Monitoring Serverless Microservices
By Andrew Clark

We’ve had a few customers ask about best practices for monitoring their serverless applications. These are typically Lambda functions behind API Gateway which use DynamoDB as a data store. The nice thing about using serverless technologies on AWS is that all resources are constantly generating metrics and sending them to CloudWatch. All this data is available to you — you just have to know where to look and what to do with it.

CloudWatch Metrics

The CloudWatch Metrics and Dimensions Reference lists AWS services, the CloudWatch metrics available for each, and what those metrics mean. I’m going to cover some of the metrics that you might find useful for monitoring serverless applications such as microservices. These can be monitored at a combined regional level or tracked individually, for example per Lambda function.
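To make the regional versus per-function distinction concrete, here is a minimal sketch of the two queries. It only builds the arguments for CloudWatch's GetMetricStatistics call; the helper name lambda_metric_query and the function name orders-api are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def lambda_metric_query(metric_name, statistic, function_name=None, minutes=60):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call.

    With no function_name the query carries no dimensions, which asks for the
    metric aggregated across all Lambda functions in the region.
    """
    end = datetime.now(timezone.utc)
    query = {
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one-minute datapoints
        "Statistics": [statistic],
    }
    if function_name is not None:
        # Narrow the query to a single function.
        query["Dimensions"] = [{"Name": "FunctionName", "Value": function_name}]
    return query


# "orders-api" is a hypothetical function name.
regional = lambda_metric_query("Invocations", "Sum")
per_function = lambda_metric_query("Invocations", "Sum", function_name="orders-api")

# To run a query (needs boto3 and AWS credentials):
#   import boto3
#   result = boto3.client("cloudwatch").get_metric_statistics(**per_function)
#   datapoints = result["Datapoints"]
```

The only difference between the two queries is the Dimensions entry, so the same helper can serve both a regional dashboard and a per-function one.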

Lambda

Metrics from Lambda are sent to CloudWatch every minute. Below are some metrics you might want to keep an eye on. I would recommend creating alarms with thresholds on them so you are notified as soon as things start moving in the wrong direction:

Metric: Invocations
Description: How many times a function is invoked (successes and failures, but not throttled attempts)
Useful statistic: Sum
What to Do with It: Keep an eye on this to know when you start seeing more traffic than expected so that you can prepare for possible scaling challenges and increase soft/hard service limits.

Metric: Errors
Description: Number of failures due to errors in a function (does not include those resulting from throttles or from the Lambda service itself)
Useful statistic: Sum
What to Do with It: Errors are never good, so get notified right away when something stops working correctly so you can address it.

Metric: Duration
Description: How long functions run
Useful statistic: Average
Measured in: Milliseconds
What to Do with It: The max timeout of a Lambda function is 5 minutes, and each function’s configuration has a timeout setting that can be set up to this. Create some alarms to know when you start approaching this value so you can increase the limit or re-architect outside of Lambda.

Metric: Throttles
Description: How many times a function is throttled
Useful statistic: Sum
What to Do with It: Other alarms should have prepared you to prevent throttles from happening, so when this happens you definitely want to know about it and fix it right away.

Metric: ConcurrentExecutions
Description: How many function executions are happening concurrently
Useful statistic: Average
What to Do with It: Compare this to the soft limit currently set in the region (1,000 by default) and increase with AWS support as you approach it.

Other metrics also exist, such as those related to dead letter queues. Take a look and see what else you might want to track. Also, you can break things down by function or you can monitor individual versions or aliases.
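As a rough sketch of what such alarms could look like, the helper below builds the arguments for CloudWatch's PutMetricAlarm call for a single function. The function name, the SNS topic ARN, the 30-second timeout, and the 80 percent threshold are all assumptions chosen for illustration, not recommendations.

```python
def lambda_alarm(function_name, metric_name, statistic, threshold, topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{function_name}-{metric_name}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": statistic,
        "Period": 60,  # Lambda metrics arrive every minute
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [topic_arn],
    }


def duration_threshold_ms(timeout_seconds, percent=80):
    """Duration, in milliseconds, at the given percentage of the timeout."""
    return timeout_seconds * 1000 * percent // 100


# Hypothetical names: replace with your own function and SNS topic.
TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

# Alarm on any error within a one-minute period.
errors_alarm = lambda_alarm("orders-api", "Errors", "Sum", 1, TOPIC)

# Alarm when average duration reaches 80% of an assumed 30-second timeout.
duration_alarm = lambda_alarm(
    "orders-api", "Duration", "Average", duration_threshold_ms(30), TOPIC
)

# To create the alarms (needs boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_alarm(**errors_alarm)
#   cloudwatch.put_metric_alarm(**duration_alarm)
```

Tying the Duration threshold to the function's own configured timeout, rather than to the service maximum, gives a warning before that particular function starts timing out.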

API Gateway

Metrics from API Gateway are sent to CloudWatch every minute. Here are some that I like to track:

Metric: 4XXError
Description: The number of client-side errors
Useful statistic: Sum. You can also use Average to get an error rate.
What to Do with It: Watch out for increases in this metric, as they might indicate that consumers of your APIs expect to interact with them differently from how they are implemented. It may let you know if you’ve made breaking changes or are not honoring your API contract.

Metric: 5XXError
Description: The number of server-side errors
Useful statistic: Sum. You can also use Average to get an error rate.
What to Do with It: Similar to the Lambda Errors metric, this will tell you when something is not working as expected with the APIs you’ve built.

Metric: Count
Description: The number of API requests
Useful statistic: SampleCount
What to Do with It: Similar to the Lambda Invocations metric, it gives you an idea of demand.

Metric: IntegrationLatency
Description: How long API Gateway is waiting for something on the backend (e.g. Lambda)
Useful statistic: Average
Measured in: Milliseconds
What to Do with It: Similar to Lambda Duration, but keep in mind that API Gateway requests are limited to 29 seconds despite the 5 minute Lambda max. Keep an eye on it so your API requests don’t get close to timing out.

Cache hit ratios and different ways of measuring latency are also available. Also, you can break things down by API or you can monitor individual stages, resources, and methods. Some of these require turning on more detailed CloudWatch Metrics at an extra cost.
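To illustrate the error-rate and latency points, here is a minimal sketch of two alarm definitions for an API. The API name, the SNS topic ARN, the 1 percent error rate, and the 80 percent latency margin are assumptions for the example; the 29-second limit is the one mentioned above.

```python
API_GATEWAY_TIMEOUT_MS = 29_000  # the 29-second request limit


def api_alarm(api_name, metric_name, threshold, topic_arn, stage=None):
    """Build PutMetricAlarm arguments for an API Gateway metric.

    Uses the Average statistic: for 5XXError that is an error rate between
    0 and 1, and for IntegrationLatency it is the mean latency in ms.
    """
    dimensions = [{"Name": "ApiName", "Value": api_name}]
    if stage is not None:
        dimensions.append({"Name": "Stage", "Value": stage})
    return {
        "AlarmName": f"{api_name}-{metric_name}",
        "Namespace": "AWS/ApiGateway",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,  # require five consecutive bad minutes
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


# Hypothetical names: replace with your own API and SNS topic.
TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

# More than 1% of requests failing with a server-side error.
error_rate_alarm = api_alarm("orders-api", "5XXError", 0.01, TOPIC)

# Average backend latency above 80% of the API Gateway limit.
latency_alarm = api_alarm(
    "orders-api", "IntegrationLatency", API_GATEWAY_TIMEOUT_MS * 80 // 100, TOPIC
)

# To create the alarms (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**error_rate_alarm)
```

Alarming on the rate rather than the raw count keeps the threshold meaningful as traffic grows.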

DynamoDB

DynamoDB publishes a long list of metrics, but the following are some good ones to start with. Some are sent to CloudWatch every minute and others every five minutes.

Metric: ConsumedReadCapacityUnits
Description: How many RCUs are being used
Useful statistic: Sum
What to Do with It: Track how close you are to exceeding your provisioned throughput.

Metric: ConsumedWriteCapacityUnits
Description: How many WCUs are being used
Useful statistic: Sum
What to Do with It: Track how close you are to exceeding your provisioned throughput.

Metric: ReadThrottleEvents
Description: How many read requests are throttled
Useful statistic: Sum
What to Do with It: Know when you’ve exceeded read limits.

Metric: WriteThrottleEvents
Description: How many write requests are throttled
Useful statistic: Sum
What to Do with It: Know when you’ve exceeded write limits.

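Note that there are some nuances to how these metrics apply to indexes. For example, a global secondary index has its own provisioned throughput, separate from its table's, and its metrics are reported under a GlobalSecondaryIndexName dimension.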
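Tracking how close you are to provisioned throughput takes a little arithmetic, because the consumed-capacity metrics are totals over a period while provisioned capacity is expressed per second. The sketch below shows the conversion; the function name and the sample numbers are invented for illustration.

```python
def capacity_utilization(consumed_sum, period_seconds, provisioned_units):
    """Fraction of provisioned throughput used during one period.

    consumed_sum is the Sum statistic of ConsumedReadCapacityUnits (or
    ConsumedWriteCapacityUnits) over period_seconds; provisioned_units is
    the table's provisioned RCUs (or WCUs), which are per-second values.
    """
    average_per_second = consumed_sum / period_seconds
    return average_per_second / provisioned_units


# Example with made-up numbers: 1,200 RCUs consumed in one minute is an
# average of 20 RCUs per second, or 80% of a table provisioned at 25 RCUs.
utilization = capacity_utilization(
    consumed_sum=1200, period_seconds=60, provisioned_units=25
)
print(f"{utilization:.0%} of provisioned read capacity")
```

An average over a minute can hide short bursts, so it is worth pairing a utilization threshold with alarms on the throttle metrics themselves.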

Taking Things Further

That covers Lambda, API Gateway, and DynamoDB. Take a look at the other services being used by your applications, such as S3, SNS, SQS, and Kinesis. These have CloudWatch metrics as well.

You’ll also want to track application logs such as those coming from Lambda and API Gateway. You can create metric filters in CloudWatch Logs to know when your logs start reporting things of interest.
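As one possible shape for such a metric filter, the sketch below builds the arguments for the CloudWatch Logs PutMetricFilter call so that every log line containing ERROR counts toward a custom metric. The filter name, metric name, namespace, and function name are all hypothetical.

```python
def error_metric_filter(log_group_name):
    """Build keyword arguments for CloudWatch Logs' PutMetricFilter call."""
    return {
        "logGroupName": log_group_name,
        "filterName": "error-lines",  # hypothetical name
        "filterPattern": "ERROR",  # matches events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorLogLines",  # hypothetical custom metric
                "metricNamespace": "MyApp/Logs",  # hypothetical namespace
                "metricValue": "1",  # add 1 for each matching log event
                "defaultValue": 0,  # report 0 when nothing matches
            }
        ],
    }


# Lambda writes each function's logs to /aws/lambda/<function-name>;
# "orders-api" is a hypothetical function name.
params = error_metric_filter("/aws/lambda/orders-api")

# To create the filter (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("logs").put_metric_filter(**params)
```

Once the filter exists, the resulting custom metric can be alarmed on like any of the built-in metrics.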

Route 53 health checks can also be helpful for monitoring the overall availability of application endpoints.
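For reference, a basic HTTPS health check could be described as in the sketch below, which builds the arguments for Route 53's CreateHealthCheck call. The domain, the /health path, and the interval and failure threshold are assumptions; your endpoint may expose something different.

```python
import uuid


def https_health_check(domain, path="/health"):
    """Build keyword arguments for Route 53's CreateHealthCheck call."""
    return {
        # Unique string that lets Route 53 detect retried create requests.
        "CallerReference": str(uuid.uuid4()),
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": domain,
            "Port": 443,
            "ResourcePath": path,
            "RequestInterval": 30,  # seconds between checks
            "FailureThreshold": 3,  # consecutive failures before unhealthy
        },
    }


# "api.example.com" and "/health" are placeholders for your own endpoint.
config = https_health_check("api.example.com")

# To create the health check (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("route53").create_health_check(**config)
```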

Some companies have entire departments dedicated to operational monitoring, so this process can be an ongoing one, but this should point you in the right direction if you’re just getting started.