We’ve had a few customers ask about best practices for monitoring their serverless applications. These are typically Lambda functions behind API Gateway which use DynamoDB as a data store. The nice thing about using serverless technologies on AWS is that all resources are constantly generating metrics and sending them to CloudWatch. All this data is available to you — you just have to know where to look and what to do with it.
CloudWatch Metrics
The CloudWatch Metrics and Dimensions Reference lists AWS services, the CloudWatch metrics that are available for them, and what they mean. I’m going to cover some of the metrics that you might find useful for monitoring serverless applications, such as microservices. These metrics can be monitored in aggregate at the regional level or tracked individually, for example per Lambda function.
Lambda
Metrics from Lambda are sent to CloudWatch every minute. Below are some metrics you might want to keep an eye on. I would recommend creating alarms with thresholds on each of them so you are notified as soon as a metric starts trending the wrong way:
Metric | Description | What to Do with It |
---|---|---|
Invocations | How many times a function is invoked (successes and failures, but not throttled attempts). Useful statistic: Sum. | Keep an eye on this to know when you start seeing more traffic than expected so that you can prepare for possible scaling challenges and increase soft/hard service limits. |
Errors | Number of failures due to errors in a function (does not include those resulting from throttles or from the Lambda service itself). Useful statistic: Sum. | Errors are never good, so get notified right away when something stops working correctly so you can address it. |
Duration | How long functions run. Useful statistic: Average. Measured in: milliseconds. | The max timeout of a Lambda function is 5 minutes, and each function’s configuration has a timeout setting that can be set up to this. Create some alarms to know when you start approaching this value so you can increase the limit or re-architect outside of Lambda. |
Throttles | How many times a function is throttled. Useful statistic: Sum. | Other alarms should have prepared you to prevent throttles from happening, so when this happens you definitely want to know about it and fix it right away. |
ConcurrentExecutions | How many function executions are happening concurrently. Useful statistic: Average. | Compare this to the soft limit currently set in the region (1,000 by default) and increase with AWS support as you approach it. |
Other metrics also exist, such as those related to dead letter queues. Take a look and see what else you might want to track. Also, you can break things down by function or you can monitor individual versions or aliases.
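As a rough illustration of the alarm recommendation above, here is a minimal sketch that builds the parameters for a CloudWatch alarm on a single function’s Errors metric and, optionally, submits them with boto3’s `put_metric_alarm`. The function name, SNS topic ARN, and threshold are hypothetical placeholders, not values from a real account, and the helper names are my own.

```python
def build_error_alarm(function_name, sns_topic_arn, threshold=1):
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{function_name}-errors",
        "AlarmDescription": f"Errors reported by Lambda function {function_name}",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        # Scope the alarm to one function rather than the whole region.
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,  # Lambda metrics arrive every minute
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No invocations means no data points; do not treat that as a failure.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(alarm_kwargs):
    """Submit the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)


# Hypothetical function name and SNS topic, for illustration only.
params = build_error_alarm(
    "orders-api-handler",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
# create_alarm(params)  # uncomment to create the alarm in your account
```

The same builder shape should work for Throttles or Duration by swapping `MetricName`, `Statistic`, and `Threshold`; what counts as a sensible threshold depends on your own traffic.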
API Gateway
Metrics from API Gateway are sent to CloudWatch every minute. Here are some that I like to track:
Metric | Description | What to Do with It |
---|---|---|
4XXError | The number of client-side errors. Useful statistic: Sum. You can also use Average to get an error rate. | Watch out for increases in this metric, as they might indicate that consumers of your APIs expect to interact with them differently from how they are implemented. It may let you know if you’ve made breaking changes or are not honoring your API contract. |
5XXError | The number of server-side errors. Useful statistic: Sum. You can also use Average to get an error rate. | Similar to the Lambda errors, this will tell you when something is not working as expected with the APIs you’ve built. |
Count | The number of API requests. Useful statistic: SampleCount. | Similar to the Lambda Invocations, it gives you an idea of demand. |
IntegrationLatency | How long API Gateway is waiting for something on the backend (e.g. Lambda). Useful statistic: Average. Measured in: milliseconds. | Similar to Lambda Duration, but keep in mind that API Gateway requests are limited to 29 seconds despite the 5 minute Lambda max. Keep an eye on it so your API requests don’t get close to timing out. |
Cache hit ratios and different ways of measuring latency are also available. Also, you can break things down by API or you can monitor individual stages, resources, and methods. Some of these require turning on more detailed CloudWatch Metrics at an extra cost.
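To see why the Average statistic of an error metric reads as an error rate, here is a small worked sketch. It assumes, purely for illustration, that each request contributes a data point of 1 if it failed and 0 otherwise; the sample values are made up.

```python
def error_rate(per_request_values):
    """Average of 0/1 data points, i.e. errors divided by requests."""
    if not per_request_values:
        return 0.0
    return sum(per_request_values) / len(per_request_values)


# Ten hypothetical requests in one period, two of which returned a 5XX.
samples = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

total_errors = sum(samples)    # what the Sum statistic would report
request_count = len(samples)   # what the Count metric would report
rate = error_rate(samples)     # what the Average statistic would report
```

Alarming on the rate rather than the raw sum keeps the threshold meaningful as traffic grows, since two errors in ten requests is a very different situation from two errors in ten thousand.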
DynamoDB
DynamoDB publishes a long list of metrics, but the following are some good ones to start with. Some are sent to CloudWatch every minute and others every five minutes.
Metric | Description | What to Do with It |
---|---|---|
ConsumedReadCapacityUnits | How many RCUs are being used. Useful statistic: Sum. | Track how close you are to exceeding your provisioned throughput. |
ConsumedWriteCapacityUnits | How many WCUs are being used. Useful statistic: Sum. | Track how close you are to exceeding your provisioned throughput. |
ReadThrottleEvents | How many read requests are throttled. Useful statistic: Sum. | Know when you’ve exceeded read limits. |
WriteThrottleEvents | How many write requests are throttled. Useful statistic: Sum. | Know when you’ve exceeded write limits. |
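Note that there are some nuances to how these metrics apply to indexes: global secondary indexes have their own provisioned throughput, and their metrics are reported separately under the GlobalSecondaryIndexName dimension.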
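One detail that is easy to miss when tracking consumed capacity: provisioned throughput is expressed per second, while the Sum statistic totals the units consumed over the whole period. A minimal sketch of the conversion, with made-up numbers and a helper name of my own:

```python
def capacity_utilization(consumed_sum, period_seconds, provisioned_per_second):
    """Fraction of provisioned throughput used, averaged over the period."""
    if period_seconds <= 0 or provisioned_per_second <= 0:
        raise ValueError("period and provisioned capacity must be positive")
    average_per_second = consumed_sum / period_seconds
    return average_per_second / provisioned_per_second


# Hypothetical table: 50 provisioned RCUs, and a Sum of 2,400 consumed
# units reported for a 60-second period.
ratio = capacity_utilization(
    consumed_sum=2400,
    period_seconds=60,
    provisioned_per_second=50,
)
# 2400 / 60 = 40 units per second, and 40 / 50 = 0.8 of provisioned capacity
```

Because this is an average over the period, short bursts can still be throttled even when the ratio looks comfortable, which is why the throttle event metrics are worth alarming on as well.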
Taking Things Further
That covers Lambda, API Gateway, and DynamoDB. Take a look at the other services being used by your applications, such as S3, SNS, SQS, and Kinesis. These have CloudWatch metrics as well.
You’ll also want to track application logs such as those coming from Lambda and API Gateway. You can create metric filters in CloudWatch Logs to know when your logs start reporting things of interest.
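As one possible starting point for metric filters, the sketch below builds the parameters for a filter that counts log lines containing the term ERROR in a Lambda function’s log group, then optionally submits them with boto3’s `put_metric_filter`. The function name, filter name, and custom namespace are hypothetical, and the pattern assumes your code actually logs the word ERROR on failures.

```python
def build_error_metric_filter(function_name, namespace="MyApp/Logs"):
    """Return keyword arguments for CloudWatch Logs' put_metric_filter call."""
    return {
        # Lambda writes each function's logs to /aws/lambda/<function name>.
        "logGroupName": f"/aws/lambda/{function_name}",
        "filterName": f"{function_name}-error-lines",
        # Matches any log event containing the term ERROR (an assumption
        # about how the application logs failures).
        "filterPattern": "ERROR",
        "metricTransformations": [
            {
                "metricName": f"{function_name}-ErrorLines",
                "metricNamespace": namespace,  # hypothetical custom namespace
                "metricValue": "1",            # count one per matching event
                "defaultValue": 0,             # report 0 when nothing matches
            }
        ],
    }


def create_metric_filter(filter_kwargs):
    """Submit the filter. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("logs").put_metric_filter(**filter_kwargs)


# Hypothetical function name, for illustration only.
filter_params = build_error_metric_filter("orders-api-handler")
# create_metric_filter(filter_params)  # uncomment to create the filter
```

Once the filter exists, the resulting custom metric can be alarmed on like any other CloudWatch metric.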
Route 53 health checks can also help you monitor the overall availability of application endpoints.
Some companies have entire departments dedicated to operational monitoring, so this process can be an ongoing one, but this should point you in the right direction if you’re just getting started.