When working with customers who are new to AWS, one of the first things we recommend is turning on CloudTrail. CloudTrail is an AWS service that monitors every API call made to your AWS account and makes a record of it in S3. These events can also be stored in CloudWatch Logs. API calls are made whenever anyone interacts with AWS, including through the console, CLI, SDKs, and raw APIs. With these logs in place, you can audit all activity on your account and answer questions such as:
- Who terminated that instance?
- Who granted access to that user?
- What access keys were used to delete that resource?
CloudTrail also helps you meet compliance regulations.
CloudTrail logs are stored across a series of S3 objects within an elaborate hierarchy. It can be cumbersome to browse through thousands of objects, looking for a particular event. Scripts and analytic tools such as EMR/Hadoop can be used, but it’s not a trivial task.
Luckily, Amazon recently announced a new service called Athena, which allows you to query S3 data, including CloudTrail logs, using SQL. Despite data being stored across a number of objects, it can be queried as though it’s sitting in a relational database.
I’d like to show you how simple the service is and how you can use it to audit CloudTrail activity.
To get started, log in to your AWS account and go to Athena. If this is your first time in the service, you may see a welcome page or a guided tutorial. Go to the Query Editor and click in the text area that shows an example query. We’ll start by creating a database, which is simply a way of grouping tables in Athena. Enter the following and click Run Query:
CREATE DATABASE cloudtrail;
We now have a database named cloudtrail, and we’ll want to choose it from the drop-down on the left side of the page before moving forward.
Next, we’ll create a table. Tables are essentially a way of mapping S3 data to relational counterparts. Amazon makes this process straightforward by providing documented examples of some supported formats. Below is an example of how you would create a table to represent your CloudTrail logs:
CREATE EXTERNAL TABLE logs (
  eventversion STRING,
  userIdentity STRUCT<
    type:STRING,
    principalid:STRING,
    arn:STRING,
    accountid:STRING,
    invokedby:STRING,
    accesskeyid:STRING,
    userName:STRING,
    sessioncontext:STRUCT<
      attributes:STRUCT<
        mfaauthenticated:STRING,
        creationdate:STRING>,
      sessionIssuer:STRUCT<
        type:STRING,
        principalId:STRING,
        arn:STRING,
        accountId:STRING,
        userName:STRING>>>,
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIpAddress STRING,
  userAgent STRING,
  errorCode STRING,
  errorMessage STRING,
  requestId STRING,
  eventId STRING,
  resources ARRAY<STRUCT<
    ARN:STRING,
    accountId:STRING,
    type:STRING>>,
  eventType STRING,
  apiVersion STRING,
  readOnly BOOLEAN,
  recipientAccountId STRING,
  sharedEventID STRING,
  vpcEndpointId STRING,
  -- added:
  requestParameters STRING,
  responseElements STRING,
  additionalEventData STRING,
  serviceEventDetails STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/AWSLogs/account-id/';
Note that we are creating a table called logs and it will have a number of different columns for each of the different properties of the CloudTrail records.
Note also that I have added four columns at the bottom. These are not included in the example Amazon provides because their structure varies depending on the service involved in the CloudTrail event. I have found, though, that much of the valuable information needed when auditing CloudTrail records is contained within these fields. While we may not be able to identify each of the child properties of these fields, we can treat them as simple strings, which allows us to query their contents.
Suppose you wanted to figure out who created an IAM user. The userIdentity.userName column would contain the user name of the person who made the request, and requestParameters would include the information sent along with it (for example, the name of the user being created). Our query would therefore need a WHERE clause that filters on requestParameters in order to find the creation event.
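As a rough sketch of that IAM example, the query below looks for CreateUser events whose request parameters mention a given user. The user name jane.doe is a hypothetical placeholder. The json_extract_scalar call assumes requestParameters arrives as JSON text when declared as a STRING column; if it does not in your setup, drop that column and rely on the LIKE filter alone.

```sql
-- Who created the IAM user "jane.doe"? (hypothetical user name)
SELECT
  eventTime,
  userIdentity.userName AS created_by,
  -- assumption: requestParameters is JSON text containing a userName key
  json_extract_scalar(requestParameters, '$.userName') AS created_user,
  requestParameters
FROM logs
WHERE eventSource = 'iam.amazonaws.com'
  AND eventName = 'CreateUser'
  AND requestParameters LIKE '%jane.doe%'
ORDER BY eventTime;
```

Filtering on eventSource and eventName first keeps the string match on requestParameters from picking up unrelated events that merely mention the same name.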
Note also that you’ll need to update bucket-name and account-id to reflect the location where your logs are stored. If you specified an S3 prefix when setting up CloudTrail, you’ll also want to add that to the path right after the bucket-name.
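For illustration, if the trail had been set up with an S3 prefix, the final clause of the CREATE EXTERNAL TABLE statement might look like the fragment below. Here my-prefix is a hypothetical prefix name, and bucket-name and account-id remain placeholders for your own values.

```sql
-- Fragment only: the last clause of the CREATE EXTERNAL TABLE statement.
-- "my-prefix" is a hypothetical prefix chosen when the trail was created.
LOCATION 's3://bucket-name/my-prefix/AWSLogs/account-id/';
```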
Click Run Query to create the table.
Now that the table has been created, we can start querying data. Below is an example of how you would query all CloudTrail logs for events performed by the john.smith user or events that involved his user (performed by someone else). Here we are including only the event time, the user performing the action, the name of the event, and the request and response information involved in the request. We are also sorting it by the event time.
select eventTime, userIdentity.userName, eventName, requestParameters, responseElements
from logs
where
  -- filter to include only requests by user john.smith
  -- or requests by others that involved his user account
  userIdentity.userName like '%john.smith%'
  or requestParameters like '%john.smith%'
  or responseElements like '%john.smith%'
order by eventTime
Click Run Query and wait for the query to execute. This may take a minute or so. The great thing about Athena is that you can run multiple queries at the same time. If you go to the History tab at the top of the page, you can see all executing and completed queries.
When the query is finished, you’ll see the result set:
You can even download your findings as a CSV file and save queries to execute again later. For a list of the different CloudTrail fields available to you, check out this reference.
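The same logs table can also answer the questions from the start of this post. As a sketch, the query below looks for whoever terminated a particular EC2 instance and which access key was used. The instance ID is a hypothetical placeholder, and the filter assumes the ID appears somewhere in the request parameters of the TerminateInstances call.

```sql
-- Who terminated that instance, and with which access key?
-- "i-0123456789abcdef0" is a hypothetical instance ID.
SELECT
  eventTime,
  userIdentity.userName,
  userIdentity.accesskeyid,
  sourceIpAddress,
  requestParameters
FROM logs
WHERE eventSource = 'ec2.amazonaws.com'
  AND eventName = 'TerminateInstances'
  AND requestParameters LIKE '%i-0123456789abcdef0%'
ORDER BY eventTime;
```

Swapping in a different eventSource and eventName adapts the same pattern to other actions, such as deletions in another service.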
In summary, Athena and CloudTrail make a great combination and allow you to see what’s going on in your AWS account within a matter of minutes.