September 6th, 2018
Glue Classifiers
By Alex Graves

The AWS Glue service provides a number of useful tools and features. One of the best features is the Crawler tool, a program that will classify and schematize the data within your S3 buckets and even your DynamoDB tables. Each Crawler records metadata about your source data and stores that metadata in the Glue Data Catalog.

Crawlers are very good at determining the schema of your data, but they can be incorrect from time to time. For instance, if you have a non-standard log format, the Crawler will not know quite how to schematize the data. In these cases, you will need to build a custom classifier that tells the Crawler how to read your data's schema.

There are three types of custom classifiers you can create in Glue: an XML classifier, a JSON classifier, and a Grok classifier. Today I’m going to explain how to create a custom Grok classifier. It’s just like building a Logstash Grok filter, so if you’ve never used Logstash before, you may find this walkthrough especially helpful.

There are two key pieces to creating a Grok Classifier: a regular expression and a Grok expression. A Grok expression consists of a ‘pattern,’ a ‘field-name,’ and an optional ‘data-type.’ These three attributes are combined to make the full Grok expression like so:

%{pattern:field-name:data-type}

This constitutes a single Grok expression. We can combine multiple expressions to create a single filter. The ‘pattern’ section corresponds to a labeled regular expression. The Glue Classifier uses the Grok filter to parse each line of our data using the specific regular expressions you’ve identified. While I’m creating my Grok patterns, I like to use https://grokdebug.herokuapp.com/. It’s a web-based pattern tester and it will come in handy for sure.
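To make the relationship between a Grok expression and its underlying regular expression concrete, here is a minimal Python sketch of how a %{pattern:field-name:data-type} expression could be expanded into a regex with named groups. This is not how Glue implements it. The PATTERNS dictionary, the grok_to_regex helper, and the simplified WORD and INT definitions are all assumptions made for illustration, and nested patterns are not handled.

```python
import re

# Hypothetical pattern library: label -> regular expression.
# These are simplified stand-ins, not Glue's exact built-in definitions.
PATTERNS = {
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
}

# Matches %{PATTERN}, %{PATTERN:field} and %{PATTERN:field:type}.
GROK_EXPR = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")


def grok_to_regex(grok, patterns):
    """Expand each Grok expression into a regex group (named if a field is given)."""
    def expand(m):
        label, field, _data_type = m.groups()
        body = patterns[label]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return GROK_EXPR.sub(expand, grok)


regex = grok_to_regex("%{WORD:method} %{INT:status:int}", PATTERNS)
match = re.fullmatch(regex, "GET 200")
print(match.groupdict())
```

Each field-name becomes a named group, which is why the field names end up as column names later on. The data-type is parsed here but ignored, since casting is left to Glue.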

Let’s take a look at an example. If I have log data that looks like this:

DEBUG 07-12-2018 ERFIV-23869 "HELLO, HOW ARE YOU?"

The Glue Crawler may have trouble identifying each field of this data, so we can build a custom classifier for it. This data contains fields for log level, date, userID, and a message. Thankfully, the Glue service has built-in patterns for log level and date, so we only need to build custom patterns for the other two fields. The regular expressions I use to recognize the userID and message fields for this Grok Classifier may look like this:

USERID ([A-Z]{5}-[0-9]{5})

MESSAGE (\".*\")
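Before wiring custom patterns into a classifier, it can help to sanity-check them on their own. The sketch below uses Python's re module and assumes a five-letter prefix for USERID (to fit the sample ID ERFIV-23869) and `.*` between the quotes for MESSAGE so that a message of any length matches. The helper names is_user_id and is_message are hypothetical.

```python
import re

# Assumed forms of the two custom patterns: five letters to fit the
# sample ID ERFIV-23869, and ".*" so the quoted message can be any length.
USERID = re.compile(r"[A-Z]{5}-[0-9]{5}")
MESSAGE = re.compile(r'".*"')


def is_user_id(text):
    """True if the whole string looks like a userID."""
    return USERID.fullmatch(text) is not None


def is_message(text):
    """True if the whole string is a double-quoted message."""
    return MESSAGE.fullmatch(text) is not None


print(is_user_id("ERFIV-23869"))            # sample ID from the log line
print(is_user_id("ERF-23869"))              # too few letters
print(is_message('"HELLO, HOW ARE YOU?"'))  # sample message
print(is_message("HELLO"))                  # missing quotes
```

Checking a few values that should fail is as useful as checking the ones that should pass, since an overly loose pattern can swallow neighboring fields.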

When I create Grok expressions from these regular expressions, they look like this:

%{USERID:user:string}

%{MESSAGE:comment:string}

I can combine these custom patterns with the Glue built-in patterns to create a custom Classifier for this data. You can think of this Classifier as a definition of each column represented in your data set. The final Grok pattern will look like this:

%{LOGLEVEL:log} %{DATE_US:date} %{USERID:user:string} %{MESSAGE:comment:string}
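As a rough illustration of what this pattern does to a single line, the following Python sketch hand-translates it into one regex with four named groups. The LOGLEVEL and DATE_US definitions are simplified stand-ins that I am assuming for the example, not Glue's exact built-ins, and the sample line uses a US-style date (MM-DD-YYYY), which is the format DATE_US is meant for. The parse_line helper is hypothetical.

```python
import re

# Simplified stand-ins for the built-in patterns (assumptions, not exact definitions).
LOGLEVEL = r"DEBUG|INFO|WARN|ERROR|FATAL"
DATE_US = r"\d{1,2}[/-]\d{1,2}[/-]\d{4}"
# Custom patterns, assuming a five-letter userID prefix and a quoted message.
USERID = r"[A-Z]{5}-[0-9]{5}"
MESSAGE = r'".*"'

# One named group per Grok field-name: log, date, user, comment.
LINE = re.compile(
    rf"(?P<log>{LOGLEVEL}) (?P<date>{DATE_US}) (?P<user>{USERID}) (?P<comment>{MESSAGE})"
)


def parse_line(line):
    """Return a dict of column name -> value, or None if the line does not match."""
    m = LINE.fullmatch(line)
    return m.groupdict() if m else None


row = parse_line('DEBUG 07-12-2018 ERFIV-23869 "HELLO, HOW ARE YOU?"')
print(row)
print(parse_line("this line does not fit the format"))
```

A line that does not fit the pattern yields no row at all, which is a useful reminder to test the pattern against a representative sample of your logs before running a Crawler.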

Notice that I didn’t set a data type for every field. Glue treats any field without an explicit data type as a string. After associating my Crawler with this custom classifier, I can send the Crawler to collect metadata about my logs in S3. When the Crawler applies the Classifier to my data, it will match each line in my data against the Grok pattern and store the resulting schema in the Data Catalog. The name of each column will correspond to the field-name in each Grok expression. When I query my data with Athena, the table will show four columns: log, date, user, and comment.
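If you prefer to script this step rather than use the console, the classifier can also be registered through the AWS SDK. The sketch below builds the request for boto3's Glue create_classifier call and, in a separate function that is not invoked here, attaches the classifier to an existing Crawler with update_crawler. The classifier name, classification label, and crawler name are hypothetical placeholders, the custom patterns assume a five-letter userID prefix, and the SDK calls require boto3 plus valid AWS credentials.

```python
def build_grok_classifier(name, classification):
    """Build the keyword arguments for Glue's create_classifier call."""
    return {
        "GrokClassifier": {
            "Name": name,
            # Label stored on tables this classifier matches.
            "Classification": classification,
            "GrokPattern": (
                "%{LOGLEVEL:log} %{DATE_US:date} "
                "%{USERID:user:string} %{MESSAGE:comment:string}"
            ),
            # One custom pattern per line: NAME, a space, then the regex.
            "CustomPatterns": 'USERID ([A-Z]{5}-[0-9]{5})\nMESSAGE (".*")',
        }
    }


def register_classifier(request, crawler_name):
    """Create the classifier and attach it to an existing Crawler (calls AWS)."""
    import boto3  # third-party; imported here so the sketch runs without it

    glue = boto3.client("glue")
    glue.create_classifier(**request)
    glue.update_crawler(
        Name=crawler_name,
        Classifiers=[request["GrokClassifier"]["Name"]],
    )


# Hypothetical names, for illustration only.
request = build_grok_classifier("app-log-classifier", "app_logs")
print(request["GrokClassifier"]["GrokPattern"])
# register_classifier(request, "my-log-crawler")  # uncomment to call AWS
```

Keeping the request builder separate from the AWS call makes it easy to review or unit test the patterns before anything is created in your account.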

Hopefully, this has given you an example of how to make a custom Glue Classifier and some context about when to use one. If you have non-standard log data or some specialized space-delimited data that is stumping your Crawler, then Grok patterns are the way to go.

If you are interested in learning more about how 1Strategy can help optimize your AWS cloud journey and infrastructure, please contact us for more information at info@1Strategy.com.