July 18th, 2018
AWS Glue and your Data Lake
By Alex Graves

One of the best resources for cloud architects is the AWS Answers page, a repository of useful documentation on specific solutions within AWS. A quick Google search for “Data Lake on AWS” will lead you to the AWS Answers page, where you can find information on a variety of topics. Here we’re looking specifically at the Data Lake Solution page in the Big Data section.

The AWS Solution for a Data Lake includes great documentation on best practices and FAQs, and even provides a CloudFormation template you can deploy into your account. This solution is a great place to start building your Data Lake, but there are some additional features you can use to increase its capabilities.

The AWS Data Lake solution was recently updated. Below are some ideas about the most effective use of AWS Glue in this architecture.

Crawlers

Crawlers are a great way to catalog and track data in your Data Lake. The storage layer of your Data Lake is going to be S3, but Glue can keep track of the objects you put into and take out of your buckets. The AWS solution mentions this, but it doesn’t describe how crawlers can also catalog data in RDS instances or how they can be scheduled. A crawler that catalogs your Data Lake isn’t finished until it’s scheduled to run automatically, so make sure you schedule it. The last thing you want is for Glue to overlook data landing in your S3 bucket.
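
As a rough illustration of a scheduled crawler, here is a minimal Python sketch that builds the parameters for the Glue CreateCrawler call (create_crawler in boto3) and passes them to a Glue client. The crawler name, role ARN, database name, bucket path, and connection name in the usage note are hypothetical placeholders, and the 02:00 UTC daily schedule is only an assumed default, not something the AWS solution prescribes.

```python
def build_crawler_request(name, role_arn, database, s3_paths,
                          jdbc_targets=(), schedule="cron(0 2 * * ? *)"):
    """Build keyword arguments for the Glue CreateCrawler API call.

    jdbc_targets is a sequence of (connection_name, path) pairs, for example
    a Glue connection that points at an RDS instance.
    The default schedule is an assumption: every day at 02:00 UTC.
    """
    targets = {"S3Targets": [{"Path": path} for path in s3_paths]}
    if jdbc_targets:
        targets["JdbcTargets"] = [
            {"ConnectionName": connection, "Path": path}
            for connection, path in jdbc_targets
        ]
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": targets,
        # Without a Schedule the crawler only runs when started by hand.
        "Schedule": schedule,
    }


def create_scheduled_crawler(glue_client, request):
    """Create the crawler with the given Glue client and return its name."""
    glue_client.create_crawler(**request)
    return request["Name"]
```

To use it against a real account, you would pass boto3.client("glue") as glue_client, along with a request such as build_crawler_request("lake-crawler", role_arn, "data_lake", ["s3://example-data-lake/raw/"], jdbc_targets=[("rds-conn", "sales/%")]). Keeping the request builder separate from the API call makes the schedule easy to review before anything is created.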

Jobs

The Jobs feature of Glue will allow you to build ETL workloads for any data within the Data Lake. If there are flat files uploaded to your S3 bucket that need to be loaded into your RDS instance overnight, a Glue job can handle that. This tool eliminates the need to spin up infrastructure just to run an ETL process. Instead, Glue will execute your PySpark or Scala job for you.
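
Because Glue runs the job for you, starting one is just an API call. The sketch below is one possible way to start an existing Glue job and wait for it to finish, using the start_job_run and get_job_run calls from boto3. It assumes the job has already been defined in Glue; the polling interval and the job name in the usage note are illustrative assumptions.

```python
import time

# Job run states after which Glue will not change the run any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_job_and_wait(glue_client, job_name, arguments=None,
                     poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and poll until it reaches a terminal state.

    Returns a (run_id, final_state) tuple. The sleep parameter exists so the
    wait can be replaced in tests; poll_seconds is an assumed interval.
    """
    start_kwargs = {"JobName": job_name}
    if arguments:
        # Job arguments are passed as a string-to-string mapping.
        start_kwargs["Arguments"] = arguments
    run_id = glue_client.start_job_run(**start_kwargs)["JobRunId"]

    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)
```

With a real client this would look like run_job_and_wait(boto3.client("glue"), "load-flat-files"), where "load-flat-files" is a hypothetical job name. No servers are provisioned by the caller; Glue allocates and releases the capacity for each run.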

Triggers

AWS Glue triggers let you kick off ETL jobs on a regular schedule and start additional jobs when others finish, so you can chain ETL jobs together for more complex workflows. All of the job executions are logged in CloudWatch as well, so you’ll have great visibility into errors or failures.
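
To make the chaining idea concrete, here is a small Python sketch that builds two trigger definitions for the Glue CreateTrigger call (create_trigger in boto3): one that starts a job on a schedule, and one that starts a second job when the first succeeds. The trigger and job names in the usage note are hypothetical, and the sketch assumes both jobs already exist in Glue.

```python
def scheduled_trigger(name, job_name, schedule):
    """Parameters for a trigger that starts job_name on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
        # Activate the trigger as soon as it is created.
        "StartOnCreation": True,
    }


def on_success_trigger(name, upstream_job, downstream_job):
    """Parameters for a trigger that starts downstream_job after
    upstream_job finishes with state SUCCEEDED."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def create_triggers(glue_client, requests):
    """Create each trigger and return the trigger names in order."""
    for request in requests:
        glue_client.create_trigger(**request)
    return [request["Name"] for request in requests]
```

For example, scheduled_trigger("nightly-load", "load-flat-files", "cron(0 3 * * ? *)") followed by on_success_trigger("after-load", "load-flat-files", "refresh-reports") would describe a two-step nightly workflow, which create_triggers could then submit through boto3.client("glue").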

Dev Endpoints

What better way to expose Data Lake resources to your Data Scientists than with their own endpoint? You can even spin up an EC2 instance running Apache Zeppelin that they can use to develop scripts against your data. The AWS solution identifies the Athena service as a way to explore your data in S3, but Data Scientists will need a more interactive way to explore and visualize that data.

These features of Glue will make your Data Lake more manageable and useful for your organization. The last thing you want from a Data Lake is for it to become a data swamp, an unmanageable mess of data.

If you are interested in learning more about how 1Strategy can help optimize your AWS cloud journey and infrastructure, please contact us for more information at info@1Strategy.com.