September 26th, 2017
What I Like About Glue
By Alex Graves

Glue is Amazon Web Services’ newest serverless tool. It’s made to help developers and data engineers extract, transform, and load (ETL) their data. The predecessor to Glue was Data Pipeline, a useful but flawed service.[1] With Glue, data experts can now keep objects in S3 for inexpensive storage, trigger modification of those objects with PySpark Jobs, and migrate them to RDS Aurora (or another S3 bucket) for use by downstream applications.

I may just be a Romantic, but I’m a sucker for a good ETL workflow. I’ve been demoing the Glue service for a customer and the more I use it, the more things I learn about it.

If you’ve had the pleasure of viewing this screen, 

you may be frustrated and conflicted about this AWS native ETL tool. Don’t be discouraged. I, too, have had some frustrations, but I’d like to focus on what Glue gets right and help you learn from my mistakes.

Here are a few things I like about the new service:

Execution

What I like about Glue,
it knows what to log.
Show me what I execute,
wanna cut through the fog. Yea.

When a Job executes, Glue logs the configurations for that Job in CloudWatch. This can be incredibly helpful if you use Glue to test Jobs and then want to run those Jobs on your own EMR clusters.
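If you want to pull those log lines back out without clicking through the console, a minimal sketch might look like this. It assumes boto3's CloudWatch Logs client, assumes Glue writes to its default log group `/aws-glue/jobs/output`, and assumes each run's log stream name begins with the Job run ID. The function name `fetch_job_log_messages` is my own invention, not part of any AWS library.

```python
def fetch_job_log_messages(logs_client, job_run_id,
                           log_group="/aws-glue/jobs/output"):
    """Collect every log message for one Glue Job run.

    logs_client should behave like boto3.client("logs").
    ASSUMPTIONS: the default log group name above, and that the
    run's log stream name starts with the Job run ID.
    """
    messages = []
    kwargs = {"logGroupName": log_group, "logStreamNamePrefix": job_run_id}
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        # Ask for the next page of events on the following loop.
        kwargs["nextToken"] = token
```

With real credentials you would pass `boto3.client("logs")` as the first argument and a run ID copied from the Job's execution history as the second.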

Catalogs

Keep my catalog up to date.
Show me all the data my apps create,
‘cause that’s true.
That’s what I like. 

Glue is a great way to automate Big Data Jobs. It even has features to handle updates to your data lake. If you’re regularly using Athena, you’ll notice that the databases and tables created with Athena appear in the Glue Data Catalog. The Data Catalog is a single entity that feeds both tools. I found this out the hard way when I attempted to delete only my Glue resources and wiped out a table that a coworker of mine was working with.

This also works the other way. When I create a database or table with Glue—manually or with the Crawler—those resources will show up in the Athena console. So, be aware that both services share the same pool of S3 data resources.
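Given that shared catalog, it seems worth listing what lives in a database before deleting anything. Here is a rough sketch, assuming boto3's Glue client and its `get_tables` call; the helper name `list_catalog_tables` and the database name in the usage note are hypothetical.

```python
def list_catalog_tables(glue_client, database_name):
    """Return (table name, S3 location) pairs for one Data Catalog database.

    glue_client should behave like boto3.client("glue").
    """
    tables = []
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # Not every table carries a storage location, so default to "".
            location = table.get("StorageDescriptor", {}).get("Location", "")
            tables.append((table["Name"], location))
        token = page.get("NextToken")
        if not token:
            return tables
        kwargs["NextToken"] = token
```

Calling `list_catalog_tables(boto3.client("glue"), "my_database")` and reading the result with your teammates first might save someone's Athena table.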

 

Scheduling

What I like about Glue,
it really knows how to work.
I can schedule any Job at any time,
and chaining them together’s a perk. Yea.

I can set up scheduled Jobs, kick off saved Jobs with Lambdas, and even trigger Jobs with the completion of earlier Jobs, like a work chain. The executions of each Job are saved in a history and, as I mentioned before, the logs are retained in CloudWatch. Errors during the execution of Glue Jobs are much more explicit than in many other services I’ve worked with. From the Job execution history, I can see the Java error that my Job experienced and the exact value of the field that caused the problem. This is awesome, because slogging through Java logs can be frustrating.
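The three ways of starting a Job described above can be sketched roughly as follows. This assumes boto3's Glue client: the two builder functions only assemble keyword arguments intended for `create_trigger`, and `start_job` wraps `start_job_run` the way a Lambda handler might. All function names, trigger names, and Job names here are hypothetical.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Keyword arguments for a time-based trigger."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,  # e.g. "cron(0 12 * * ? *)"
        "Actions": [{"JobName": job_name}],
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Keyword arguments for a trigger that fires after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
    }


def start_job(glue_client, job_name, arguments=None):
    """Start a saved Job on demand and return its run ID.

    glue_client should behave like boto3.client("glue").
    """
    response = glue_client.start_job_run(
        JobName=job_name, Arguments=arguments or {}
    )
    return response["JobRunId"]
```

In practice you would pass a builder's result along as `glue.create_trigger(**chained_trigger("after-extract", "extract", "load"))`. A newly created trigger may still need to be activated (for example with `start_trigger`) before it fires, so check its state in the console.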

 

Testing

Let me dev my ETL.
Let me test my scripts ‘til they’re workin’ well,
‘cause that’s true.
That’s what I like about Glue!

I love the Developer Endpoint feature. I can spin up an endpoint when I’m ready to build a pipeline, then SSH into the Glue Spark shell (using the ENIs). After trying some data manipulations in a REPL fashion, I can have Glue build an EC2 instance to host Zeppelin (via CloudFormation) and build a PySpark script to be saved in S3. When I’m done, I just tear it all down. The best part is that if I want to go back and modify that script, the Zeppelin notebook will pull down the same PySpark script from S3! This design lets me collaborate with other Data Engineers on the same pipelines. It’s almost a source code management system for data scripting.

When I created a Dev Endpoint, the Glue service spun up 5 ENIs in the VPC that I assigned to house the endpoint.

With a minimum cost of 5 DPU per endpoint, my guess is that Glue uses one DPU per ENI. The documentation, however, reports that Dev Endpoints require a minimum of 2 DPU, so my next experiment with Glue will be based around deleting some of these ENIs to reduce cost. Configuring the VPC and setting up the permissions that Glue needs is not a simple task; it’s worthy of its own blog article and I’ll probably write a walkthrough of that soon.

I was a bit surprised to learn that Dev Endpoints accumulate cost for as long as they’re active, regardless of how much data Glue is actually processing. I racked up a hefty bill by leaving the Dev Endpoint active for a few days even though I wasn’t trying any test scripts or interacting with the Glue shell. As a best practice, I would recommend using the Dev Endpoints only during business hours.
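To put rough numbers on that, here is a back-of-the-envelope sketch. The hourly rate of 0.44 per DPU is an assumption on my part, not a figure from my bill, so check current AWS pricing for your region. The function name `endpoint_cost` is hypothetical too.

```python
def endpoint_cost(dpu_count, hours_active, rate_per_dpu_hour=0.44):
    """Estimated charge for keeping a Dev Endpoint provisioned.

    rate_per_dpu_hour is an ASSUMED price; look up the real one.
    The charge depends only on provisioned time, not on data processed.
    """
    return round(dpu_count * hours_active * rate_per_dpu_hour, 2)


# Three days left running around the clock vs. three 8-hour working days,
# both on a 5 DPU endpoint.
always_on = endpoint_cost(dpu_count=5, hours_active=3 * 24)
business_hours = endpoint_cost(dpu_count=5, hours_active=3 * 8)
print(always_on, business_hours)
```

Whatever the actual rate turns out to be, the around-the-clock endpoint costs three times as much as the business-hours one, because the only thing that changes is the number of hours.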

That’s what I like about Glue!

There is a lot to this new ETL service that AWS has created and I’m sure we’ll hear more about best practices as customers continue using it.

Let me know (Alex@1Strategy.com) if you think Glue might be a good fit for your latest ETL pipeline!

 

[1] The interface for Data Pipeline was not intuitive and it was frustratingly configuration-heavy.