Introduction
I was very excited when I first discovered AWS and all the Big Data capabilities and services it offers. The scalability, low cost, and ease of use were all huge benefits of the AWS Big Data environment. All that excitement came to a screeching halt when I thought, “wait a minute, how does all that data actually get from its sources into AWS?” Of course, AWS knows that use of its Big Data tools depends on getting large amounts of data into the cloud. Fortunately, AWS is forward-thinking enough to provide many different ways to ingest data. Unfortunately, it provides so many different ways that it can be overwhelming and confusing when it comes time to decide which service to use.
In this blog post I hope to cut through the confusion and complexity you face when first designing and building Big Data ETL processes in AWS. I will catalog and categorize the different options for data ingestion, briefly describe each service, and go over its typical use case so it is easier to decide which service best fits your needs.
Data ingestion from sources into AWS falls into one of two categories: the data is processed either as a batch or as a stream. Those are the first two categories in the table below. A few AWS services straddle both batch and stream processing, or, in the case of Snowball, do not really fall into either one; I have categorized those as Other in the table.
Now that I have cataloged and categorized the AWS services, you have a nice overview of all the data ingestion options. The table provides a great starting point for narrowing down the services that may work best for your use cases. Next, I will go over each one with a brief description and typical use case. Then you will be able to further narrow the options to the specific service to get the job done.
Batch Processing
Batch processing is when source data is collected over a period of time and then processed all at once at the end of that collection period. An example of a batch process would be bank transactions collected throughout the day and then processed together at the end of every day. Below are the AWS services that are best suited to batch processing, along with each service’s description and typical use case.
Data Pipeline
- Description from AWS documentation – http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
- Brief description – Data Pipeline starts up EC2 instances and runs the data jobs that you configure and schedule (see the sketch after this list).
- Typical use case – Archive server log files at the end of each day, process the files, and write the data to a database.
- Pricing – https://aws.amazon.com/datapipeline/pricing/
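To make the log-archiving use case a little more concrete, here is a minimal sketch using boto3’s Data Pipeline client. The pipeline name, command, and region are hypothetical, and the definition is heavily abbreviated; a real pipeline would also declare a schedule, an EC2 resource to run on, IAM roles, and data nodes. Treat this as an outline rather than a working pipeline.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Register the pipeline shell; uniqueId makes the call safe to retry.
pipeline_id = dp.create_pipeline(
    name="daily-log-archive",            # hypothetical name
    uniqueId="daily-log-archive-v1",
)["pipelineId"]

# A heavily abbreviated definition: a single shell command activity.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default",
         "fields": [{"key": "scheduleType", "stringValue": "ondemand"}]},
        {"id": "ProcessLogs", "name": "ProcessLogs",
         "fields": [
             {"key": "type", "stringValue": "ShellCommandActivity"},
             {"key": "command", "stringValue": "echo process yesterday's server logs"},
         ]},
    ],
)

# Data Pipeline launches the EC2 instances it needs once the pipeline is activated.
dp.activate_pipeline(pipelineId=pipeline_id)
```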
DMS (Database Migration Service)
- Description from AWS documentation – http://docs.aws.amazon.com/dms/latest/userguide/Welcome.html
- Brief description – DMS also starts up EC2 instances (replication instances) and runs data jobs on them. It differs from Data Pipeline by specializing in moving entire databases and replicating the source schema in the target database (see the sketch after this list).
- Typical use case – Moving data daily from an on-premises production database to an AWS database in order to retain historical data from the source.
- Pricing – https://aws.amazon.com/dms/pricing/
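As a sketch of the daily-replication use case, the snippet below uses boto3 to create and start a DMS replication task. The ARNs, schema name, and task identifier are placeholders, and it assumes the source and target endpoints and the replication instance have already been created in DMS.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Select every table in the (hypothetical) "sales" schema on the source.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="onprem-to-aws-sales",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy plus ongoing change capture
    TableMappings=json.dumps(table_mappings),
)

# In practice you would wait for the task to reach the ready state before starting it.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```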
AWS Batch
- Description from AWS documentation – http://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html
- Brief description – AWS Batch starts up an ECS cluster and provisions containers to run data jobs on. It is optimized for scalability and can scale out and in based on the size of the data (see the job-submission sketch after this list).
- Typical use case – Pharmaceutical data that is gathered periodically and is sometimes large and sometimes small.
- Pricing – https://aws.amazon.com/batch/pricing/
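Here is a small, hedged example of submitting work to AWS Batch with boto3. The job queue, job definition, and S3 prefix are hypothetical and assumed to be registered already (the job definition pointing at a Docker image containing the ETL code); Batch then scales the underlying compute to fit the job.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="pharma-batch-2017-06-01",
    jobQueue="etl-job-queue",          # hypothetical, pre-created queue
    jobDefinition="pharma-etl-job:1",  # hypothetical, pre-registered definition
    containerOverrides={
        # Pass the location of today's input so the same definition handles
        # both small and very large drops of data.
        "environment": [
            {"name": "INPUT_PREFIX", "value": "s3://my-bucket/pharma/2017-06-01/"}
        ],
    },
)

print("Submitted job:", response["jobId"])
```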
AWS Glue (coming soon)
- Description from AWS documentation – http://aws-glue-beta-documentation.s3-website-us-west-2.amazonaws.com/glue-dg.pdf
- Brief description – AWS Glue uses Spark running on the EMR engine to perform batch jobs. The jobs and transformations can be written in Python or Spark SQL (see the Spark sketch after this list).
- Typical use case – Web tracking companies that are gathering very large amounts of click-stream data and periodically processing that data.
- Pricing – Coming Soon
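Because Glue itself is still in beta, here is a hedged sketch of the kind of Spark (PySpark) batch job it would run for the clickstream use case. The S3 paths and column names are made up for illustration, and a real Glue job may expose its own job and catalog APIs on top of Spark.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-batch").getOrCreate()

# Hypothetical S3 locations for raw clickstream JSON and the processed output.
raw = spark.read.json("s3://my-bucket/clickstream/raw/2017-06-01/")

# A simple aggregation: page views per page per hour,
# assuming event_time is a timestamp column and page is the page URL.
hourly_views = (
    raw.withColumn("hour", F.hour(F.col("event_time")))
       .groupBy("page", "hour")
       .count()
)

hourly_views.write.mode("overwrite").parquet(
    "s3://my-bucket/clickstream/processed/2017-06-01/"
)
```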
Stream Processing
Stream processing is when source data is collected one record at a time or in small batches, and is processed in real-time or near real-time relative to the originating event. An example of a stream process would be Twitter data that is collected as users are posting. Below are the AWS services that are best suited to stream processing, along with each service’s description and typical use case.
Kinesis
- Description from AWS documentation – http://docs.aws.amazon.com/streams/latest/dev/introduction.html
- Brief description – Kinesis is a managed streaming service, similar to a durable queue, that is optimized for ingesting large amounts of data and moving it in near real-time (see the producer sketch after this list).
- Typical use case – A bank that processes large amounts of time-sensitive monetary transactions, analyzes each one for potential fraud, and then takes action in real-time.
- Pricing – https://aws.amazon.com/kinesis/streams/pricing/
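To illustrate the fraud-detection use case from the producer side, here is a minimal boto3 sketch that writes one transaction to a Kinesis stream. The stream name and record fields are hypothetical, and the stream is assumed to already exist with enough shards; a separate consumer (a Kinesis application or Lambda function, for example) would read and analyze the records.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

transaction = {
    "transaction_id": "txn-0001",
    "account_id": "acct-42",
    "amount": 1250.00,
    "timestamp": "2017-06-01T12:34:56Z",
}

# Using the account id as the partition key keeps each account's transactions
# ordered on the same shard, which simplifies downstream fraud analysis.
kinesis.put_record(
    StreamName="transactions",            # hypothetical, pre-created stream
    Data=json.dumps(transaction),
    PartitionKey=transaction["account_id"],
)
```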
SQS
- Description from AWS documentation – http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/Welcome.html
- Brief description – Simple Queue Service (SQS) moves data as messages between sources and targets (see the send/receive sketch after this list).
- Typical use case – IT Department that wants to track configuration changes as they happen and send alerts in real-time.
- Pricing – https://aws.amazon.com/sqs/pricing/
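Below is a minimal sketch of both sides of the SQS use case with boto3: one process sends configuration-change messages, and another polls for them and raises alerts. The queue name and message fields are hypothetical, and the queue is assumed to already exist.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Look up the (hypothetical) pre-created queue.
queue_url = sqs.get_queue_url(QueueName="config-changes")["QueueUrl"]

# Producer side: publish a configuration-change event as a message.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps(
        {"resource": "web-server-7", "change": "security group updated"}
    ),
)

# Consumer side: long-poll for messages, alert on them, then delete them.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
)
for msg in messages.get("Messages", []):
    print("ALERT:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```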
Lambda
- Description from AWS documentation – http://docs.aws.amazon.com/lambda/latest/dg/welcome.html
- Brief description – Lambda runs code without the need to provision servers. It can run custom code to handle data tasks that the other services do not cover well (see the handler sketch after this list).
- Typical use case – Company that runs a website and wants to process customized data and write that data to a database in real-time, as users are interacting with their website.
- Pricing – https://aws.amazon.com/lambda/pricing/
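As a sketch of the website use case, here is a small Lambda handler that takes an interaction event (invoked, for example, through API Gateway) and writes it to a database. DynamoDB is used here purely as an example target, and the table and field names are hypothetical.

```python
import boto3

# Hypothetical DynamoDB table that stores per-user interaction events.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_interactions")


def lambda_handler(event, context):
    """Invoked once per interaction; writes the event to the database."""
    item = {
        "user_id": event["user_id"],
        "timestamp": event["timestamp"],
        "action": event.get("action", "unknown"),
    }
    table.put_item(Item=item)
    return {"status": "stored", "user_id": item["user_id"]}
```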
Other Services
There are a few AWS services that can be used for both batch and streaming jobs or that do not really fit in either category. They are described below.
EC2
- Description from AWS documentation – https://aws.amazon.com/documentation/ec2/
- Brief description – EC2 provides compute resources to run customized applications for either batch or stream processing (see the launch sketch after this list).
- Typical use case – Company that needs to use IBM DataStage or Microsoft SSIS in order to integrate with their IBM or Microsoft environments to perform data processing.
- Pricing – https://aws.amazon.com/ec2/pricing/
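If you go the EC2 route, you are responsible for provisioning and bootstrapping the instances your ETL tool runs on. The snippet below is a minimal boto3 sketch of launching one such instance; the AMI ID, instance type, and key pair are placeholders, and the AMI is assumed to have the ETL tool (for example SSIS on a Windows image) pre-installed.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-12345678",      # placeholder AMI with the ETL tool installed
    InstanceType="m4.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="etl-keypair",       # placeholder key pair
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "Name", "Value": "etl-worker"}]}
    ],
)
```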
S3
- Description from AWS documentation – http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
- Brief description – S3 is Amazon’s Simple Storage Service. It is a distributed object store that can hold any kind of data object, uploaded from any location with an internet connection. It also provides a very low-cost way to store even very large amounts of data (see the upload sketch after this list).
- Typical use case – Company that wants to store back-up data or data in unusual file types, and then process that data in batch or real-time.
- Pricing – https://aws.amazon.com/s3/pricing/
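Getting data into S3 can be as simple as an upload_file call with boto3. The bucket and object keys below are hypothetical; once the object lands in S3 it can be picked up by a batch job (EMR, Glue, Batch) or trigger near real-time processing through an S3 event notification to Lambda.

```python
import boto3

s3 = boto3.client("s3")

bucket = "my-company-data-lake"  # hypothetical bucket name

# Upload a local backup file; S3 treats any file type as an opaque object.
s3.upload_file(
    "/backups/sales-2017-06-01.bak", bucket, "backups/sales-2017-06-01.bak"
)

# List what has landed under the backups/ prefix so far.
response = s3.list_objects_v2(Bucket=bucket, Prefix="backups/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```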
Snowball
- Description from AWS documentation – https://aws.amazon.com/snowball/
- Brief description – AWS Snowball provides a way to move very large, petabyte-scale amounts of data into S3 using a physical appliance that AWS ships to you. It is best suited to a one-time bulk load rather than ongoing ingestion.
- Typical use case – A company that wants to move all of its existing data into AWS could use Snowball for the initial load. It would then use one of the other data transfer methods above for the regular, incremental loads that follow.
- Pricing – https://aws.amazon.com/snowball/pricing/
Conclusion
When it comes to AWS, there are many different ways to get a job done; AWS provides multiple options in order to cover the wide range of use cases its customers have. This flexibility is one of the great benefits of using AWS, but it can also be confusing and complex. Hopefully this blog post provides a useful reference and starting point for designing and building your next data processing job.
If you are interested in learning more, check out our upcoming AWS Big Data (Streaming) Boot Camp in Lehi, Utah. You can also check our Events Page for information about other 1Strategy AWS training opportunities in Lehi and Seattle.