April 4th, 2017
Optimizing AWS S3 Data Upload
By Alex Graves

Recently, a 1Strategy customer came to me with a data migration problem: they needed to move 50TB of data files to S3. The engineers’ initial estimate for the transfer was 80 days. It sounds like a straightforward problem with a simple but uncomfortable solution: they had quite a bit of data and needed it to travel across the public internet, and that’s going to take a while. The challenge was that the business was demanding the migration be completed in 40 days. Yikes!

In a best-case scenario, it would take 12 days to upload 100TB on a 1Gbps connection, or 120 days on a 100Mbps connection, if the upload can consume 80% of the network bandwidth. In these scenarios, AWS recommends choosing Snowball over traditional upload methods when the data amounts to 60TB+ or 5TB+ respectively. Judging by the customer’s 80-day initial estimate, they probably didn’t have a 1Gbps connection, which means the data size called for Snowball.1 This was a problem because the data didn’t reside on-premises, where they could connect a Snowball device, but in Rackspace CloudFiles.
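The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough back-of-the-envelope model, not a benchmark: it assumes decimal units (1TB = 10^12 bytes), a constant 80% link utilization, and no per-file overhead. The function names are mine, chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_tb, link_mbps, utilization=0.8):
    """Days needed to move size_tb (decimal TB) over a link_mbps link."""
    bits = size_tb * 1e12 * 8                      # total payload in bits
    effective_bps = link_mbps * 1e6 * utilization  # usable bits per second
    return bits / effective_bps / SECONDS_PER_DAY


def required_mbps(size_tb, days):
    """Sustained throughput (Mbps) needed to move size_tb in `days` days."""
    return size_tb * 1e12 * 8 / (days * SECONDS_PER_DAY) / 1e6


print(round(transfer_days(100, 1000), 1))  # 11.6  (100TB on 1Gbps)
print(round(transfer_days(100, 100), 1))   # 115.7 (100TB on 100Mbps)
print(round(required_mbps(50, 80), 1))     # 57.9  (implied by the 80-day estimate)
print(round(required_mbps(50, 40), 1))     # 115.7 (needed for the 40-day deadline)
```

Under these assumptions, the customer’s 80-day estimate implies roughly 58Mbps of sustained throughput, and the 40-day deadline requires about double that.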

Pesky constraints like hosted data centers and deadlines can quickly turn a simple task into a difficult problem. Here were my recommendations, in order from least to most expensive.

1. S3 Transfer Acceleration

This is a built-in feature of S3 that can help with this type of upload. By enabling transfer acceleration on an S3 bucket, you can use AWS Edge Locations to speed up the upload of data. You will pay an additional fee for the increased transfer speed, but the estimated increase is between 50% and 400%, depending on the conditions of the uploads. All S3 users should estimate the cost and benefit of moving the data with Transfer Acceleration before attempting any large S3 upload. I estimated that it would cost ~$4,000 just in transfer costs for this size of data, but it may be worth the performance benefits if you’re under time constraints. You should also be sure to parallelize the PUT operations by performing multipart uploads. 1Strategy can help optimize this type of upload.
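As a minimal sketch of what this looks like with boto3, the snippet below enables acceleration on a bucket and uploads one file in parallel parts through the accelerate endpoint. It is not the setup used for this customer: the helper names, the 64 MiB preferred part size, and the concurrency of 10 are illustrative assumptions, and it assumes the file has already been pulled to local disk and that AWS credentials are configured. The part-size helper relies on S3’s documented multipart limits (at most 10,000 parts, at least 5 MiB per part except the last).

```python
import math
import os

MIN_PART = 5 * 1024**2   # S3 minimum multipart part size (5 MiB)
MAX_PARTS = 10_000       # S3 maximum number of parts per upload


def choose_part_size(file_size, preferred=64 * 1024**2):
    """Part size >= preferred that keeps the upload within MAX_PARTS."""
    needed = math.ceil(file_size / MAX_PARTS)
    return max(MIN_PART, preferred, needed)


def upload_accelerated(path, bucket, key, concurrency=10):
    """Hypothetical helper: accelerated, parallel multipart upload of one file."""
    # Imported here so choose_part_size() can be used without boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    # One-time bucket setting; needs permission to change bucket configuration.
    boto3.client("s3").put_bucket_accelerate_configuration(
        Bucket=bucket, AccelerateConfiguration={"Status": "Enabled"}
    )

    # Client that routes requests through the S3 accelerate endpoint.
    s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

    part = choose_part_size(os.path.getsize(path))
    cfg = TransferConfig(
        multipart_threshold=part,     # files larger than this are split
        multipart_chunksize=part,     # size of each uploaded part
        max_concurrency=concurrency,  # parts uploaded in parallel
    )
    s3.upload_file(path, bucket, key, Config=cfg)
```

A call would look like `upload_accelerated("archive-0001.tar", "my-migration-bucket", "archive-0001.tar")`, where both names are placeholders. For millions of small files, running many such uploads concurrently matters more than the part size of any single one.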

2. Appeal to Rackspace

I had the opportunity to ask some AWS Account Managers about this specific problem, and I was surprised by their response. They suggested asking Rackspace to perform a Snowball transfer for you; they may be willing to do this because they’re a Premier AWS Partner. If a Snowball is not doable, I recommend asking them about their in-house managed AWS services. They may be willing to help you migrate the storage to S3, since this specific project isn’t motivated by dissatisfaction with Rackspace products or services. I know that Rackspace has several AWS Direct Connect links that allow customers to build applications within AWS while maintaining a low-latency connection to back-end storage resources within Rackspace. For instance, if the data is stored in Rackspace’s ObjectRocket, you can take advantage of these connections immediately.2

3. Third-Party Tools

Based on these time constraints, I also recommend considering a third-party transfer product to meet the deadline. These services are usually not cheap. You’ll pay by the GB of data transferred, but the transfer can be completed rather quickly. I’ve never worked with any third parties on this type of problem, but I did some research into how much it might cost. Mover.io specializes in this type of large data set migration, and they charge about $1 per GB transferred. Another product is CloudFastPath, which charges about the same but offers discounts for larger datasets.
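To put the two paid options side by side, here is a rough cost sketch using only the figures quoted in this post. It assumes decimal units (1TB = 1,000GB) and a flat per-GB price with no volume discount; the Transfer Acceleration rate is not an official price, just the rate implied by my ~$4,000 estimate for 50TB.

```python
GB_PER_TB = 1000  # decimal units, an assumption


def transfer_cost(size_tb, usd_per_gb):
    """Flat per-GB cost in USD, ignoring volume discounts."""
    return size_tb * GB_PER_TB * usd_per_gb


size_tb = 50

# Per-GB rate implied by the ~$4,000 estimate above (not a quoted AWS price).
implied_accel_rate = 4000 / (size_tb * GB_PER_TB)

# Third-party tools at about $1 per GB transferred.
third_party_cost = transfer_cost(size_tb, 1.00)

print(f"Implied acceleration rate: ${implied_accel_rate:.2f}/GB")
print(f"Third-party tool at $1/GB: ${third_party_cost:,.0f}")
```

At these rates the third-party route comes out more than ten times as expensive as Transfer Acceleration for the same 50TB, which is why it sits last in the list.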

These three solutions should handle most large data upload cases where data is not stored on premises.

Finally, I sent them a whitepaper from Nasuni, a company that specializes in enterprise file storage solutions. In 2015 they tested the transfer speeds achievable when migrating data between different Cloud Service Providers (CSPs), including a test moving data from Rackspace to S3. They moved a subset of one million files totaling 200GB, then extrapolated their results to estimate the time to migrate their entire 12TB dataset. The whitepaper gives very few configuration details for the tests, but the results may be interesting to you as well.

I hope these are helpful recommendations. Let me know if you have comments, other ideas, or if there’s anything I missed!

1 There are other considerations to factor in as well, including the individual file size, total number of files, and whether the transfer can be parallelized or not.

2 After I wrote this I found out that Rackspace is willing to load customer data on a Snowball device.