Part 1: Creating a Data Source
In the early 1970s, my grandmother worked as an office clerk for a Texas oil man. She performed many weekly tasks by hand, like writing checks, balancing accounting books, replying to correspondence, and making sales calls. By the 1980s, roughly half of her tasks were performed by computer systems and her workday was irrevocably transformed. We’re all living through another revolutionary advancement in technology right now. Machine learning and artificial intelligence will fundamentally alter every aspect of our daily lives within the next 10 to 20 years.
AWS provides Machine Learning as a service (Amazon ML) to help customers propel their businesses forward. There are several useful and interesting questions we could answer with Amazon ML, but let’s build something more fun. If you’re allergic to goofing off, take comfort in knowing that this project is related to an actual proof-of-concept I did for a customer.
I’m going to pretend that I work for a pet supply store that will have a week-long sale next month. The most effective ad strategy for my sale would be to target “dog people” with dog ads and “cat people” with cat ads. How can I determine whether my customers are cat or dog people if I don’t know what kind of animal they own? I want to send a quick survey to my customers and have them answer some basic questions to help me determine which camp they fit into.
For this project, I’ll just use some made-up data. It’s important for any company to know who their customers are, and part of knowing about customers is having data about them. Having a ton of data about my users doesn’t necessarily provide a ton’s worth of understanding about them, though.
The image below is a good representation of the type of data confusion that can occur. We cannot say much about the individuals in this photo except that they like the Seahawks and we can therefore assume that they are football fans.
This is akin to my pet supply data problems. I can say that my customers like the service I provide and they might even prefer my company to others; I can also assume they have pets. There isn’t much more we can do with this type of data without building some models to help us.
Amazon ML is a simple and easy service for creating the type of model we need.
First, we need to create a data source from the input data that our awesome survey conductors gathered. From the Amazon ML console, click the Create new … drop-down menu and select Datasource and ML model. These instructions focus on the data source step of the Machine Learning process. Input data for Amazon ML can come from either S3 or Redshift. We’re going to use S3 for this project because it’s an easy place to store our survey results.
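To make the S3 input concrete, here is a minimal sketch of turning survey answers into a CSV file and working out its S3 location. The column names, row values, bucket, and key are all made up for illustration; the upload helper assumes boto3 and AWS credentials are available, so it is defined but not called.

```python
import csv
import io

# Hypothetical survey answers; column names and values are invented for this sketch.
SURVEY_ROWS = [
    {"person": "p001", "owns_leash": "1", "weekly_walks": "7", "cat_dog": "dog"},
    {"person": "p002", "owns_leash": "0", "weekly_walks": "0", "cat_dog": "cat"},
    {"person": "p003", "owns_leash": "1", "weekly_walks": "5", "cat_dog": "dog"},
]


def rows_to_csv(rows):
    """Serialize survey rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def s3_uri(bucket, key):
    """Build the s3:// location that a data source would point at."""
    return f"s3://{bucket}/{key}"


def upload_survey(bucket, key, rows):
    """Upload the CSV to S3. Requires boto3 and credentials; not called here."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    body = rows_to_csv(rows).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return s3_uri(bucket, key)


csv_text = rows_to_csv(SURVEY_ROWS)
print(csv_text)
```

Calling something like upload_survey("my-survey-bucket", "surveys/pets.csv", SURVEY_ROWS) would then give the location to paste into the console, assuming that bucket exists and you can write to it.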
After we enter the location of our data, Amazon ML will pull from this object to create our Cat/Dog model. Now we need to describe the schema of the data, so click the Verify option and move on to the next step. Amazon ML does a great job of guessing the data types within our data set, but it’s always a good idea to double-check any assumptions. The image below is what I see when Amazon ML searches my sample data. All of this looks correct, so I’ll click Continue and move on.
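One way to double-check the guessed types is to run a rough guess of your own over a few sample values. The sketch below is a simplified heuristic I made up for this purpose, not Amazon ML's actual inference logic, and the column names are hypothetical. It only uses three of the attribute types (BINARY, NUMERIC, CATEGORICAL).

```python
def guess_attribute_type(values):
    """Rough type guess for one column of string values (simplified heuristic)."""
    cleaned = [v.strip().lower() for v in values if v.strip()]
    if not cleaned:
        # Nothing to go on, so fall back to the most permissive type.
        return "CATEGORICAL"
    if set(cleaned) <= {"0", "1"}:
        # Checked before NUMERIC because "0" and "1" also parse as numbers.
        return "BINARY"
    try:
        for value in cleaned:
            float(value)
    except ValueError:
        return "CATEGORICAL"
    return "NUMERIC"


# Hypothetical sample columns from the survey file.
sample_columns = {
    "person": ["p001", "p002", "p003"],
    "owns_leash": ["1", "0", "1"],
    "weekly_walks": ["7", "0", "5"],
    "cat_dog": ["dog", "cat", "dog"],
}

guessed_types = {
    name: guess_attribute_type(column) for name, column in sample_columns.items()
}
for name, attribute_type in guessed_types.items():
    print(f"{name}: {attribute_type}")
```

If a guess here disagrees with what the console shows, that column is worth a second look before continuing.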
We must also identify a target column, which is the variable we want our model to predict as we send it new data. For example, if I want to be able to predict the height of all the children in a kindergarten class based upon their gender, weight, and eye color, I’d set up my target as the height column within my training set. For this data, I’ll establish the cat/dog column, representing whether an individual prefers cats or dogs, as the target column.
After setting a target, I can establish an identifier column if my data has one. The row identifier can be useful if we want to compare the selected training data with the evaluation data after our model is built. I’m not concerned with that here, so I just selected the person column (we might come back and cover this in a later blog post).
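The target and row identifier choices end up in the data schema. Below is a sketch of what that schema could look like as JSON, based on my understanding of the Amazon ML schema format (version, targetAttributeName, rowId, attributes). The attribute names and types are hypothetical, and the build_schema helper is my own, not part of any AWS library.

```python
import json


def build_schema(attribute_types, target, row_id=None):
    """Assemble a schema dict, checking the target and row ID choices first."""
    if target not in attribute_types:
        raise ValueError(f"target {target!r} is not a known attribute")
    if row_id is not None:
        if row_id not in attribute_types:
            raise ValueError(f"row ID {row_id!r} is not a known attribute")
        if row_id == target:
            raise ValueError("row ID and target must be different columns")

    schema = {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": attribute_type}
            for name, attribute_type in attribute_types.items()
        ],
    }
    if row_id is not None:
        schema["rowId"] = row_id
    return schema


# Hypothetical attributes for the survey data.
survey_attribute_types = {
    "person": "CATEGORICAL",
    "owns_leash": "BINARY",
    "weekly_walks": "NUMERIC",
    "cat_dog": "CATEGORICAL",
}

schema = build_schema(survey_attribute_types, target="cat_dog", row_id="person")
print(json.dumps(schema, indent=2))
```

The validation up front is a design choice: a misspelled target name is much cheaper to catch here than after the data source has been submitted.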
The final step is to review all the choices we’ve made thus far. If everything looks good, click Continue to begin creating the model. Amazon ML will save these settings for this type of data and will be able to refer to these configurations for any new data I gather about my customers. Creating a Datasource is like making a template for my data that can be used later for making new models or updating older ones.
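The same choices can also be submitted without the console. The sketch below builds the request for boto3's create_data_source_from_s3 call on the machinelearning client. The IDs, names, S3 location, and schema contents are hypothetical, and the function that actually calls AWS is defined but not invoked, since it needs boto3, credentials, and access to the service.

```python
import json

# Minimal hypothetical schema; attribute names are invented for this sketch.
EXAMPLE_SCHEMA = {
    "version": "1.0",
    "rowId": "person",
    "targetAttributeName": "cat_dog",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "person", "attributeType": "CATEGORICAL"},
        {"attributeName": "weekly_walks", "attributeType": "NUMERIC"},
        {"attributeName": "cat_dog", "attributeType": "CATEGORICAL"},
    ],
}


def build_datasource_request(datasource_id, name, data_uri, schema):
    """Collect the console choices into keyword arguments for the API call."""
    return {
        "DataSourceId": datasource_id,
        "DataSourceName": name,
        "DataSpec": {
            "DataLocationS3": data_uri,
            # The schema travels as a JSON string inside the data spec.
            "DataSchema": json.dumps(schema),
        },
        # Statistics are needed if the data source will be used for training.
        "ComputeStatistics": True,
    }


def create_datasource(request):
    """Submit the request to Amazon ML. Not called here; needs AWS access."""
    import boto3  # imported lazily so the sketch runs without it

    client = boto3.client("machinelearning")
    return client.create_data_source_from_s3(**request)


request = build_datasource_request(
    datasource_id="ds-cat-dog-survey",  # hypothetical ID
    name="Cat/Dog survey",  # hypothetical name
    data_uri="s3://my-survey-bucket/surveys/pets.csv",  # hypothetical location
    schema=EXAMPLE_SCHEMA,
)
print(json.dumps(request, indent=2))
```

Keeping the request as a plain dict makes it easy to reuse the same template for the next batch of survey data by swapping only the ID and the S3 location.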
In Part 2 (coming Thursday), we’ll be creating the model that will predict our target values.