Part 2: Building a Model
In Part 1, we focused on creating a datasource to use in Amazon Machine Learning. That datasource defines a schema for our data and a target column for what our model should predict. Now we can begin building our model. If you missed Part 1, check it out here.
When I return to the Amazon ML console, I’ll see a list of saved objects. I’ll click on the Cat-Dog Data object to build a model from this datasource. In the Datasource Information section, there’s a drop-down menu labeled Use this datasource to; I’ll open it and select Create (train) an ML model. The next page will look like this:
I’ll select the Default settings because I don’t need to do anything fancy. I would select the Custom settings if I had separate training and evaluation data sets, or if I wanted to shuffle my data before choosing a training subset. For now, the default is fine, so I’ll click Review and move on. After reviewing my model settings and clicking Create ML model, we should see this page:
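If you’d rather script this step than click through the console, the same model can be created with the AWS SDK. The sketch below uses boto3 with made-up model and datasource IDs; the Parameters map roughly mirrors the knobs the custom settings expose, and anything not carried over from Part 1 is an assumption.

```python
import boto3

# Amazon ML is exposed through the "machinelearning" boto3 client.
ml = boto3.client("machinelearning", region_name="us-east-1")

# The IDs below are placeholders -- substitute the datasource ID created in Part 1.
ml.create_ml_model(
    MLModelId="ml-cat-dog-model",            # hypothetical model ID
    MLModelName="Cat-Dog Model",
    MLModelType="BINARY",                    # predicting cat person vs. dog person
    TrainingDataSourceId="ds-cat-dog-data",  # hypothetical ID of the Part 1 datasource
    Parameters={
        # Roughly the options the console's custom settings expose.
        "sgd.maxPasses": "10",
        "sgd.shuffleType": "auto",
    },
)
```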
At this stage, Amazon ML is creating a training subset of our data. This is the data set that will be used to teach our model how to differentiate between “cat people” and “dog people.” By default, it is 70% of our input datasource (you can modify this in the custom model settings). After training the model with this data, Amazon ML will evaluate its effectiveness against the remaining 30%. Creating the model shouldn’t take too long, and once it’s done we’ll click on Evaluation: Cat-Dog … under the Evaluations section in the left-hand navigation bar.
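With the console’s default settings, Amazon ML creates that 70/30 split and the evaluation for you. If you were scripting this instead, the evaluation is its own API call against the held-out datasource. The sketch below continues the hypothetical IDs from the previous snippet and assumes an evaluation datasource already exists.

```python
# Evaluate the trained model against the held-out 30% of the data.
ml.create_evaluation(
    EvaluationId="ev-cat-dog",                 # hypothetical evaluation ID
    EvaluationName="Evaluation: Cat-Dog Model",
    MLModelId="ml-cat-dog-model",
    EvaluationDataSourceId="ds-cat-dog-eval",  # hypothetical ID of the 30% split
)

# Once the evaluation's Status is COMPLETED, the headline metric for a
# binary model is the area under the ROC curve (AUC).
evaluation = ml.get_evaluation(EvaluationId="ev-cat-dog")
print(evaluation["PerformanceMetrics"]["Properties"].get("BinaryAUC"))
```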
Well, how about that! It’s a perfect model. It predicted cat people and dog people with 100% accuracy. With real-world data, this would surely be a sign of a problem: a red flag that our model has been overfit (maybe we’ll do another blog post on that topic, too), or that we don’t have a substantial evaluation data set. I’m not too worried about it here because all of this data is fabricated, so I kind of expected the model to be ridiculously good at distinguishing between the two types of people.
After clicking the Explore performance button, we can see a breakdown of the model’s evaluation. In my data, a 0 in the cat/dog column means the person is a dog person, while a 1 indicates a cat person.
I want my model to be very sure when it predicts a cat person, so I’m going to set the threshold at 0.90. This means a score must be above 0.90 for an observation to be deemed a cat person. When I make this change, the error rate for the model becomes 13%, and the Evaluation Summary page breaks down how many erroneous predictions I should expect at this threshold and what types of errors they will be.
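The threshold can also be set programmatically. This is a sketch reusing the same hypothetical model ID as before; in the console, the slider on the Explore performance page does the same thing.

```python
# Raise the score threshold so only high-confidence predictions count as "cat person".
ml.update_ml_model(
    MLModelId="ml-cat-dog-model",  # hypothetical ID from the earlier sketch
    ScoreThreshold=0.90,
)
```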
One of the individuals in my sample would be labeled as a dog person but is really a cat person. This is known as a false negative: the model labeled this person as not liking cats when, in reality, that person does like cats. We could also see errors in the other direction. The model could produce false positives, meaning it thought a person liked cats when, in reality, that person hates cats and can’t understand why people keep them as pets.
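To make the two error types concrete, here’s a tiny self-contained sketch with invented scores showing how the 0.90 threshold turns raw scores into labels, and where false negatives and false positives come from.

```python
# Invented (score, actual) pairs -- actual 1 means cat person, 0 means dog person.
predictions = [(0.95, 1), (0.85, 1), (0.40, 0), (0.92, 0), (0.10, 0)]
THRESHOLD = 0.90

# A false negative is a real cat person whose score falls below the threshold.
false_negatives = sum(1 for score, actual in predictions
                      if score < THRESHOLD and actual == 1)
# A false positive is a dog person whose score clears the threshold anyway.
false_positives = sum(1 for score, actual in predictions
                      if score >= THRESHOLD and actual == 0)

print(f"false negatives: {false_negatives}, false positives: {false_positives}")
# -> false negatives: 1, false positives: 1
```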
Adjusting the threshold value affects the number of false positive and false negative predictions, and choosing the right threshold depends directly on the business purpose of the model. In our case, I don’t mind many false negatives, because that just means some cat people will receive a dog-themed ad for my sale. In the grand scheme of my business, that’s not a big deal. However, if I were trying to prevent fraud, a false negative would be a big problem, and I’d probably keep the threshold value fairly low.
Now that we have a datasource and a model accurate enough for our purposes, we need to set up an endpoint and start sending data to Amazon ML for classification. In Part 3, next week, we will cover sending new data to Amazon ML in real time.
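As a quick preview of Part 3, enabling real-time predictions is a single API call. This sketch reuses the hypothetical model ID from above; the endpoint URL it returns is where we’ll send new records next week.

```python
# Sketch: turn on the real-time prediction endpoint for the model.
response = ml.create_realtime_endpoint(MLModelId="ml-cat-dog-model")
print(response["RealtimeEndpointInfo"]["EndpointUrl"])
```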