Once you have defined your problem and prepared your data, you need to apply machine learning algorithms to that data in order to solve your problem. You can spend a lot of time choosing, running, and tuning algorithms, so you want to make sure you are using your time effectively to get closer to your goal.
Evaluation procedure #1: Train and test on the entire dataset
- Train the model on the entire dataset.
- Test the model on the same dataset, and evaluate how well we did by comparing the predicted response values with the true response values.
Classification accuracy:
- Proportion of correct predictions
- Common evaluation metric for classification problems
- When you train and test the model on the same data, this is called the training accuracy (a short sketch follows below)
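A minimal sketch of evaluation procedure #1, assuming scikit-learn and the iris dataset used later in this post; the choice of KNN with K=5 is only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Train the model on the entire dataset.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Test the model on the same dataset: this is the training accuracy.
y_pred = knn.predict(X)
print(accuracy_score(y, y_pred))
```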
Problems with training and testing on the same data
- Goal is to estimate likely performance of a model on out-of-sample data.
- But, maximizing training accuracy rewards overly complex models that won't necessarily generalize.
- Unnecessarily complex models overfit the training data (in the accompanying figure, the green line shows an overfit model, while the black line shows the best fit)
Prediction outside the sample data
Evaluation procedure #2: Train/test split
- Split the dataset into two pieces: a training set and a testing set.
- Train the model on the training set.
- Test the model on the testing set, and evaluate how well we did.
What did this accomplish?
- Model can be trained and tested on different data
- Response values are known for the testing set, and thus predictions can be evaluated
- Testing accuracy is a better estimate than training accuracy of out-of-sample performance
Train the model with the training dataset, then test it with the testing dataset, and finally evaluate the model.
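A minimal sketch of evaluation procedure #2 (train/test split), again assuming scikit-learn and the iris dataset; the test_size and random_state values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: split the dataset into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# Step 2: train the model on the training set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Step 3: test the model on the testing set and evaluate the accuracy.
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
```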
Training accuracy rises as model complexity increases, while testing accuracy penalizes models that are too complex or not complex enough.
For KNN models, complexity is determined by the value of K (a lower value means a more complex model). The sketch below captures testing accuracy for a range of K values.
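A rough sketch (not the exact code used here) of capturing testing accuracy for different values of K, reusing the same illustrative train/test split as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

# Lower K means a more complex model; the plot shows how testing accuracy
# penalizes models that are too complex or not complex enough.
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing accuracy')
plt.show()
```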
Motivation: Need a way to choose between machine learning models
- Goal is to estimate likely performance of a model on out-of-sample data
Initial idea: Train and test on the same data
- But, maximizing training accuracy rewards overly complex models which overfit the training data
Alternative idea: Train/test split
- Split the dataset into two pieces, so that the model can be trained and tested on different data
- Testing accuracy is a better estimate than training accuracy of out-of-sample performance
- But, it provides a high variance estimate since changing which observations happen to be in the testing set can significantly change testing accuracy
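A small illustration of the high-variance point, assuming scikit-learn: changing which observations land in the testing set (via random_state) can noticeably change the testing accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The same model, evaluated on five different random train/test splits,
# produces five different testing accuracies.
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(seed, knn.score(X_test, y_test))
```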
Steps for K-fold cross-validation
1. Split the dataset into K equal partitions (or "folds").
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
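A hedged sketch of these steps using scikit-learn's cross_val_score, which performs the fold splitting, training, and accuracy averaging for us (cv=5 and K=5 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: one testing accuracy per fold.
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print(scores)

# Use the average testing accuracy as the estimate of out-of-sample accuracy.
print(scores.mean())
```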
5-fold cross-validation:
- Dataset contains 25 observations (numbered 0 through 24)
- 5-fold cross-validation, so the process runs for 5 iterations
- For each iteration, every observation is either in the training set or the testing set, but not both
- Every observation is in the testing set exactly once
[Diagram: K-fold cross-validation with 5 folds]
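A small sketch reproducing what the diagram shows: for a dataset with 25 observations, KFold with 5 splits puts every observation in the testing set exactly once across the 5 iterations.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)  # 25 observations, numbered 0 through 24
kf = KFold(n_splits=5, shuffle=False)

# Print the testing set for each of the 5 iterations.
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f'Iteration {i}: testing set = {test_idx}, training set size = {len(train_idx)}')
```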
Comparing cross-validation to train/test split
Advantages of cross-validation:
- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)
Advantages of train/test split:
- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process
Cross-validation example: parameter tuning
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset
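A hedged sketch of hyperparameter tuning with 10-fold cross-validation, assuming scikit-learn and the iris dataset; the range of K values tried is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# The K with the highest mean cross-validated accuracy is the tuned value.
best_k = k_range[k_scores.index(max(k_scores))]
print(best_k, max(k_scores))
```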
Cross-validation example: model selection
Goal: Compare the best KNN model with logistic regression on the iris dataset
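A minimal sketch of model selection via 10-fold cross-validation; n_neighbors=20 is only an assumed "best" K from the tuning step, not a value given in this post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=20)
logreg = LogisticRegression(max_iter=1000)

# Compare mean cross-validated accuracy of the two models.
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
```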