Once you have defined your problem and prepared your data, you need to apply machine learning algorithms to that data in order to solve your problem. You can spend a lot of time choosing, running, and tuning algorithms, so you want to make sure you are using your time effectively to get closer to your goal.
Evaluation procedure #1: Train and test on the entire dataset
- Train the model on the entire dataset.
- Test the model on the same dataset, and evaluate how well we did by comparing the predicted response values with the true response values.
Classification accuracy:
- Proportion of correct predictions
- Common evaluation metric for classification problems
- When you train and test the model on the same data, this is called the training accuracy (a short sketch follows below)
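A minimal sketch of evaluation procedure #1, assuming scikit-learn and the iris dataset used later in this post; the choice of KNN with K=5 is only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Train the model on the entire dataset.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Test the model on the same dataset: this is the training accuracy.
y_pred = knn.predict(X)
print(accuracy_score(y, y_pred))
```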
Problems with training and testing on the same data
- Goal is to estimate likely performance of a model on out-of-sample data.
- But, maximizing training accuracy rewards overly complex models that won't necessarily generalize.
- Unnecessarily complex models overfit the training data (in the accompanying figure, the green line shows an overfit model, while the black line shows the best fit)
Prediction outside the sample data
Evaluation procedure #2: Train/test split
- Split the dataset into two pieces: a training set and a testing set.
- Train the model on the training set.
- Test the model on the testing set, and evaluate how well we did.
What did this accomplish?
- Model can be trained and tested on different data
- Response values are known for the testing set, and thus predictions can be evaluated
- Testing accuracy is a better estimate than training accuracy of out-of-sample performance
Train the model with the training dataset, then test it with the testing dataset, and finally evaluate the model.
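A minimal sketch of evaluation procedure #2 (train/test split), again assuming scikit-learn and the iris dataset; the test_size and random_state values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: split the dataset into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# Step 2: train the model on the training set.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Step 3: test the model on the testing set and evaluate the accuracy.
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
```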
Training accuracy rises as model complexity increases, while testing accuracy penalizes models that are too complex or not complex enough.
For KNN models, complexity is determined by the value of K (a lower value means a more complex model). The sketch below captures testing accuracy for a range of K values.
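A rough sketch (not the exact code used here) of capturing testing accuracy for different values of K, reusing the same illustrative train/test split as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

# Lower K means a more complex model; the plot shows how testing accuracy
# penalizes models that are too complex or not complex enough.
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing accuracy')
plt.show()
```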
Motivation: Need a way to choose between machine learning models
- Goal is to estimate likely performance of a model on out-of-sample data
Initial idea: Train and test on the same data
- But, maximizing training accuracy rewards overly complex models which overfit the training data
Alternative idea: Train/test split
- Split the dataset into two pieces, so that the model can be trained and tested on different data
- Testing accuracy is a better estimate than training accuracy of out-of-sample performance
- But, it provides a high variance estimate since changing which observations happen to be in the testing set can significantly change testing accuracy
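A small illustration of the high-variance point, assuming scikit-learn: changing which observations land in the testing set (via random_state) can noticeably change the testing accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The same model, evaluated on five different random train/test splits,
# produces five different testing accuracies.
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(seed, knn.score(X_test, y_test))
```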
Steps for K-fold cross-validation
1. Split the dataset into K equal partitions (or "folds").
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
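A hedged sketch of these steps using scikit-learn's cross_val_score, which performs the fold splitting, training, and accuracy averaging for us (cv=5 and K=5 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: one testing accuracy per fold.
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print(scores)

# Use the average testing accuracy as the estimate of out-of-sample accuracy.
print(scores.mean())
```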
5-fold cross-validation:
- Dataset contains 25 observations (numbered 0 through 24)
- 5-fold cross-validation, so the process runs for 5 iterations
- For each iteration, every observation is either in the training set or the testing set, but not both
- Every observation is in the testing set exactly once
[Diagram: K-fold cross-validation with 5 folds]
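A small sketch reproducing what the diagram shows: for a dataset with 25 observations, KFold with 5 splits puts every observation in the testing set exactly once across the 5 iterations.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)  # 25 observations, numbered 0 through 24
kf = KFold(n_splits=5, shuffle=False)

# Print the testing set for each of the 5 iterations.
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f'Iteration {i}: testing set = {test_idx}, training set size = {len(train_idx)}')
```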
Comparing cross-validation to train/test split
Advantages of cross-validation:
- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)
Advantages of train/test split:
- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process
Cross-validation example: parameter tuning
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset
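A hedged sketch of hyperparameter tuning with 10-fold cross-validation, assuming scikit-learn and the iris dataset; the range of K values tried is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# The K with the highest mean cross-validated accuracy is the tuned value.
best_k = k_range[k_scores.index(max(k_scores))]
print(best_k, max(k_scores))
```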
Cross-validation example: model selection
Goal: Compare the best KNN model with logistic regression on the iris dataset
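A minimal sketch of model selection via 10-fold cross-validation; n_neighbors=20 is only an assumed "best" K from the tuning step, not a value given in this post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=20)
logreg = LogisticRegression(max_iter=1000)

# Compare mean cross-validated accuracy of the two models.
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
```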