What Is Cross-Validation?


There is always a need to validate the stability of a machine learning model, that is, how well it generalises to new, unseen data.

Cross-validation is a statistical method for evaluating generalisation performance in a more stable and thorough way than a single split into training and test sets. It is a model validation technique for assessing how the results of a statistical analysis will generalise to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in the real world.

The goal of cross-validation is to define a data set to test the model during the training phase, in order to limit problems like overfitting and underfitting.

What are overfitting and underfitting?

  • Underfitting refers to a model that does not capture enough of the patterns in the data. The model performs poorly on both the training and the test set.
  • Overfitting refers to a model that a) captures noise and b) captures patterns which do not generalise well to unseen data. The model performs extremely well on the training set but poorly on the test set.

  • The optimal model should perform well on both the training and the test set.

In cross-validation, instead of splitting the data once into a training set and a test set, the data is split repeatedly and multiple models are trained.

K-Fold cross-validation

The most commonly used version of cross-validation is k-fold cross-validation, where k is a user-specified number, usually five or ten. When performing five-fold cross-validation, the data is first partitioned into five parts of (approximately) equal size, called folds.

Next, a sequence of models is trained. The first model uses fold 1 as the test set and the remaining folds 2-5 as the training set: the model is built using the data in folds 2-5, and its accuracy is then evaluated on fold 1.

Then another model is built, this time using fold 2 as the test set and the data in folds 1, 3, 4, and 5 as the training set.

This process is repeated using folds 3, 4, and 5 as test sets. For each of these five splits of the data into training and test sets, we compute the accuracy. In the end, we have collected five accuracy values.
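
As a rough illustration, here is a toy sketch of this five-fold splitting scheme on ten hypothetical sample indices (plain Python, no library assumed):

# Toy illustration of the five-fold splitting scheme described above
indices = list(range(10))
k = 5
fold_size = len(indices) // k
for i in range(k):
    test = indices[i * fold_size:(i + 1) * fold_size]
    train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
    print("Split", i + 1, "- test:", test, "train:", train)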

Implementation in Python

Cross-validation is implemented in scikit-learn using the cross_val_score function from the model_selection module. The parameters of the cross_val_score function are the model we want to evaluate, the training data, and the ground-truth labels. Let’s evaluate LogisticRegression on the iris dataset:
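
A minimal sketch of that call (max_iter=1000 is an added assumption so the solver converges; the exact scores depend on your scikit-learn version):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# Trains and evaluates the model once per fold, returning one score per fold
scores = cross_val_score(logreg, iris.data, iris.target)
print(scores)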

[ 0.961 0.922 0.958]

By default, cross_val_score performs three-fold cross-validation, returning three accuracy values. (This default applies to older scikit-learn versions; from version 0.22 onwards the default is five-fold.)

We can change the number of folds used by changing the cv parameter, and a common way to summarize the cross-validation accuracy is to compute the mean:
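
Continuing the sketch above, this might look as follows:

# Five-fold cross-validation; scores is a NumPy array, so we can average it
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())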

array([ 1. , 0.967, 0.933, 0.9 , 1. ])
0.96000000000000019

Stratified K-Fold cross-validation

Splitting the dataset into k folds by starting with the first 1/k-th part of the data, as described above, might not always be a good idea. Let’s have a look at the iris dataset, for example:

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can see above, the first third of the data is class 0, the second third is class 1, and the last third is class 2. Imagine doing three-fold cross-validation on this dataset. The first fold would contain only class 0, so in the first split of the data, the test set would contain only class 0, while the training set would contain only classes 1 and 2.

As the classes in the training and test sets would be entirely different for all three splits, the three-fold cross-validation accuracy would be zero on this dataset.
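
This failure is easy to reproduce by passing a plain (non-stratified, unshuffled) KFold splitter to cross_val_score, reusing the setup from the earlier sketch:

from sklearn.model_selection import KFold

# Three unshuffled folds: each test fold holds a class never seen in training
kf = KFold(n_splits=3)
print(cross_val_score(logreg, iris.data, iris.target, cv=kf))  # [0. 0. 0.]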

For such problems, a slight variation of k-fold cross-validation is used, such that each fold contains approximately the same percentage of samples of each target class as the complete dataset. This method is known as stratified k-fold cross-validation.
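
In scikit-learn this splitter is available as StratifiedKFold (and, for classifiers, cross_val_score already stratifies by default). A minimal sketch:

from sklearn.model_selection import StratifiedKFold

# Each fold preserves the 50/50/50 class proportions of the full iris dataset
skf = StratifiedKFold(n_splits=5)
print(cross_val_score(logreg, iris.data, iris.target, cv=skf))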

Leave-One-Out cross-validation

Another frequently used cross-validation method is leave-one-out. You can think of leave-one-out cross-validation as k-fold cross-validation where each fold is a single sample. For each split, you pick a single data point to be the test set. This can be very time-consuming, in particular for large datasets, but sometimes provides better estimates on small datasets.
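
A sketch using scikit-learn's LeaveOneOut splitter with the same setup as above; with the 150 iris samples this trains 150 models, one per held-out point:

from sklearn.model_selection import LeaveOneOut

# The test set is a single data point on each of the 150 splits
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print(scores.mean())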