Tuesday, May 3, 2011

n-fold Cross Validation

Collecting a training dataset is often a painstaking process. Even worse, a conventional holdout split reserves roughly one-third of that hard-won data for testing alone, so it never contributes to training. And if one cheats by tuning parameters against the held-out test data, the result is overfitting.

One remedy for this problem is to use n-fold cross-validation.

In n-fold cross-validation, the original sample is randomly partitioned into n subsamples. Of the n subsamples, a single subsample is retained as the validation data for testing the model, and the remaining n − 1 subsamples are used as training data. The cross-validation process is then repeated n times (the folds), with each of the n subsamples used exactly once as the validation data. The n results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once [1].
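
To make the procedure concrete, here is a minimal Python sketch of the partition-train-validate loop described above. The `train_and_score` callable is a hypothetical stand-in for whatever model fitting and scoring one actually uses.

```python
import random

def n_fold_scores(data, n, train_and_score):
    """Return one validation score per fold of n-fold cross-validation.

    `data` is a list of labeled examples; `train_and_score` is a
    hypothetical callable that fits a model on the training examples
    and returns its score on the validation examples.
    """
    shuffled = data[:]                  # leave the caller's list intact
    random.shuffle(shuffled)            # randomly partition the sample
    folds = [shuffled[i::n] for i in range(n)]

    scores = []
    for i in range(n):
        validation = folds[i]           # each fold validates exactly once
        training = [x for j, fold in enumerate(folds)
                    if j != i for x in fold]   # the remaining n - 1 folds
        scores.append(train_and_score(training, validation))
    return scores
```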

As suggested by Prof. Belongie, 5-fold cross-validation provides a decent estimate of both the average performance and its standard deviation. But this step comes only after collecting the data and establishing ground-truth labels.
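
With the sketch above, the 5-fold estimate and its spread are simply the mean and standard deviation of the per-fold scores. The numbers below are made up purely for illustration:

```python
import statistics

# Hypothetical per-fold accuracies from a 5-fold run.
scores = [0.81, 0.78, 0.84, 0.80, 0.79]
print(statistics.mean(scores))     # combined estimate: 0.804
print(statistics.stdev(scores))    # spread across the 5 folds
```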

[1] McLachlan, G.J.; Do, K.A.; Ambroise, C. (2004). Analyzing Microarray Gene Expression Data. Wiley.
