Suppose we have a dataset split into 80% for training and 20% for validation. Do you do A) or B)?

Method A)

  1. Train on 80%
  2. Validate on 20%
  3. Model is good, train on 100%.
  4. Predict test set.

Method B)

  1. Train on 80%
  2. Validate on 20%
  3. Model is good, use this model as is.
  4. Predict test set.
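The two workflows above can be sketched in code. This is a minimal illustration assuming scikit-learn with a synthetic dataset and a logistic regression; the original question names no library or model, so these are arbitrary stand-ins.

```python
# Sketch of methods A and B (assumptions: scikit-learn, synthetic data,
# LogisticRegression as an arbitrary model choice).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_test, _ = make_classification(n_samples=200, random_state=1)  # stand-in test set

# 80/20 split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 1-2 (shared by both methods): train on 80%, validate on 20%
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_acc = accuracy_score(y_val, model.predict(X_val))

# Method B: the validated 80% model predicts the test set as is
preds_b = model.predict(X_test)

# Method A: re-train on 100% with the same settings, then predict the test set
model_a = LogisticRegression(max_iter=1000).fit(X, y)
preds_a = model_a.predict(X_test)
```

The only difference is step 3: method A spends one more fit on the full data before touching the test set.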

I posted this question on Kaggle, and in this post I’ll summarize the answers.

For myself, I do A), for the following reasons aggregated from the answers:

  • More data is better. In the case of time series, including the most recent data is especially valuable.
  • Cross-validation is used to validate the hyper-parameters used to train a model, rather than the model itself. You then pick the best parameters and re-train the model on all the data.
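The second point — cross-validate the hyper-parameters, then re-fit the winner on all the data — is exactly what scikit-learn's `GridSearchCV` does with `refit=True`. A hedged sketch (dataset, model, and parameter grid are illustrative assumptions, not from the original post):

```python
# Cross-validation picks hyper-parameters; the best setting is then
# re-fit on 100% of the data, which is method A in spirit.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,        # 5-fold cross-validation scores each candidate C
    refit=True,  # the best C is then re-trained on all of X, y
)
search.fit(X, y)

# search.best_estimator_ was fit on the full dataset and is ready to
# predict the test set.
best_c = search.best_params_["C"]
```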
