How can I voluntarily overfit my model for text classification

I would like to show an example of a model that overfits its test set and does not generalize well to future data.

I split the news dataset into 3 sets:

train set length: 11314
test set length: 5500
future set length: 2031

I am using a text dataset and building a CountVectorizer. I am running a grid search (without cross-validation); each loop tests some parameters of the vectorizer ('min_df', 'max_df') and some parameters of my LogisticRegression model ('C', 'fit_intercept', 'tol', ...). The best result I get is:

({'binary': False, 'max_df': 1.0, 'min_df': 1},
 {'C': 0.1, 'fit_intercept': True, 'tol': 0.0001},
 test set score: 0.64018181818181819,
 training set score: 0.92902598550468451)
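
Roughly, the loop looks like this (a simplified sketch, not my exact code; the grids and the variable names train_texts, y_train, test_texts, y_test are placeholders):

    from itertools import product

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Placeholder grids; the real search tests more values.
    vec_grid = {"binary": [True, False], "min_df": [1, 5], "max_df": [0.5, 1.0]}
    clf_grid = {"C": [0.1, 1.0, 10.0], "fit_intercept": [True, False], "tol": [1e-4]}

    def param_combos(grid):
        # Yield every combination of the grid as a dict of keyword arguments.
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

    best = None
    for vec_params in param_combos(vec_grid):
        vectorizer = CountVectorizer(**vec_params)
        X_tr = vectorizer.fit_transform(train_texts)   # raw training documents
        X_te = vectorizer.transform(test_texts)        # raw test documents
        for clf_params in param_combos(clf_grid):
            clf = LogisticRegression(**clf_params)
            clf.fit(X_tr, y_train)
            score = clf.score(X_te, y_test)            # selection is done on the test set, no CV
            if best is None or score > best[0]:
                best = (score, vec_params, clf_params)

    print(best)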

But if I now run it on the future set, I get a score similar to the test set score:

clf.score(X_future, y_future): 0.6509108813392418

How can I demonstrate a case where I have overfitted my test set, so that the model does not generalize well to future data?

2 answers

  • answered 2018-03-13 22:26 alsora

    You have a model trained on some data "train set".

    Performing a classification task on these data you get a score of 92%.

    Then you take new data that was not seen during training, such as the "test set" or the "future set".

    Performing a classification task on either of these unseen datasets, you get a score of 65%.

    This is exactly the definition of a model that is overfitting: it has very high variance, i.e. a large gap in performance between seen and unseen data.

    By the way, in your specific case, some parameter choices that could cause overfitting are the following (see the sketch after this list):

    • min_df = 0
    • a high C value for logistic regression (which means low regularization)
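
    For example, a configuration along these lines (just a sketch, assuming the same CountVectorizer + LogisticRegression setup and the same train_texts / test_texts placeholders as in the question) keeps every rare term and makes regularization almost negligible, which should push the training score up at the expense of unseen data:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression

        # Keep every term, even those appearing in a single document,
        # and use a very large C so the L2 penalty is almost negligible.
        vectorizer = CountVectorizer(min_df=1, max_df=1.0)
        clf = LogisticRegression(C=1e6)

        X_tr = vectorizer.fit_transform(train_texts)
        clf.fit(X_tr, y_train)

        print("train score:", clf.score(X_tr, y_train))
        print("test score: ", clf.score(vectorizer.transform(test_texts), y_test))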

  • answered 2018-03-13 23:45 Tryer

    I wrote a comment on alsora's answer but I think I really should expand on it as an actual answer.

    As I said, there is no way to "over-fit" the test set because over-fit implies something negative. A theoretical model that fits the test set at 92% but fits the training set to only 65% is a very good model indeed (assuming your sets are balanced).

    I think what you are referring to as your "test set" might actually be a validation set, and your "future set" is actually the test set. Let's clarify.

    You have a set of 18,845 examples. You divide them into 3 sets.

    Training set: The examples the model gets to look at and learn from. Every time your model makes a guess on this set, you tell it whether it was right or wrong, and it adjusts accordingly.

    Validation set: After every epoch (one pass through the training set), you check the model on these examples, which it has never seen before. You compare the training loss and accuracy to the validation loss and accuracy. If the training accuracy is much higher than the validation accuracy (or the training loss much lower than the validation loss), your model is over-fitting, and training should stop. You can either stop it early (early stopping) or add dropout. You should not give feedback to your model based on examples from the validation set. As long as you follow the above rule, and as long as your validation set is well-mixed, you can't over-fit this data.

    Testing set: Used to assess the accuracy of your model once training has completed. This is the one that matters, because it's based on examples your model has never seen before. Again, you can't over-fit this data.

    Of your 18,845 examples, you have 11,314 in the training set, 5,500 in the validation set, and 2,031 in the testing set.
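
    Putting that together, a minimal sketch of the three-way split (assuming texts and labels hold all 18,845 labelled examples; the exact loading code is up to you):

        from sklearn.model_selection import train_test_split

        # First carve off the 2,031 examples reserved for the final assessment ...
        X_rest, X_test, y_rest, y_test = train_test_split(
            texts, labels, test_size=2031, random_state=0, stratify=labels
        )
        # ... then split the remainder into 11,314 training and 5,500 validation examples.
        X_train, X_val, y_train, y_val = train_test_split(
            X_rest, y_rest, test_size=5500, random_state=0, stratify=y_rest
        )

    Fit on X_train, pick hyper-parameters by comparing scores on X_val, and touch X_test exactly once, at the very end, for the number you report.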