Higher Testing Accuracy and Lower Training Accuracy

I am rather new to NLP, and I am running into a situation where my training accuracy is around 70% but my test accuracy is 80%. I have roughly 6000 entries from 2020 to use as training data and 300 entries from the first quarter of 2021 to use as test data (Q2, Q3, and Q4 data are not available yet). Each entry contains at least 2-3 paragraphs.

I have set up cross-validation using RepeatedStratifiedKFold with 10 splits and 3 repeats, and I am using GridSearchCV with C=0.1 and kernel=linear. I set up stop words (customized somewhat to include the top 100 common names, month names, and some other common words that don't mean much in my setting), lowercased everything, and used the Snowball stemmer. The resulting confusion matrix for the test set is:

[[165  34]
[ 27  96]]

with an F1 score of 81%. However, upon examining my training set, the cross-validation score had a mean and std of 0.720 (+/-0.036).
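For reference, the headline numbers can be recomputed from the confusion matrix above. This is only a sketch: it assumes the scikit-learn convention that rows are true classes and columns are predicted classes, and the helper name `f1_for_class` is made up for illustration.

```python
# Test-set confusion matrix from the post.
# Assumption: rows are true classes, columns are predicted classes.
cm = [[165, 34],
      [27, 96]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def f1_for_class(cm, k):
    """F1 score when class k is treated as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)
    return 2 * tp / (2 * tp + fp + fn)


f1_scores = [f1_for_class(cm, k) for k in range(2)]
support = [sum(row) for row in cm]  # number of true entries per class
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / total

print(f"n={total}  accuracy={accuracy:.3f}")
print(f"F1 per class: {f1_scores[0]:.3f}, {f1_scores[1]:.3f}")
print(f"macro F1={macro_f1:.3f}  weighted F1={weighted_f1:.3f}")
```

Under that assumption the accuracy and the support-weighted F1 both come out near 0.81, while the F1 for the second class alone is closer to 0.76, so the 81% figure is probably an accuracy or a weighted average rather than a single-class F1.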

I am trying to work out why there is a 9% difference between the training and test sets, with the test set getting the higher result, and I am also not sure what else I could do to further improve the accuracy.

My goal is to predict the unavailable data for Q2, Q3, and Q4, and ultimately to compare those three quarters once the data is available.

  • answered 2022-02-23 19:55 ewz93

    I am not really familiar with the model you use and might be missing something here, but it might be that your test set is not representative of the data. Perhaps something in the 2021 data makes it easier to predict.

    You might want to try something like sklearn's train_test_split() with shuffle=True to ensure the test set is a representative random subset of the data, and see whether the performance on the two sets becomes more balanced this way.
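    A minimal sketch of that check, assuming the 2020 and Q1 2021 entries are pooled into one list first. The `texts` and `labels` values here are hypothetical stand-ins, and `stratify` and `random_state` are additions to keep the class ratio and make the split repeatable:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the pooled 2020 + Q1 2021 entries and their labels.
texts = [f"entry {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # hold out 20% of the pooled data
    shuffle=True,      # draw the test set at random instead of by date
    stratify=labels,   # keep the class ratio the same in both parts
    random_state=0,    # make the split reproducible
)

print(len(X_train), len(X_test))
```

    This is meant as a diagnostic only: if the gap between the two scores shrinks on a shuffled split, that suggests the Q1 2021 entries differ from the 2020 ones.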

    Depending on which task exactly you are doing, 300 entries is really not a lot for a test set in NLP, so that small test set size alone might distort the test results.
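    To get a rough feel for how much a test set of this size can move the score, here is a sketch using the normal approximation for a proportion. It assumes an accuracy of about 0.81 on 322 entries (the total of the confusion matrix in the question) and treats the entries as independent:

```python
import math

# Assumptions: accuracy of about 0.81 measured on 322 test entries,
# and z = 1.96 for an approximate 95% interval.
accuracy = 0.81
n = 322
z = 1.96

# Standard error of a proportion under the normal approximation.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
lower = accuracy - z * standard_error
upper = accuracy + z * standard_error

print(f"approximate 95% interval: {lower:.3f} to {upper:.3f}")
```

    With these numbers the interval spans roughly four percentage points on either side of the measured accuracy, so sampling noise could plausibly explain part of the gap, though probably not all of it.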

    It is a bit difficult to give advice on how to improve the predictions without knowing what you are trying to do in general. I assume it has to do with some kind of two-class classification on stemmed tokens?

    Can you clarify, or give an example of an entry and the desired prediction?

