Is it bad to not remove stopwords when I've already set a ceiling on document frequency?

I'm using sklearn.feature_extraction.text.TfidfVectorizer to process text. It seems standard to remove stop words. However, if I've already set a ceiling on document frequency, so that tokens appearing in a large percentage of the documents are excluded (e.g. max_df=0.8), dropping stop words doesn't seem necessary. Theoretically, stop words are words that appear so often that they should be excluded anyway. This way, we don't have to debate what belongs in our stop word list, right? It's my understanding that there is disagreement over which words are used often enough to be considered stop words. For example, scikit-learn includes "whereby" in its built-in list of English stop words.

1 answer

  • answered 2019-07-10 23:54 OmG

    You are right; that is effectively the definition of stop words. However, keep in mind that one reason to remove stop words in a first pass is to avoid counting them at all, which reduces computation time.

    In other words, your intuition about stop words is correct.
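    The computation-time point can be sketched as follows (the toy corpus is my own illustration). With stop_words, tokens are discarded during tokenization; with max_df alone, every token is first counted and only then pruned, and the pruned terms are recorded in the fitted vectorizer's stop_words_ attribute, showing they were processed first:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus: "the" appears in every document.
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "the bird flew over the house",
    ]

    # Tokens on the built-in list are dropped during tokenization, before counting.
    with_list = TfidfVectorizer(stop_words="english").fit(corpus)

    # Every token is counted first, then terms above the ceiling are pruned.
    with_ceiling = TfidfVectorizer(max_df=0.7).fit(corpus)

    # Both approaches end up without "the" in the final vocabulary...
    print("the" in with_list.vocabulary_, "the" in with_ceiling.vocabulary_)
    # prints: False False

    # ...but max_df keeps a record of the pruned terms, evidence they were counted.
    print(sorted(with_ceiling.stop_words_))  # prints: ['the']
    ```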