How to set maxDF on PySpark's CountVectorizer even though it has no maxDF parameter?

My program already works well using CountVectorizer from PySpark. However, this CountVectorizer doesn't have a maxDF parameter like the CountVectorizer in the sklearn.feature_extraction.text package, which removes terms that appear too frequently across the list of documents. Is there any way to apply that to PySpark's CountVectorizer?

1 answer

  • answered 2018-11-08 09:20 user10465355

    The maxDF Param has been included in Spark 2.4.0 (not officially released yet, but already available from PyPI and the Apache Foundation archives):

    • SPARK-23166 - Add maxDF Parameter to CountVectorizer
    • SPARK-23615 - Add maxDF Parameter to Python CountVectorizer

    and can be used like any other Param:

    from pyspark.ml.feature import CountVectorizer

    # Pass maxDF to the constructor
    vectorizer = CountVectorizer(maxDF=99)

    or

    # Set it afterwards with the setter
    vectorizer = CountVectorizer().setMaxDF(99)

    To use it you'll have to either update Spark to 2.4.0 or later, or backport the corresponding PRs and build Spark from source.
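
To illustrate what the parameter does, here is a minimal plain-Python sketch of the same kind of document-frequency filter. It assumes the semantics documented for Spark's maxDF: a value >= 1 is an absolute number of documents, and a value in [0, 1) is a fraction of all documents. The helper name `filter_by_max_df` is hypothetical and not part of Spark; it only shows the idea and could be adapted as a pre-processing step on older versions.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Drop terms whose document frequency exceeds max_df.

    docs is a list of token lists. Assumed semantics: max_df >= 1 is an
    absolute document count, max_df in [0, 1) is a fraction of all documents.
    """
    # Document frequency: in how many documents each term occurs at least once
    df = Counter(term for doc in docs for term in set(doc))
    threshold = max_df if max_df >= 1 else max_df * len(docs)
    # Keep only terms that occur in at most `threshold` documents
    return [[term for term in doc if df[term] <= threshold] for doc in docs]


docs = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
print(filter_by_max_df(docs, 2))    # "a" occurs in all 3 documents and is dropped
print(filter_by_max_df(docs, 0.5))  # threshold is 1.5 documents, so "a" and "b" are dropped
```

The filtered token lists can then be fed to a vectorizer that lacks the parameter.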