How can I set maxDF on pyspark.ml.feature.CountVectorizer when it has no maxDF parameter?

My program already works nicely using CountVectorizer from the pyspark.ml package. However, this CountVectorizer doesn't have a maxDF parameter like the CountVectorizer in the sklearn.feature_extraction.text package, which removes terms that appear in too many documents. Is there any way to apply that to CountVectorizer from the pyspark.ml package?

1 answer

  • answered 2018-11-08 09:20 user10465355

    The maxDF Param was added in Spark 2.4.0 (not released officially yet, but already available from PyPI and the Apache Software Foundation archives):

    • SPARK-23166 - Add maxDF Parameter to CountVectorizer
    • SPARK-23615 - Add maxDF Parameter to Python CountVectorizer

    and can be used as any other Param:

    from pyspark.ml.feature import CountVectorizer
    
    vectorizer = CountVectorizer(maxDF=99)
    

    or

    vectorizer = CountVectorizer().setMaxDF(99)
    

    To use it you'll have to either update Spark to 2.4.0 or later, or backport the corresponding PRs and build Spark from source.
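
To make the effect of maxDF concrete, here is a minimal pure-Python sketch of the filtering rule as the Spark documentation describes it: a value of 1 or more is read as an absolute number of documents, and a value below 1 as a fraction of documents. The function name `filter_by_max_df` and the toy `docs` list are hypothetical and only serve the illustration. This is not how Spark implements it, and it ignores the other vocabulary settings such as minDF and vocabSize.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Return the sorted terms whose document frequency does not exceed max_df."""
    # Document frequency: number of documents that contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: values >= 1 are an absolute document count,
    # values below 1 are a fraction of the total number of documents.
    threshold = max_df if max_df >= 1 else max_df * len(docs)
    return sorted(term for term, count in df.items() if count <= threshold)


# Hypothetical toy corpus: "a" is in 3 documents, "b" in 2, "c" in 1.
docs = [["a", "b", "c"], ["a", "b"], ["a"]]

print(filter_by_max_df(docs, 2))    # keeps terms found in at most 2 documents
print(filter_by_max_df(docs, 0.5))  # keeps terms found in at most 50% of documents
```

With this rule, `maxDF=99` in the snippets above would drop every term that occurs in more than 99 documents.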