sklearn.pipeline.Pipeline: Fitting CountVectorizer on a different corpus than the training text

I am going through the "Sample pipeline for text feature extraction and evaluation" example from the scikit-learn documentation. There, they show the following pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ]
)

which they later use with GridSearchCV. In the example, the CountVectorizer is fitted on the training dataset and then used to extract the features. What I want to do is fit the CountVectorizer on a bigger corpus and then apply it to the training data to obtain the feature vectors. Is there a straightforward way of doing so while maintaining the sklearn.pipeline.Pipeline API, i.e., without subclassing sklearn.pipeline.Pipeline and significantly changing its methods?

I want to maintain the sklearn.pipeline.Pipeline API as I am looking to make use of GridSearchCV and having it structured in this manner will be quite convenient and clean.

1 answer

  • answered 2022-04-28 03:46 qaiser

     from sklearn.feature_extraction.text import CountVectorizer

     # Suppose corpus is your big corpus.
     corpus = [
         'This is the first document.',
         'This document is the second document.',
         'And this is the third one.',
         'Is this the first document?',
     ]

     # First fit the vectorizer on the big corpus to learn its vocabulary.
     vectorizer = CountVectorizer()
     X = vectorizer.fit_transform(corpus)

     # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
     vocabulary = vectorizer.get_feature_names_out()

     # Now vectorize the new training data using the vocabulary learned above.
     new_train_corpus = ["how are you doing", "I am fine", "I am reading first document"]
     new_vect = CountVectorizer(vocabulary=vocabulary)  # reuse the big-corpus vocabulary
     new_vect.fit_transform(new_train_corpus)

     # Words outside the fixed vocabulary are ignored; the vectorizer uses only this vocabulary.
     list(new_vect.get_feature_names_out())
     # ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    

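    Note: if you already have a fixed list of keywords, you can pass it directly as the vocabulary. If you instead want to learn the vocabulary (for example, by fitting on the big corpus and applying feature selection), do that first and then pass the resulting vocabulary to the vectorizer you use on your training dataset.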
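
    As a minimal sketch of what a fixed vocabulary does, the snippet below passes the fitted vocabulary_ mapping (a dict of term to column index) instead of the feature-name list. The names big_corpus, big_vect, fixed_vect and train_docs are illustrative assumptions, not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-in for the big corpus.
big_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Learn the term -> column index mapping on the big corpus.
big_vect = CountVectorizer().fit(big_corpus)

# Reuse the learned mapping; with vocabulary= set, fitting learns no new terms.
fixed_vect = CountVectorizer(vocabulary=big_vect.vocabulary_)

train_docs = ["how are you doing", "I am reading first document"]
X_train = fixed_vect.fit_transform(train_docs)

# One column per big-corpus term, regardless of what train_docs contains.
print(X_train.shape)
# Per-document count of tokens that fall inside the fixed vocabulary.
print(X_train.toarray().sum(axis=1))
```

    The first training document contains no big-corpus terms, so its row is all zeros. Whether such rows are acceptable depends on how well the big corpus covers the training text.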

    The documentation shows how to use GridSearchCV with a Pipeline: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py

     pipeline = Pipeline(
         [
             ("vect", CountVectorizer(vocabulary=vocabulary)),  # pass the big-corpus vocabulary here
             ("tfidf", TfidfTransformer()),
             ("clf", yourmodel()),  # placeholder: replace with your estimator, e.g. SGDClassifier()
         ]
     )
    

    Set the parameter grid according to your needs and pass it to GridSearchCV:

     grid_search = GridSearchCV(pipeline, parameters)
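
    Putting the pieces together, an end-to-end run might look like the sketch below. The tiny big_corpus, train_docs and train_labels are made-up data for illustration, and the parameter grid is an arbitrary example rather than a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the large unlabeled corpus.
big_corpus = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stocks fell as markets closed",
    "the bank raised interest rates",
    "a dog and a cat play",
    "markets rally on rates news",
]
# Learn the vocabulary once, outside the pipeline.
vocabulary = CountVectorizer().fit(big_corpus).vocabulary_

# Hypothetical labeled training set (0 = animals, 1 = finance).
train_docs = [
    "the cat and the dog", "dogs play on the mat",
    "a cat sat", "the dog chased cats",
    "stocks and markets fell", "interest rates raised",
    "bank news on markets", "markets closed",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(vocabulary=vocabulary)),  # fixed vocabulary
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]
)

# Example grid; keys follow the "<step name>__<parameter>" convention.
parameters = {
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-4, 1e-3),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(train_docs, train_labels)

# The refitted best pipeline still uses the big-corpus vocabulary.
fitted_vect = grid_search.best_estimator_.named_steps["vect"]
print(grid_search.best_params_)
```

    With a fixed vocabulary, vocabulary-pruning options such as max_df, min_df and max_features are not applied, so including them in the grid would not change the features. The vectorizer fitted on the big corpus should also use the same tokenization settings (lowercase, ngram_range and so on) as the one in the pipeline, otherwise terms may not match.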
     
    
