What's the right way to insert a CalibratedClassifierCV in a scikit-learn pipeline?

I am trying to add a calibration step to a sklearn pipeline so that the classifier is calibrated and its output probabilities are more trustworthy.

So far I clumsily tried to insert a 'calibration' step using CalibratedClassifierCV along the lines of (silly example for reproducibility):

import sklearn.datasets
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV

data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data = np.c_[data['data'], data['target']])\
       .rename({0:'text', 1:'class'}, axis = 'columns')

my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', SGDClassifier(loss='modified_huber')),
    ('calibrator', CalibratedClassifierCV(cv=5, method='isotonic'))
])

my_pipeline.fit(df['text'].values, df['class'].values)

but that doesn't work (at least not in this way). Does anyone have tips about how to properly do this?

1 answer

  • answered 2018-04-14 15:31 Ami Tavory

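    The SGDClassifier object should go into the CalibratedClassifierCV's base_estimator argument (renamed to estimator in scikit-learn 1.2). CalibratedClassifierCV is a meta-estimator: it wraps another classifier rather than following it as a separate pipeline step. Your original pipeline fails because every step except the last must be a transformer, and SGDClassifier has no transform method.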

    Your code should probably look something like this:

    my_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer()),
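        # Wrapped classifier passed positionally: the keyword is base_estimator before scikit-learn 1.2 and estimator from 1.2 on.
        ('classifier', CalibratedClassifierCV(SGDClassifier(loss='modified_huber'), cv=5, method='isotonic'))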
    ])
    

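    For reference, here is a minimal end-to-end sketch of the wrapping approach described above. It is an illustration under some assumptions rather than a definitive setup: the toy corpus, the labels, the random_state and the name calibrated_pipeline are made up so the snippet runs without downloading 20 newsgroups. The wrapped SGDClassifier is passed positionally so the same call should work whether the keyword is base_estimator or estimator in your scikit-learn version.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the 20 newsgroups data,
# so the sketch runs without a download.
texts = (["rocket orbit launch satellite moon"] * 10
         + ["church faith belief religion god"] * 10)
labels = [0] * 10 + [1] * 10  # 0 = space, 1 = religion

calibrated_pipeline = Pipeline([
    # Step 1: a transformer that turns raw text into TF-IDF features.
    ('vectorizer', TfidfVectorizer()),
    # Step 2: the final estimator. The calibrator wraps the classifier
    # and runs its own 5-fold cross-validation on the TF-IDF features.
    ('classifier', CalibratedClassifierCV(
        SGDClassifier(loss='modified_huber', random_state=0),
        cv=5, method='isotonic')),
])

calibrated_pipeline.fit(texts, labels)

# predict_proba returns one row per document and one column per class.
proba = calibrated_pipeline.predict_proba(["satellite launch", "faith and belief"])
print(proba.shape)  # (2, 2)
```

    With the question's data you would call fit on df['text'].values and df['class'].values instead of the toy lists. Note that in this layout the vectorizer is fitted once on all the training text, and only the classifier and calibrator are cross-validated.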