Is there a way to use mutual information as part of a pipeline in scikit-learn?

I'm creating a model with scikit-learn. The pipeline that seems to be working best is:

  1. mutual_info_classif with a threshold
  2. PCA
  3. LogisticRegression

I'd like to do them all using sklearn's Pipeline object, but I'm not sure how to include the mutual_info_classif step. For the second and third steps I do:

pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline([
        ('dim_red', pca),
        ('pred', lr)])

But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?

2 answers

  • answered 2022-05-06 15:37 user2246849

    You can implement your own estimator by subclassing BaseEstimator and storing the mutual information scores in feature_importances_ when it is fitted. Then you can pass it as the estimator of a SelectFromModel instance, which can be used in a Pipeline:

    from sklearn.feature_selection import SelectFromModel, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.base import BaseEstimator
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    X = [[ 0.87, -1.34,  0.31 ],
         [-2.79, -0.02, -0.85 ],
         [-1.34, -0.48, -2.55 ],
         [ 1.92,  1.48,  0.65 ]]
    y = [0, 1, 0, 1]
    class MutualInfoEstimator(BaseEstimator):
        def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
            self.discrete_features = discrete_features
            self.n_neighbors = n_neighbors
            self.copy = copy
            self.random_state = random_state
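        def fit(self, X, y):
            # Store the MI scores as feature_importances_ so SelectFromModel can threshold them.
            self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features,
                                                            n_neighbors=self.n_neighbors,
                                                            copy=self.copy, random_state=self.random_state)
            return self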
    feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
    pca = PCA(random_state=100)
    lr = LogisticRegression(random_state=200)
    pipe = Pipeline([
            ('feat_sel', feat_sel),
            ('pca', pca),
            ('pred', lr)])

    Note that of course the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them.

    Yeah, I do not think there is another way to do it. At least not that I know!
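
    To illustrate the note about exposing parameters, here is a rough sketch of how the threshold and n_neighbors could be tuned with GridSearchCV. The synthetic dataset, the grid values and cv=3 are my own assumptions for illustration, and the estimator is a trimmed-down copy of the class above, so treat it as a starting point rather than a tested recipe:

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MutualInfoEstimator(BaseEstimator):
    """Trimmed-down version of the estimator above (only two parameters exposed)."""

    def __init__(self, n_neighbors=3, random_state=None):
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        # SelectFromModel reads feature_importances_ from the fitted estimator.
        self.feature_importances_ = mutual_info_classif(
            X, y, n_neighbors=self.n_neighbors, random_state=self.random_state
        )
        return self


# Synthetic data, made up purely for illustration.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

pipe = Pipeline([
    ('feat_sel', SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
    ('pca', PCA(random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])

# Nested parameters are addressed as <step>__<param>; the wrapped estimator's
# parameters sit one level deeper under feat_sel__estimator__<param>.
param_grid = {
    'feat_sel__threshold': ['mean', 'median'],
    'feat_sel__estimator__n_neighbors': [3, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

    SelectFromModel's threshold also accepts a plain float, so a fixed mutual information cut-off could go into the same grid. Because the scoring is refit inside each fold, the selection never sees the validation data.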

  • answered 2022-05-07 06:20 Sanjar Adilov

    How about SelectKBest or SelectPercentile? Both accept mutual_info_classif as their score_func:

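    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    # Keep the 10 features with the highest mutual information scores.
    mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
    pca = PCA(random_state=100)
    lr = LogisticRegression(random_state=200)
    pipe = Pipeline([
            ('select', mi_best),
            ('dim_red', pca),
            ('pred', lr)])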
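
    One thing to watch with this approach: mutual_info_classif is randomised, and SelectKBest has no random_state of its own to pass through. A possible workaround, sketched here on made-up synthetic data (the dataset and n_components=5 are assumptions for illustration), is to bind random_state with functools.partial:

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data, made up purely for illustration.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Bind random_state so the mutual information scores are repeatable across fits.
score_func = partial(mutual_info_classif, random_state=0)

pipe = Pipeline([
    ('select', SelectKBest(score_func=score_func, k=10)),
    ('dim_red', PCA(n_components=5, random_state=100)),
    ('pred', LogisticRegression(random_state=200)),
])
pipe.fit(X, y)

# Boolean mask over the original 20 columns; True marks a kept feature.
kept_mask = pipe.named_steps['select'].get_support()
predictions = pipe.predict(X)
```

    Note that SelectKBest keeps a fixed number of features, not everything above a score threshold, so it only approximates the "mutual_info_classif with a threshold" step from the question.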
