Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module
I was trying automated feature engineering and selection, and for that I used the Boston house price dataset available in scikit-learn:
```python
from sklearn.datasets import load_boston
import pandas as pd

data = load_boston()
x = data.data
y = data.target
y = pd.DataFrame(y)
```
Then I applied the autofeat feature transformation library to the dataset:
```python
import autofeat as af

clf = af.AutoFeatRegressor()
df = clf.fit_transform(x, y)
df = pd.DataFrame(df)
```
After this, I used SelectKBest to score each feature in relation to the label:
```python
from sklearn.feature_selection import SelectKBest, chi2

X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df, y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(10, 'Score'))
```
This gave the following error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
      1 from sklearn.feature_selection import SelectKBest, chi2
      2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
      4 dfscores = pd.DataFrame(X_new.scores_)
      5 dfcolumns = pd.DataFrame(X_new_done.columns)

ValueError: Input X must be non-negative.
```
I have a few negative numbers in my dataset. How can I overcome this problem?
You have a feature with all negative values:
```
0     -3630.638503
1     -2212.931477
2     -4751.790753
3     -3754.508972
4     -3395.387438
          ...
501   -2022.382877
502   -1407.856591
503   -2998.638158
504   -1973.273347
505   -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
```
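To locate the offending columns yourself, a quick check like this works (a minimal sketch, assuming `df` is the transformed frame from the question):

```python
# Columns containing at least one negative value
neg_cols = df.columns[(df < 0).any()]
print(neg_cols.tolist())

# Columns that are negative everywhere, such as exp(x005)*log(x000)
all_neg_cols = df.columns[(df < 0).all()]
print(all_neg_cols.tolist())
```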
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message `Input X must be non-negative` says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. This is logical, because the chi-squared test assumes a frequency distribution, and a frequency can't be a negative number. Consequently, `sklearn.feature_selection.chi2` asserts that the input is non-negative.
In many cases, it may be quite safe to simply shift each feature to make it all positive, or even to normalize it to the [0, 1] interval, as suggested by EdChum.
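For instance, a minimal sketch of both options, assuming `df` is the feature frame from the question:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Option 1: shift each feature so its minimum becomes 0
shifted = df - df.min()

# Option 2: rescale every feature into the [0, 1] interval
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df),
                      columns=df.columns, index=df.index)
```

Note that `chi2` also expects a categorical target, so with the continuous Boston prices the alternatives below are the more natural fit anyway.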
If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
- `sklearn.feature_selection.f_regression` computes the ANOVA F-value
- `sklearn.feature_selection.mutual_info_classif` computes the mutual information
Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
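As a sketch of the question's scoring step with `f_regression` in place of `chi2` (a natural choice here, since the Boston target is continuous), reusing `df` and `y` from the question:

```python
from sklearn.feature_selection import SelectKBest, f_regression
import pandas as pd

# f_regression accepts negative feature values, so no shifting is required
selector = SelectKBest(f_regression, k=20)
selector.fit(df, y.values.ravel())  # ravel: pass y as a 1-D array

featureScores = pd.DataFrame({'Specs': df.columns,
                              'Score': selector.scores_})
print(featureScores.nlargest(10, 'Score'))
```

`selector.transform(df)` then yields the 20 selected columns, just as in the original code.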