Problem with negative numbers in sklearn.feature_selection.SelectKBest feature scoring module
I was trying out automatic feature engineering and selection, and for that I used the Boston house price dataset available in sklearn.
from sklearn.datasets import load_boston
import pandas as pd
data = load_boston()
x = data.data
y = data.target
y = pd.DataFrame(y)
Then I applied the autofeat feature transformation library to the dataset.
import autofeat as af
clf = af.AutoFeatRegressor()
df = clf.fit_transform(x,y)
df = pd.DataFrame(df)
After this, I used SelectKBest to score each feature against the label.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=20)
X_new_done = X_new.fit_transform(df,y)
dfscores = pd.DataFrame(X_new.scores_)
dfcolumns = pd.DataFrame(X_new_done.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(10,'Score'))
This gave the following error.

ValueError Traceback (most recent call last)
<ipython-input-16-b0fa1556bdef> in <module>()
      1 from sklearn.feature_selection import SelectKBest, chi2
      2 X_new = SelectKBest(chi2, k=20)
----> 3 X_new_done = X_new.fit_transform(df,y)
      4 dfscores = pd.DataFrame(X_new.scores_)
      5 dfcolumns = pd.DataFrame(X_new_done.columns)

ValueError: Input X must be non-negative.
I had a few negative numbers in my dataset. So how can I overcome this problem?
Note: df has no transformations of y; it only contains transformations of x.
1 answer

You have a feature with all negative values:
df['exp(x005)*log(x000)']
returns
0     -3630.638503
1     -2212.931477
2     -4751.790753
3     -3754.508972
4     -3395.387438
          ...
501   -2022.382877
502   -1407.856591
503   -2998.638158
504   -1973.273347
505   -1267.482741
Name: exp(x005)*log(x000), Length: 506, dtype: float64
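To find every offending column rather than inspecting them one by one, a check along these lines may help. It is only a sketch: the small frame and its column names are made up for illustration and stand in for the autofeat output.

```python
import pandas as pd

# Toy frame standing in for the transformed features (hypothetical names).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [-1.0, -2.0, -3.0],
    "c": [0.5, -0.5, 1.5],
})

# Columns containing at least one negative value; chi2 rejects all of these.
mins = df.min()
any_negative = mins[mins < 0].index.tolist()

# Columns where every value is negative.
all_negative = df.columns[(df < 0).all()].tolist()

print("any negative:", any_negative)
print("all negative:", all_negative)
```

Any column listed in any_negative is enough to trigger the ValueError, even if most of its values are positive.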
Quoting another answer (https://stackoverflow.com/a/46608239/5025009):
The error message "Input X must be non-negative" says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. That is logical, because the chi-square test assumes a frequency distribution, and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.

In many cases, it may be quite safe to simply shift each feature to make it all positive, or even normalize to the [0, 1] interval as suggested by EdChum.

If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features:
sklearn.feature_selection.f_regression computes the ANOVA F-value
sklearn.feature_selection.mutual_info_classif computes the mutual information

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
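As a rough sketch of the "pick another statistic" route, the snippet below swaps chi2 for f_regression, which accepts negative inputs and suits a continuous target like house prices. The data is synthetic and the column names are hypothetical; it is not the asker's actual autofeat output. It also takes the feature names from df.columns, since fit_transform returns a NumPy array with no .columns attribute.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the transformed features (hypothetical names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x0": rng.normal(size=200),
    "x1": rng.normal(size=200),
    "neg_feature": -np.exp(rng.normal(size=200)),  # strictly negative column
})
# Continuous target that depends on x0 and on the negative feature.
y = 3.0 * df["x0"] + 0.5 * df["neg_feature"] + rng.normal(scale=0.1, size=200)

# f_regression does not require non-negative X, unlike chi2.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(df, y)

# Pair each score with its column name, taken from the input frame.
feature_scores = pd.DataFrame({"Specs": df.columns, "Score": selector.scores_})
print(feature_scores.nlargest(2, "Score"))

# Names of the columns SelectKBest would keep.
selected = df.columns[selector.get_support()].tolist()
print(selected)
```

If mutual information is preferred for a continuous target, sklearn.feature_selection.mutual_info_regression can be passed as score_func in the same way.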