Calculate residual deviance from a scikit-learn logistic regression model
Is there any way to calculate the residual deviance of a scikit-learn logistic regression model? This is a standard output from R model summaries, but I couldn't find it anywhere in sklearn's documentation.
1 answer

You cannot do this in scikit-learn, but check out statsmodels'
GLMResults
(API)
See also questions close to this topic

5-fold cross validation of time series data
I'm doing some electricity load forecasting in which I need to split the dataset into custom training and testing sets. For this I've got 3 years of electricity load data, from 01-01-2015 00:30:00 to 01-01-2018 00:00:00. The load is recorded every half hour. Now I want to split the training and evaluation periods into 312 sets of 84 consecutive hours (36+48) and use them to define the 5-fold validation splitting.
One can notice that odd-numbered sets start at 00:00 and even-numbered sets start at 12:00. This also happens with the missing spots in the evaluation period. As for the 5-fold splitting, it has to be done like this:
- Validation folds: fold 1 contained sets 1, 6, 11 and so on as the validation set and the remaining sets as the training set. Fold 2 contained sets 2, 7, 12, ... as validation, and analogous splittings were done for folds 3, 4 and 5.
- Test fold: contained all available filled data and was used to build models to predict missing data.
I'm stuck with splitting the dataset to form the validation folds. I tried KFold and TimeSeriesSplit from sklearn but didn't get exactly what I wanted.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data preprocessing
state = {0: 'NSW', 1: 'QLD', 2: 'SA', 3: 'TAS', 4: 'VIC'}
year = {0: '2015', 1: '2016', 2: '2017'}
#year = {0: '2017'}
df_nsw = pd.DataFrame()
df_qld = pd.DataFrame()
df_sa = pd.DataFrame()
df_tas = pd.DataFrame()
df_vic = pd.DataFrame()
df_nsw_test = pd.DataFrame()
df_qld_test = pd.DataFrame()
df_sa_test = pd.DataFrame()
df_tas_test = pd.DataFrame()
df_vic_test = pd.DataFrame()
df = {'NSW': df_nsw, 'QLD': df_qld, 'SA': df_sa, 'TAS': df_tas, 'VIC': df_vic}
df_test = {'NSW': df_nsw_test, 'QLD': df_qld_test, 'SA': df_sa_test, 'TAS': df_tas_test, 'VIC': df_vic_test}

for st in state.values():
    for ye in year.values():
        for mn in range(1, 13):
            if mn < 10:
                dataset = pd.read_csv('./datasets/train/' + st + '/PRICE_AND_DEMAND_' + ye + '0' + str(mn) + '_' + st + '1.csv')
            else:
                dataset = pd.read_csv('./datasets/train/' + st + '/PRICE_AND_DEMAND_' + ye + str(mn) + '_' + st + '1.csv')
            df[st] = df[st].append(dataset.iloc[:, 1:3])
    df[st] = df[st].set_index('SETTLEMENTDATE')

for st in state.values():
    dataset = pd.read_csv('./datasets/test/' + st + '/PRICE_AND_DEMAND_201801_' + st + '1.csv')
    df_test[st] = df_test[st].append(dataset.iloc[:, 1:3])
    df_test[st] = df_test[st].set_index('SETTLEMENTDATE')

plt.plot(df['NSW'].iloc[:, 0].values)
plt.show()
plt.plot(df['QLD'].iloc[:, 0].values)
plt.show()
plt.plot(df['SA'].iloc[:, 0].values)
plt.show()
plt.plot(df['TAS'].iloc[:, 0].values)
plt.show()
plt.plot(df['VIC'].iloc[:, 0].values)
plt.show()
The dataset is uploaded here. Any suggestions?
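One way to express a "set s goes to validation fold (s mod 5)" scheme for sklearn is PredefinedSplit. This is only a sketch assuming 312 consecutive, equally sized sets of 84 hours (168 half-hour readings each); the set boundaries would need adapting to the real 00:00/12:00 offsets:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Assumption: 312 consecutive, equally sized sets of 168 samples each
n_sets = 312
samples_per_set = 168
set_id = np.repeat(np.arange(1, n_sets + 1), samples_per_set)

# Set s is validated in fold (s - 1) % 5: fold 0 holds sets 1, 6, 11, ...,
# fold 1 holds sets 2, 7, 12, ..., and so on
test_fold = (set_id - 1) % 5

ps = PredefinedSplit(test_fold)
splits = list(ps.split())  # five (train_idx, val_idx) pairs
```

Each `(train_idx, val_idx)` pair can then be passed to `cross_val_score` via `cv=ps` or used directly to slice the DataFrame.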

Including ssl module in py2app
I am trying to add the ssl module in my plugin when using py2app. However, when the plugin is used I get the following error:
pymongo.errors.ConfigurationError: The ssl module is not available.
There are 2 parts that I don't understand. First, since the ssl module is included with the Python distribution, the module should be added automatically, which doesn't happen. Second, when I try to add ssl to the
install_requires
variable, the module isn't found. Why is this module so hard to include, and how can I include it?

IOError in Python while iterating over a query result
I've got an issue with this part of a Python script. When I run it, it throws this error:
IOError: [Errno 0] Error
at a RANDOM ELEMENT of
dbCur
(the result of the query). Here is my code:
dbCur.execute("SELECT parent_name FROM crowd.cwd_membership WHERE child_name = '" + source + "'")
for group in dbCur:
    code = 999
    # Check on the displayed name
    if ('' in group):
        print "Aggiungo il gruppo: ", group[0], " all'utente di destinazione..."
        code = giveGroup(jira, target, group[0])
    if (code != 400 or code != 404):  # If successful
        print " ...Gruppo aggiunto con successo \n"
    if (code == 400 or code == 404):  # If user or group not found, send to manual review
        requireReview = True
    if (code == 999):  # If the name is not correct OR does not exist, do not start giveGroup
        print "...Nome gruppo non valido, "
        autogrant_crowd_clog.logger.info("Errore nel nome del gruppo: " + group[0] + " l'inserimento è stato saltato.")

All coefficients turn to zero in logistic regression using scikit-learn
I am working on logistic regression using scikit-learn in Python. I have the data file that can be downloaded via the following link.
Below is my code for the machine learning part.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd

scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome', axis=1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train, y_train)
print("MC learning completed")
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
print(lasso.coef_)
When I print the coefficients, they all turn out to be zero. Can anyone advise me on that?
Let me explain a little bit about my objective. The problem seems to be a classification problem, as we can only see 0 or 1 in Ytrain and Ytest. To put it simply, 0 can be considered a miss and 1 a score. What I am trying to do is compute the scoring probability for each event when a shot takes place.
Thanks in advance.
Regards,
Zep
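Since the target is binary, LogisticRegression with an L1 penalty is probably closer to what is wanted here: Lasso is a linear *regression* estimator, and with alpha=0.3 on a 0/1 target it can easily shrink every coefficient to zero. A minimal sketch with synthetic stand-in data (the real inputs would come from data.csv):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv with a 0/1 outcome
X, y = make_classification(n_samples=500, n_features=10, random_state=33)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# L1-penalized logistic regression: C is the INVERSE regularization strength,
# so a small C zeroes coefficients the way a large Lasso alpha does
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # scoring probability per event
```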

RandomForestClassifier instance not fitted yet. Call 'fit' with appropriate arguments before using this method
I am trying to train a decision tree model, save it, and then reload it when I need it later. However, I keep getting the following error:
This DecisionTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Here is my code:
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.20, random_state=4)

names = ["Decision Tree", "Random Forest", "Neural Net"]
classifiers = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    MLPClassifier()
]

score = 0
for name, clf in zip(names, classifiers):
    if name == "Decision Tree":
        clf = DecisionTreeClassifier(random_state=0)
        grid_search = GridSearchCV(clf, param_grid=param_grid_DT)
        grid_search.fit(X_train, y_train_TF)
        if grid_search.best_score_ > score:
            score = grid_search.best_score_
            best_clf = clf
    elif name == "Random Forest":
        clf = RandomForestClassifier(random_state=0)
        grid_search = GridSearchCV(clf, param_grid_RF)
        grid_search.fit(X_train, y_train_TF)
        if grid_search.best_score_ > score:
            score = grid_search.best_score_
            best_clf = clf
    elif name == "Neural Net":
        clf = MLPClassifier()
        clf.fit(X_train, y_train_TF)
        y_pred = clf.predict(X_test)
        current_score = accuracy_score(y_test_TF, y_pred)
        if current_score > score:
            score = current_score
            best_clf = clf

pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(best_clf, file)

from sklearn.externals import joblib
# Save to file in the current working directory
joblib_file = "joblib_model.pkl"
joblib.dump(best_clf, joblib_file)

print("best classifier: ", best_clf, " Accuracy= ", score)
Here is how I load the model and test it:
# First method
with open(pkl_filename, 'rb') as h:
    loaded_model = pickle.load(h)

# Second method
joblib_model = joblib.load(joblib_file)
As you can see, I have tried two ways of saving it, but neither has worked.
Here is how I tested:
print(loaded_model.predict(test))
print(joblib_model.predict(test))
You can clearly see that the models are actually fitted, and if I try any other model, such as SVM or logistic regression, the method works just fine.
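For what it's worth, the likely cause in code like the above is that GridSearchCV fits a *clone* of the estimator it is given, so `best_clf = clf` stores the never-fitted original; saving `grid_search.best_estimator_` instead avoids the "not fitted" error. A minimal sketch on stand-in data:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid={'max_depth': [2, 4, 8]})
grid_search.fit(X_train, y_train)

# best_estimator_ is the refitted winning clone; the estimator object
# passed to GridSearchCV is never fitted itself
best_clf = grid_search.best_estimator_

with open('pickle_model.pkl', 'wb') as f:
    pickle.dump(best_clf, f)
with open('pickle_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
preds = loaded_model.predict(X_test)
```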

Prune unnecessary leaves in sklearn DecisionTreeClassifier
I use sklearn.tree.DecisionTreeClassifier to build a decision tree. With the optimal parameter settings, I get a tree that has unnecessary leaves (see the example picture below; since I do not need probabilities, the leaf nodes marked in red are an unnecessary split).
Is there any third-party library for pruning these unnecessary nodes? Or a code snippet? I could write one, but I can't really imagine that I am the first person with this problem...
Code to replicate:
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
mdl = DecisionTreeClassifier(max_leaf_nodes=8)
mdl.fit(X, y)
PS: I have tried multiple keyword searches and am kind of surprised to find nothing; is there really no post-pruning in general in sklearn?
PPS: In response to the possible duplicate: while the suggested question might help me when coding the pruning algorithm myself, it answers a different question. I want to get rid of leaves that do not change the final decision, while the other question wants a minimum threshold for splitting nodes.
PPPS: The tree shown is an example to show my problem. I am aware that the parameter settings used to create the tree are suboptimal. I am not asking about optimizing this specific tree; I need to do post-pruning to get rid of leaves that might be helpful if one needs class probabilities, but are not helpful if one is only interested in the most likely class.
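One way to do this kind of post-pruning is to collapse sibling leaves that predict the same majority class, directly in the fitted tree's arrays. This touches sklearn's internal Tree representation (children_left/children_right use -1 for "no child"), so treat it as a sketch rather than a supported API; predictions are unchanged because the parent's class counts are the sum of its children's:

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

TREE_LEAF = -1  # sentinel sklearn uses for "no child"

def prune_same_class(tree, node=0):
    """Collapse internal nodes whose two children are leaves with the
    same majority class; the predicted classes do not change."""
    left, right = tree.children_left[node], tree.children_right[node]
    if left == TREE_LEAF:
        return  # already a leaf
    prune_same_class(tree, left)
    prune_same_class(tree, right)
    if (tree.children_left[left] == TREE_LEAF
            and tree.children_left[right] == TREE_LEAF
            and np.argmax(tree.value[left]) == np.argmax(tree.value[right])):
        tree.children_left[node] = TREE_LEAF
        tree.children_right[node] = TREE_LEAF

iris = datasets.load_iris()
mdl = DecisionTreeClassifier(max_leaf_nodes=8).fit(iris.data, iris.target)
pred_before = mdl.predict(iris.data)
prune_same_class(mdl.tree_)
pred_after = mdl.predict(iris.data)
```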

Create loop/function to remove negative varImp results
I would like to create a loop which models the data, gets the variable importance, identifies the columns with negative importance, filters them from the data, and models it again until there are no negative values. Below is example code for creating the model and getting the variable importance:
library(party)
library(caret)

model_cforest <- cforest(drat ~ ., data = mtcars, controls = cforest_unbiased())
cforest_var <- varImp(model_cforest, conditional = TRUE)
As we can see, cforest_var gives us this table:
         Overall
mpg  0.009778909
cyl  0.033507134
disp 0.056359569
hp   0.000000000
wt   0.044186730
qsec 0.000000000
vs  -0.000309504
am   0.050791540
gear 0.060967894
carb 0.000000000
On the basis of this table I would then like to remove the column vs (which has a negative value) and run the
cforest
model again (and if there is again a negative value, remove it and run the model again, until there are no negative values). The final result should be a table with the most important variables.
Here is as far as i got:
removeNeg <- function(data){
  model_cforest <- cforest(drat ~ ., mtcars, controls = cforest_unbiased())
  cforest_var <- varImp(model_cforest, conditional = TRUE)
  varImp_neg <- row.names(cforest_var)[apply(cforest_var, 1, function(u) any(u < 0))]
}
but I have a feeling that this is the wrong direction, and I got stuck at this point. Thanks for any help!

Statsmodels intercept is different to seaborn lmplot intercept
What could explain the difference in intercepts between a statsmodels OLS regression and a seaborn lmplot?
My statsmodels code:
X = mmm_ma[['Xvalue']]
Y = mmm_ma['Yvalue']
model2 = sm.OLS(Y, sm.add_constant(X), data=mmm_ma)
model_fit = model2.fit()
model_fit.summary()
My seaborn lmplot code:
sns.lmplot(x='Xvalue', y='Yvalue', data=mmm_ma)
My statsmodels intercept is 28.9775 and my seaborn lmplot's intercept is around 45.5.
Questions:
- Should the intercepts be the same?
- What might explain why these are different? (Can I change some code to make them equal?)
- Is there a way to achieve a plot similar to seaborn lmplot but using the exact regression results, to ensure they align?
Thanks
[EDIT  19th July]
@Massoud thanks for posting that. I think I have realised what the problem is. My x-values range between 1400 and 2600 and my y-values range from 40 to 70. So seaborn lmplot just plots the regression over the data range, and the intercept read off the plot is the y-value at the lowest X in that range, which is about 46.
However, statsmodels OLS extends the line until X = 0, which is why I get an intercept of 28 or so.
So I guess the question is: is there a way to continue the trend line in seaborn all the way to x = 0?
I tried changing the axis but it doesn't seem to extend the line.
axes = lm.axes
axes[0,0].set_xlim(0,)

"TypeError: can't pickle NotImplementedType objects" in KerasRegression model
I'm creating a simple regression neural network in Keras. However, when I try to run it as follows
seed = 7
numpy.random.seed(seed)
dataset = numpy.loadtxt("instancesFipo.txt", delimiter=", ")
#testset = numpy.loadtxt("instancesFipo6.txt", delimiter=", ")
Xtrain = dataset[:, 0:8]  # All rows, first 8 columns
Ytrain = dataset[:, 8]    # All rows, 9th column

# Create layers of neural net
model = Sequential()
model.add(Dense(50, input_dim=8, kernel_initializer='normal', activation='relu'))
model.add(Dense(50, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))

# Create loss function and algorithm
model.compile(loss='mean_squared_error', optimizer='adam')

estimator = KerasRegressor(build_fn=model, epochs=100, batch_size=10, verbose=0)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, Xtrain, Ytrain, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
I'm getting "TypeError: can't pickle NotImplementedType objects", which is induced by the call to
cross_val_score
. Not sure what's going on. Any help would be appreciated, and thanks!

SciPy curve_fit returns weird fitted curve
I just tried to fit a curve to a bunch of points that look like a logistic function, and the result is a tangled curve.
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def logistic(v, m, n, a, t):
    return a * (1 + m * np.exp(v/t)) / (1 + n * np.exp(v/t))

def power_curve_fit(xvalues, yvalues):
    xdata = xvalues
    ydata = yvalues
    popt, pcov = curve_fit(logistic, xdata, ydata)
    pc = pd.DataFrame()
    pc['wind_speed'] = xdata
    pc['power_gen'] = ydata
    pc['Fit'] = logistic(xdata, *popt)
    plt.plot(xdata, logistic(xdata, *popt), 'red')
    plt.scatter(xdata, ydata, c='pink', marker='o')
    return pc
Thank you!
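Two usual suspects for a "tangled" fitted curve are plotting the line against unsorted x-values and letting curve_fit start from its default initial guess of all ones. A sketch of the second fix on hypothetical data roughly following the model (the p0 values here are assumed rough guesses, not tuned constants):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(v, m, n, a, t):
    return a * (1 + m * np.exp(v / t)) / (1 + n * np.exp(v / t))

# Hypothetical data roughly following the model
rng = np.random.default_rng(1)
v = np.linspace(0, 10, 50)
y = logistic(v, 0.1, 2.0, 5.0, 3.0) + rng.normal(0, 0.05, v.size)

# Without p0, curve_fit starts every parameter at 1.0, which can land in a
# poor local minimum; a rough initial guess usually untangles the fit.
# Plot the fitted line against np.sort(v) if the x-values are unsorted.
popt, pcov = curve_fit(logistic, v, y, p0=[0.5, 1.0, y.max(), 2.0], maxfev=10000)
mse = np.mean((logistic(v, *popt) - y) ** 2)
```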

CPU utilization while running scikit-learn logistic regression
While running a scikit-learn logistic regression, I only utilize ~14% of the computer's CPU, even when adding the n_jobs=1 parameter. Is there a way to increase the CPU/memory usage for a faster process?
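Worth noting: n_jobs=1 explicitly requests a single core; n_jobs=-1 requests all of them. Even then, for LogisticRegression the n_jobs setting only parallelizes work across classes in a one-vs-rest fit, so a single binary fit will mostly stay on one core regardless. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

# n_jobs=-1 uses all available cores; for LogisticRegression this helps
# only when there are multiple per-class fits to run in parallel
clf = LogisticRegression(n_jobs=-1, max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```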